pmf-automl Source Code Analysis

2024-04-20 23:38

arXiv paper (includes the appendix, but the print is small)
Probabilistic Matrix Factorization for Automated Machine Learning
NIPS 2018 paper (larger print, but no appendix)
Probabilistic Matrix Factorization for Automated Machine Learning
Code
https://github.com/rsheth80/pmf-automl

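Contents

A First Look at the Project Files
PMF Model Training
- Data Split
- Initializing the Latent Variables
- Model Definition and Training
- Defining the D Gaussian Processes
- Solving for the Posterior Covariance Matrix
  - The transform_forward and transform_backward functions
  - Top-level design of the get_cov function
  - The RBF kernel
  - The White kernel
  - Recap of computing the covariance matrix
- Meaning of the GP forward function's return value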

A First Look at the Project Files

Open all_normalized_accuracy_with_pipelineID.csv in JupyterLab.

all_normalized_accuracy_with_pipelineID.zip contains the performance observations from running 42K pipelines on 553 OpenML datasets. The task was classification and the performance metric was balanced accuracy. Unzip prior to running code.

Each row is a pipeline id, each column is a dataset id, and each entry is the balanced accuracy of that pipeline on that dataset.
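To make the layout concrete, here is a minimal sketch on a made-up 4 x 3 matrix (the values are invented, not taken from the CSV). It assumes, as the imputation step later in this post suggests, that unobserved pipeline/dataset pairs are stored as NaN.

```python
import numpy as np

# Toy stand-in for the real matrix: 4 pipelines (rows) x 3 datasets (columns).
# NaN marks a pipeline that was never evaluated on that dataset.
Y = np.array([
    [0.80,   np.nan, 0.55],
    [0.75,   0.60,   np.nan],
    [np.nan, 0.90,   0.40],
    [0.85,   0.70,   0.65],
])

# Fraction of pipelines with an observed score, per dataset (column).
observed_fraction = (~np.isnan(Y)).mean(axis=0)

# Row index of the best observed pipeline for each dataset, ignoring NaNs.
best_pipeline = np.nanargmax(Y, axis=0)

print(observed_fraction)  # [0.75 0.75 0.75]
print(best_pipeline)      # [3 2 3]
```

Reading along columns answers "which pipeline is best for this dataset?", which is the question the AutoML system ultimately has to answer for a new dataset.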

A quick look through pipelines.json suggests that essentially the only two preprocessors are pca and polynomial.

PMF Model Training

Data Split

Ytrain, Ytest, Ftrain, Ftest = get_data()

>>> Ytrain.shape
Out[2]: (42000, 464)
>>> Ytest.shape
Out[3]: (42000, 89)
>>> Ftrain.shape
Out[4]: (464, 46)
>>> Ftest.shape
Out[5]: (89, 46)

This is the train/test split: 89 datasets are held out as the test set and the remaining 464 are used for training.
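The internals of get_data are not shown here, so the following is only a sketch of what a dataset-wise (column-wise) split with these shapes could look like. The random permutation is an assumption for illustration, and the reading of F as one 46-dimensional feature row per dataset is inferred from the shapes above. The number of pipelines is shrunk to keep the demo light.

```python
import numpy as np

rng = np.random.default_rng(0)

n_pipelines, n_datasets, n_features = 100, 553, 46  # rows shrunk for the demo
Y = rng.random((n_pipelines, n_datasets))  # pipelines x datasets
F = rng.random((n_datasets, n_features))   # datasets x per-dataset features

# Hypothetical split: pick 89 dataset columns at random as the test set.
perm = rng.permutation(n_datasets)
test_ids, train_ids = perm[:89], perm[89:]

# Y is split along columns (datasets), F along rows (datasets).
Ytrain, Ytest = Y[:, train_ids], Y[:, test_ids]
Ftrain, Ftest = F[train_ids], F[test_ids]

print(Ytrain.shape, Ytest.shape, Ftrain.shape, Ftest.shape)
# (100, 464) (100, 89) (464, 46) (89, 46)
```

The point to notice is that every pipeline appears in both Ytrain and Ytest; only the datasets are disjoint.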

Initializing the Latent Variables

    imp = sklearn.impute.SimpleImputer(missing_values=np.nan, strategy='mean')
    X = sklearn.decomposition.PCA(Q).fit_transform(imp.fit(Ytrain).transform(Ytrain))

>>> X.shape
Out[7]: (42000, 20)

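As far as I understand it so far, the whole training process uses the GPs to learn the latent variables X, which are initialized with PCA.

The code above imputes the missing values in the training matrix and reduces it to 20 dimensions: for each of the 42K pipelines, the 464 training-dataset columns are compressed into 20 latent variables.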
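The same two steps can be tried on a small synthetic matrix. This is a sketch under assumed toy sizes (200 pipelines, 30 datasets, Q = 5, roughly 20% missing entries), not the project's data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(0)

# Toy stand-in for Ytrain: 200 pipelines x 30 datasets, ~20% entries missing.
Y = rng.random((200, 30))
Y[rng.random(Y.shape) < 0.2] = np.nan

Q = 5  # latent dimension for the demo (the post uses Q = 20)

# Replace each NaN with the mean of its column, i.e. the dataset's mean score.
imp = SimpleImputer(missing_values=np.nan, strategy="mean")
Y_filled = imp.fit_transform(Y)

# Project each pipeline (row) onto the top-Q principal components.
X = PCA(n_components=Q).fit_transform(Y_filled)

print(X.shape)  # (200, 5)
```

Mean imputation is only needed to make PCA applicable; the resulting X is a starting point that training then refines.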

From the paper: the elements of $Y$ are given by a nonlinear function of the latent variables, $y_{n,d}=f_d(x_n)+\epsilon$, where $\epsilon$ is independent Gaussian noise.

Here $Y$ is the full $42000\times464$ matrix, so $X$ holds the latent variables of the pipeline space. The latent dimension is $Q = 20$, so $X$ has shape $42000\times20$.
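To see what $y_{n,d}=f_d(x_n)+\epsilon$ means generatively, the sketch below samples a small $Y$ from that model with one GP draw per dataset. The sizes, the lengthscale, the noise level, and the use of a plain RBF covariance are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

N, D, Q = 50, 3, 2                 # pipelines, datasets, latent dims (toy sizes)
lengthscale, noise_std = 1.0, 0.1  # assumed hyperparameters

X = rng.standard_normal((N, Q))    # one latent vector x_n per pipeline

# RBF covariance between all pairs of latent vectors.
sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * sq_dist / lengthscale**2)

# Draw one function f_d per dataset from GP(0, K), then add Gaussian noise.
L = np.linalg.cholesky(K + 1e-6 * np.eye(N))  # jitter for numerical stability
F = L @ rng.standard_normal((N, D))           # F[n, d] = f_d(x_n)
Y = F + noise_std * rng.standard_normal((N, D))

print(Y.shape)  # (50, 3)
```

Training runs this picture in reverse: $Y$ is observed (with gaps), and $X$ together with the kernel parameters is adjusted so that the observed columns become likely under the GPs.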

Model Definition and Training

The top-level definition of the model:

    kernel = kernels.Add(kernels.RBF(Q, lengthscale=None), kernels.White(Q))
    m = gplvm.GPLVM(Q, X, Ytrain, kernel, N_max=N_max, D_max=batch_size)
    optimizer = torch.optim.SGD(m.parameters(), lr=lr)
    m = train(m, optimizer, f_callback=f_callback, f_stop=f_stop)

f_callback and f_stop are both local functions:

    def f_callback(m, v, it, t):
        varn_list.append(transform_forward(m.variance).item())
        logpr_list.append(m().item()/m.D)
        if it == 1:
            t_list.append(t)
        else:
            t_list.append(t_list[-1] + t)
        if save_checkpoint and not (it % checkpoint_period):
            torch.save(m.state_dict(), fn_checkpoint + '_it%d.pt' % it)
        print('it=%d, f=%g, varn=%g, t: %g'
              % (it, logpr_list[-1], transform_forward(m.variance), t_list[-1]))

    def f_stop(m, v, it, t):
        if it >= maxiter-1:
            print('maxiter (%d) reached' % maxiter)
            return True
        return False

Now look at the training function train:

def train(m, optimizer, f_callback=None, f_stop=None):
    it = 0
    while True:
        try:
            t = time.time()
            optimizer.zero_grad()
            nll = m()
            nll.backward()
            optimizer.step()
            it += 1
            t = time.time() - t
            if f_callback is not None:
                f_callback(m, nll, it, t)
            # f_stop should not be a substantial portion of total iteration time
            if f_stop is not None and f_stop(m, nll, it, t):
                break
        except KeyboardInterrupt:
            break
    return m
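The only thing this loop asks of the model is that calling m() returns a scalar loss to backpropagate. The sketch below exercises the same loop with a hypothetical one-parameter ToyModel standing in for GPLVM; the loop is repeated (without the KeyboardInterrupt handling) so the example runs on its own, and the learning rate and maxiter are arbitrary.

```python
import time
import torch


def train(m, optimizer, f_callback=None, f_stop=None):
    # Same loop as above, minus the KeyboardInterrupt handling.
    it = 0
    while True:
        t = time.time()
        optimizer.zero_grad()
        nll = m()            # forward pass returns the loss directly
        nll.backward()
        optimizer.step()
        it += 1
        t = time.time() - t
        if f_callback is not None:
            f_callback(m, nll, it, t)
        if f_stop is not None and f_stop(m, nll, it, t):
            break
    return m


class ToyModel(torch.nn.Module):
    """Hypothetical stand-in for GPLVM: forward() returns a scalar loss."""

    def __init__(self):
        super().__init__()
        self.w = torch.nn.Parameter(torch.tensor(0.0))

    def forward(self):
        return (self.w - 3.0) ** 2


maxiter = 50
losses = []


def f_callback(m, v, it, t):
    losses.append(v.item())  # record the loss seen at this iteration


def f_stop(m, v, it, t):
    return it >= maxiter - 1


m = ToyModel()
m = train(m, torch.optim.SGD(m.parameters(), lr=0.1), f_callback, f_stop)

print(len(losses))  # 49
print(m.w.item())   # close to 3.0
```

Note that it is incremented before f_stop is called and f_stop fires at it >= maxiter-1, so the loop performs maxiter-1 parameter updates rather than maxiter.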