LEARNING CURVE PREDICTION WITH BAYESIAN NEURAL NETWORKS: A Paper Walkthrough

2024-01-09 22:10


I searched online and found that nobody had posted notes on this paper, so treat this as a translation and a quick record of my reading. If anything is wrong, please message me, thank you!

LEARNING CURVE PREDICTION WITH BAYESIAN NEURAL NETWORKS

作者:Aaron Klein, Stefan Falkner, Jost Tobias Springenberg & Frank Hutter
Department of Computer Science

ABSTRACT

Different neural network architectures, hyperparameters and training protocols lead to different performances as a function of time. Human experts routinely inspect the resulting learning curves to quickly terminate runs with poor hyperparameter settings and thereby considerably speed up manual hyperparameter optimization. The same information can be exploited in automatic hyperparameter optimization by means of a probabilistic model of learning curves across hyperparameter settings. Here, we study the use of Bayesian neural networks for this purpose and improve their performance by a specialized learning curve layer.

Different network architectures, hyperparameters (think of settings such as kernel sizes), and training protocols lead to different performance over time. Experts therefore want a model that can quickly terminate runs whose results look poor and thereby speed up the tuning process. This paper studies a Bayesian neural network for that purpose and strengthens it with a specialized learning curve layer.

1 INTRODUCTION

Deep learning has celebrated many successes, but its performance relies crucially on good hyperparameter settings. Bayesian optimization (e.g., Brochu et al. (2010); Snoek et al. (2012); Shahriari et al. (2016)) is a powerful method for optimizing the hyperparameters of deep neural networks (DNNs). However, its traditional treatment of DNN performance as a black box poses fundamental limitations for large and computationally expensive data sets, for which training a single model can take weeks. Human experts go beyond this blackbox notion in their manual tuning and exploit cheaper signals about which hyperparameter settings work well: they estimate overall performance based on runs using subsets of the data or initial short runs to weed out bad parameter settings; armed with these tricks, human experts can often outperform Bayesian optimization.

Deep learning depends heavily on hyperparameter settings, and Bayesian optimization is a powerful way to tune them. Treating the network as a black box, however, limits performance on large data sets and is extremely time-consuming. Experts therefore use subsets of the data, or the first short stretch of training, to judge how good the result under the current hyperparameters will be and to weed out bad settings quickly.

Recent extensions of Bayesian optimization and multi-armed bandits therefore also drop the limiting blackbox assumption and exploit the performance of short runs (Swersky et al., 2014; Domhan et al., 2015; Li et al., 2017), performance on small subsets of the data (Klein et al., 2017), and performance on other, related data sets (Swersky et al., 2013; Feurer et al., 2015).

The citations above are the papers and groups that have explored each of these directions.

While traditional solutions for scalable Bayesian optimization include approximate Gaussian process models (e.g., Hutter et al.; Swersky et al. (2014)) and random forests (Hutter et al., 2011), a recent trend is to exploit the flexible model class of neural networks for this purpose (Snoek et al., 2015; Springenberg et al., 2016). In this paper, we study this model class for the prediction of learning curves. Our contributions in this paper are:

Traditional scalable Bayesian optimization relies on approximate Gaussian processes or random forests; a more recent trend is to use neural networks instead.
This paper's exploration consists of the following parts:

  1. We study how well Bayesian neural networks can fit learning curves for various architectures and hyperparameter settings, and how reliable their uncertainty estimates are.
  2. Building on the parametric learning curve models of Domhan et al. (2015), we develop a specialized neural network architecture with a learning curve layer that improves learning curve predictions.
  3. We compare different ways to generate Bayesian neural networks: probabilistic back propagation (Hernández-Lobato and Adams, 2015) and two different stochastic gradient based Markov Chain Monte Carlo (MCMC) methods – stochastic gradient Langevin dynamics (SGLD (Welling and Teh, 2011)) and stochastic gradient Hamiltonian MCMC (SGHMC (Chen et al., 2014)) – for standard Bayesian neural networks and our specialized architecture and show that SGHMC yields better uncertainty estimates.
  4. We evaluate the predictive quality for both completely new learning curves and for extrapolating partially-observed curves, showing better performance than the parametric function approach by Domhan et al. (2015) at stages where learning curves have not yet converged.
  5. We extend the recent multi-armed bandit strategy Hyperband (Li et al., 2017) by sampling using our model rather than uniformly at random, thereby enabling it to approach near-optimal configurations faster than traditional Bayesian optimization.

In summary, these are:

  1. A study of how well Bayesian neural networks can predict learning curves across different network architectures and hyperparameter settings, and of how reliable their uncertainty estimates are.
  2. Building on the work of Domhan et al. (2015), a specialized learning curve layer added on top of the network to improve the accuracy of curve prediction.
  3. A comparison of several ways to generate Bayesian neural networks: probabilistic backpropagation and two stochastic-gradient MCMC methods (SGLD and SGHMC; see the English text above for the references), showing that SGHMC gives the best uncertainty estimates.
  4. An evaluation on both entirely new learning curves and the extrapolation of partially observed ones.
  5. An improvement to the multi-armed bandit method Hyperband: configurations are sampled from the model rather than uniformly at random (see the sketch below).
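
To make contribution 5 concrete, here is a toy sketch in Python of swapping Hyperband's uniform sampling for model-based sampling. Everything in it is hypothetical: predict_final_accuracy stands in for the paper's BNN prediction, and the 5-dimensional unit-cube search space is made up.

    # Toy sketch: model-based candidate selection in place of Hyperband's
    # uniform sampling. "predict_final_accuracy" is a hypothetical stand-in
    # for the paper's BNN predictive mean.
    import numpy as np

    def sample_configs(n_candidates, n_keep, predict_final_accuracy, rng):
        candidates = rng.random((n_candidates, 5))   # random configurations
        scores = np.array([predict_final_accuracy(c) for c in candidates])
        return candidates[np.argsort(scores)[-n_keep:]]  # keep the most promising

    rng = np.random.default_rng(0)
    # A dummy model: pretend configs with a large first coordinate do well.
    best = sample_configs(1000, 10, lambda c: c[0], rng)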

2 PROBABILISTIC PREDICTION OF LEARNING CURVES

In this section, we describe a general framework to model learning curves of iterative machine learning methods. We first describe the approach by Domhan et al. (2015) which we will dub LCExtrapolation from here on. Afterwards, we discuss a more general joint model across time steps and hyperparameter values that can exploit similarities between hyperparameter configurations and predict for unobserved learning curves. We also study the observation noise of different hyperparameter configurations and show how we can adapt our model to capture this noise.

This section describes a general framework for modelling the learning curves of iterative machine learning methods. It first covers the approach of:

T. Domhan, J. T. Springenberg, and F. Hutter. Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In Q. Yang and M. Wooldridge, editors, Proc. of IJCAI'15, 2015.

It then builds a more general joint model across time steps and hyperparameter values, and also studies the observation noise of different hyperparameter configurations so that the model can be adapted to capture it.

2.1 LEARNING CURVE PREDICTION WITH BASIS FUNCTION

An intuitive model extrapolates a learning curve y using a set of K different parametric functions \phi_k. Each \phi_k is a function of the time step t and a parameter vector \theta_k, and the functions are combined linearly through weights w_k, roughly:

    \hat{y}(t) = \sum_{k=1}^{K} w_k \phi_k(t, \theta_k)

The full likelihood and prior over the weights and parameters are laid out in the paper; I don't fully follow them and would rather not mislead you, so please read that part yourself.
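
As a concrete illustration (my own sketch, not the paper's code), here is how one of Domhan et al.'s basis function families, pow3 with f(t) = c - a * t^(-alpha), can be fit to a partially observed curve and extrapolated. The real LC-Extrapolation method combines around a dozen such families with MCMC-sampled weights, and the accuracy values below are invented.

    # Fit a single "pow3" basis function to a partial learning curve and
    # extrapolate it; c is the asymptotic accuracy, a and alpha the decay.
    import numpy as np
    from scipy.optimize import curve_fit

    def pow3(t, c, a, alpha):
        return c - a * np.power(t, -alpha)

    # Hypothetical validation accuracies observed for epochs 1..10.
    t_obs = np.arange(1, 11)
    y_obs = np.array([0.42, 0.55, 0.61, 0.65, 0.68,
                      0.70, 0.71, 0.72, 0.73, 0.735])

    params, _ = curve_fit(pow3, t_obs, y_obs, p0=(0.8, 0.5, 0.5), maxfev=5000)
    y_final = pow3(100, *params)  # extrapolate to epoch T = 100
    print(f"predicted asymptotic accuracy: {y_final:.3f}")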

For our experiments, we use the original implementation by Domhan et al. (2015) with one modification: the original code included a term in the likelihood that enforced the prediction at t = T to be strictly greater than the last value of that particular curve. This biases the estimation to never underestimate the accuracy at the asymptote. We found that in some of our benchmarks, this led to instabilities, especially with very noisy learning curves. Removing it cured that problem, and we did not observe any performance degradation on any of the other benchmarks.

The ability to include arbitrary parametric functions makes this model very flexible, and Domhan et al. (2015) used it successfully to terminate evaluations of poorly-performing hyperparameters early for various different architectures of neural networks (thereby speeding up Bayesian optimization by a factor of two). However, the model’s major disadvantage is that it does not use previously evaluated hyperparameters at all and therefore can only make useful predictions after observing a substantial initial fraction of the learning curve.

In this paper's experiments there is only one change to the original model: the original code forced the prediction at t = T to be strictly greater than the last observed value of the curve, biasing it to never underestimate the asymptotic accuracy. This caused instabilities, especially for learning curves with heavy noise, so the term was removed, with no performance loss on the other benchmarks.
Allowing arbitrary parametric functions makes the model very flexible, and it has already been used successfully to terminate poorly performing runs early. The model's biggest problem, however, is that it makes no use of previously evaluated hyperparameters, so it can only make useful predictions after observing a substantial initial fraction of the learning curve. (My understanding: the method cannot predict from prior results; it has to start training first. As an analogy, suppose earlier runs have taught me that adding two vectors whose elements are all positive gives a vector whose elements are all positive. This model would not reuse that knowledge; it would compute the first few elements, observe they are positive, and only then predict. The analogy is only about the form, so don't read anything more into it.)
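
Continuing the sketch above (again my own approximation, not the original code): Domhan et al. terminate a run when the predicted probability of beating the best accuracy seen so far drops below a small threshold. They obtain that probability from MCMC over the parametric model; a bootstrap over refits is used here as a crude stand-in.

    # Crude stand-in for Domhan et al.'s termination criterion: estimate the
    # probability that the extrapolated accuracy at epoch T beats the
    # incumbent, via bootstrap refits rather than the original MCMC.
    import numpy as np
    from scipy.optimize import curve_fit

    def pow3(t, c, a, alpha):        # same basis function as in the sketch above
        return c - a * np.power(t, -alpha)

    def prob_beats_best(t_obs, y_obs, best_so_far, T=100, n_boot=200, seed=0):
        rng = np.random.default_rng(seed)
        wins, fits = 0, 0
        for _ in range(n_boot):
            idx = rng.integers(0, len(t_obs), len(t_obs))  # bootstrap resample
            try:
                p, _ = curve_fit(pow3, t_obs[idx], y_obs[idx],
                                 p0=(0.8, 0.5, 0.5), maxfev=5000)
            except RuntimeError:
                continue                                   # skip failed fits
            wins += pow3(T, *p) > best_so_far
            fits += 1
        return wins / max(fits, 1)

    t_obs = np.arange(1, 11)
    y_obs = np.array([0.42, 0.55, 0.61, 0.65, 0.68,
                      0.70, 0.71, 0.72, 0.73, 0.735])
    if prob_beats_best(t_obs, y_obs, best_so_far=0.80) < 0.05:
        print("terminate this configuration early")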

2.2 LEARNING CURVE PREDICTION WITH BAYESIAN NEURAL NETWORKS

In practice, similar hyperparameter configurations often lead to similar learning curves, and modelling this dependence would allow predicting learning curves for new configurations without the need to observe their initial performance. Swersky et al. (2014) followed this approach based on an approximate Gaussian process model. Their Freeze-Thaw method showed promising results for finding good hyperparameters of iterative machine learning algorithms using learning curve prediction to allocate most resources for well-performing configurations during the optimization. The method introduces a special covariance function corresponding to exponentially decaying functions to model the learning curves. This results in an analytically tractable model, but using different functions to account for cases where the learning curves do not converge exponentially is not trivial.

In practice, similar hyperparameter configurations lead to similar learning curves, and modelling this dependence lets us predict the curve for a new configuration without observing its first few training points. Swersky et al. (2014) followed this idea with an approximate Gaussian process model. Their Freeze-Thaw method showed promising results for finding good hyperparameters of iterative machine learning algorithms, using learning curve prediction to allocate most of the resources to well-performing configurations during optimization. The method introduces a special covariance function corresponding to exponentially decaying functions to model the curves. That keeps the model analytically tractable, but swapping in other functions for curves that do not decay exponentially is not trivial.

[Equations from the paper omitted: they define the predicted accuracy as a function of the hyperparameter configuration and the time step.]

Likewise, the derivation here is fairly involved, so work through it in the paper at your own pace. In short, the goal is to predict how good a model will be from its configuration alone: x is a hyperparameter configuration, t is the time step, and g is a function that maps the output to a valid accuracy. The model is then fit and corrected based on observations and sampling.
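
The following is a minimal sketch of the general shape of such a model, not the paper's implementation: a network that maps a configuration x and a time step t to a predictive mean and variance. The paper draws weight samples with SGHMC; a small deep ensemble stands in for posterior samples here, and all names and sizes are hypothetical.

    # A network over (x, t); an ensemble approximates posterior sampling.
    import torch
    import torch.nn as nn

    class CurveNet(nn.Module):
        def __init__(self, n_hypers, n_hidden=64):
            super().__init__()
            self.body = nn.Sequential(
                nn.Linear(n_hypers + 1, n_hidden), nn.Tanh(),
                nn.Linear(n_hidden, n_hidden), nn.Tanh(),
            )
            self.mean_head = nn.Linear(n_hidden, 1)
            self.logvar_head = nn.Linear(n_hidden, 1)  # heteroscedastic noise

        def forward(self, x, t):
            h = self.body(torch.cat([x, t], dim=-1))
            return self.mean_head(h), self.logvar_head(h)

    # Posterior predictive by averaging over "samples" (ensemble members).
    ensemble = [CurveNet(n_hypers=5) for _ in range(10)]
    x = torch.rand(1, 5)       # a new hyperparameter configuration
    t = torch.tensor([[0.3]])  # normalized time step
    means = torch.stack([m(x, t)[0] for m in ensemble])
    print(means.mean().item(), means.std().item())  # prediction, uncertainty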

2.3 HETEROSCEDASTIC NOISE OF HYPERPARAMETER CONFIGURATION

[Figure from the paper: the distribution of observation noise across configurations.]

The models discussed above all treat the noise as homoscedastic (i.e., spread evenly, I take it) across hyperparameter configurations. Here the authors take 40 configurations of fully connected neural networks and evaluate each with R = 10 pseudorandom seeds; the resulting distribution is shown in the figure above.

Maybe not surprisingly, the noise seems to correlate with the asymptotic performance of a configuration. The fact that the noise between different configurations varies on different orders of magnitudes suggests a heteroscedastic noise model to best describe this behavior. We incorporate this observation by making the noise dependent on the input data to allow to predict different noise levels for different hyperparameters. In principle, one could also model a t dependent noise, but we could not find the same trend across different datasets.
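
A standard way to train a model whose noise depends on the input, sketched under the assumption (as in the previous snippet) that the network outputs a per-point mean and log-variance, is the heteroscedastic Gaussian negative log-likelihood:

    # Heteroscedastic Gaussian NLL: the network predicts a mean and a
    # log-variance per input, so different hyperparameter configurations
    # can receive different noise levels.
    import torch

    def heteroscedastic_nll(mean, logvar, y):
        # 0.5 * [ log sigma^2 + (y - mu)^2 / sigma^2 ], batch-averaged
        return 0.5 * (logvar + (y - mean) ** 2 / logvar.exp()).mean()

    # Usage with the hypothetical CurveNet from the previous sketch:
    # mean, logvar = net(x, t)
    # loss = heteroscedastic_nll(mean, logvar, y_observed)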

2.4 NEW BASIS FUNCTION LAYER FOR LEARNING CURVE PREDICTION WITH BAYESIAN NEURAL NETWORKS

[Figure from the paper: the network architecture with the learning curve layer.]

This last part describes the effect of the new learning curve layer on the whole network. The Bayesian neural network is combined with the parametric basis functions, replacing the previous way of obtaining their parameters; the architecture is shown in the figure above.
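
Here is a rough sketch of the idea as I understand it: the body of the network produces a hidden representation from (x, t), and the final layer outputs the weights and parameters of a few parametric basis functions, whose weighted combination evaluated at t is the prediction. The two basis functions below are illustrative stand-ins; the paper uses a specific, larger set.

    # Sketch of a learning curve output layer: the network predicts basis
    # function parameters and mixing weights; the output is the weighted
    # combination of the basis functions evaluated at t.
    import torch
    import torch.nn as nn

    class LearningCurveLayer(nn.Module):
        def __init__(self, n_hidden):
            super().__init__()
            self.params = nn.Linear(n_hidden, 4)   # two parameters per basis fn
            self.weights = nn.Linear(n_hidden, 2)  # one mixing weight per basis fn

        def forward(self, h, t):
            # h: (batch, n_hidden) features from the body; t: (batch,) in (0, 1]
            a, alpha, b, beta = torch.exp(self.params(h)).unbind(-1)  # positive
            w = torch.softmax(self.weights(h), dim=-1)                # sum to 1
            phi1 = 1.0 - a * t.pow(-alpha)          # pow3-style power-law decay
            phi2 = 1.0 - b * torch.exp(-beta * t)   # exponential decay
            return w[..., 0] * phi1 + w[..., 1] * phi2

    # Usage on top of the hypothetical CurveNet body from the earlier sketch:
    # y_hat = LearningCurveLayer(64)(body_features, t_batch)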

3 EXPERIMENTS

For the experiments, read the original paper; it contains the specific details of the setup.



