A Stochastic Bundle Method for Interpolating Networks
- URL: http://arxiv.org/abs/2201.12678v1
- Date: Sat, 29 Jan 2022 23:02:30 GMT
- Title: A Stochastic Bundle Method for Interpolating Networks
- Authors: Alasdair Paren, Leonard Berrada, Rudra P. K. Poudel, M. Pawan Kumar
- Abstract summary: We propose a novel method for training deep neural networks that are capable of driving the empirical loss to zero.
At each iteration, our method constructs a bundle: a pointwise maximum of linear approximations of the learning objective.
- Score: 18.313879914379008
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a novel method for training deep neural networks that are capable
of interpolation, that is, driving the empirical loss to zero. At each
iteration, our method constructs a stochastic approximation of the learning
objective. The approximation, known as a bundle, is a pointwise maximum of
linear functions. Our bundle contains a constant function that lower bounds the
empirical loss. This enables us to compute an automatic adaptive learning rate,
thereby providing an accurate solution. In addition, our bundle includes linear
approximations computed at the current iterate and other linear estimates of
the DNN parameters. The use of these additional approximations makes our method
significantly more robust to its hyperparameters. Based on its desirable
empirical properties, we term our method Bundle Optimisation for Robust and
Accurate Training (BORAT). In order to operationalise BORAT, we design a novel
algorithm for optimising the bundle approximation efficiently at each
iteration. We establish the theoretical convergence of BORAT in both convex and
non-convex settings. Using standard publicly available data sets, we provide a
thorough comparison of BORAT to other single hyperparameter optimisation
algorithms. Our experiments demonstrate BORAT matches the state-of-the-art
generalisation performance for these methods and is the most robust.
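To make the bundle idea concrete, here is a minimal sketch (not the full BORAT algorithm, which maintains additional linear pieces and solves a small optimisation problem per iteration). It keeps only two bundle pieces: the linearization of the loss at the current iterate and the constant zero lower bound. Minimizing that bundle with a proximal term reduces to a gradient step whose learning rate is clipped by a Polyak-style adaptive value; all names and the toy problem are illustrative.

```python
import numpy as np

def bundle_step(w, loss, grad, max_lr=0.1):
    """One step of a minimal two-piece bundle: the pointwise maximum of the
    linearization of the loss at w and the constant zero lower bound.
    Minimizing it plus a proximal term gives a gradient step whose learning
    rate is clipped by an adaptive Polyak-style value."""
    g = grad(w)
    # adaptive learning rate: never step past the zero lower bound
    lr = min(max_lr, loss(w) / (g @ g + 1e-12))
    return w - lr * g

# toy interpolating problem: least squares with an exact solution at [1, 3]
A = np.array([[2.0, 0.0], [0.0, 1.0]])
b = np.array([2.0, 3.0])
loss = lambda w: 0.5 * np.sum((A @ w - b) ** 2)
grad = lambda w: A.T @ (A @ w - b)

w = np.zeros(2)
for _ in range(200):
    w = bundle_step(w, loss, grad)
```

Because the problem interpolates (the loss can reach zero), the zero lower bound is tight and the adaptive rate drives the loss all the way down without tuning.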
Related papers
- Iterative Methods for Full-Scale Gaussian Process Approximations for Large Spatial Data [9.913418444556486]
We show how iterative methods can be used to reduce the computational costs for calculating likelihoods, gradients, and predictive distributions with FSAs.
We also present a novel, accurate, and fast way to calculate predictive variances relying on estimations and iterative methods.
All methods are implemented in a free C++ software library with high-level Python and R packages.
arXiv Detail & Related papers (2024-05-23T12:25:22Z) - Stable Nonconvex-Nonconcave Training via Linear Interpolation [51.668052890249726]
This paper presents a theoretical analysis of linear interpolation as a principled method for stabilizing (large-scale) neural network training.
We argue that instabilities in the optimization process are often caused by the nonmonotonicity of the loss landscape and show how linear interpolation can help by leveraging the theory of nonexpansive operators.
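A generic sketch of the interpolation idea (a Lookahead-style scheme, not necessarily the paper's exact algorithm): run a few base-optimizer steps, then move the iterate only part of the way toward the result. On the bilinear min-max toy problem f(x, y) = xy, plain simultaneous gradient descent-ascent spirals outward, while the interpolated iterates contract.

```python
def gda_step(x, y, eta=0.1):
    """Simultaneous gradient descent-ascent on f(x, y) = x * y (diverges)."""
    return x - eta * y, y + eta * x

def lookahead_gda(x, y, outer=100, k=10, lam=0.5):
    """Linearly interpolate toward the iterate produced by k base steps.
    The averaging damps the rotational dynamics that make plain GDA unstable."""
    for _ in range(outer):
        fx, fy = x, y
        for _ in range(k):
            fx, fy = gda_step(fx, fy)
        x, y = x + lam * (fx - x), y + lam * (fy - y)  # interpolate halfway
    return x, y

x, y = lookahead_gda(1.0, 1.0)  # converges toward the equilibrium (0, 0)
```

With these (illustrative) constants, each outer iteration contracts the iterate norm by roughly 0.9, so the interpolated dynamics converge even though the inner operator is expansive.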
arXiv Detail & Related papers (2023-10-20T12:45:12Z) - Equation Discovery with Bayesian Spike-and-Slab Priors and Efficient Kernels [57.46832672991433]
We propose a novel equation discovery method based on Kernel learning and BAyesian Spike-and-Slab priors (KBASS)
We use kernel regression to estimate the target function, which is flexible, expressive, and more robust to data sparsity and noises.
We develop an expectation-propagation expectation-maximization algorithm for efficient posterior inference and function estimation.
arXiv Detail & Related papers (2023-10-09T03:55:09Z) - Provably Efficient Bayesian Optimization with Unknown Gaussian Process Hyperparameter Estimation [44.53678257757108]
We propose a new BO method that can sub-linearly converge to the objective function's global optimum.
Our method uses a multi-armed bandit technique (EXP3) to add random data points to the BO process.
We demonstrate empirically that our method outperforms existing approaches on various synthetic and real-world problems.
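For reference, a self-contained sketch of the EXP3 bandit routine the method relies on (exponential weights with uniform exploration); the arms here are hypothetical stand-ins for candidate query points, and the surrounding Bayesian optimisation loop is omitted.

```python
import numpy as np

def exp3(rewards, T=2000, gamma=0.1, seed=0):
    """EXP3: exponential weights with uniform exploration for adversarial bandits."""
    rng = np.random.default_rng(seed)
    K = len(rewards)
    w = np.ones(K)
    picks = []
    for _ in range(T):
        p = (1 - gamma) * w / w.sum() + gamma / K    # mix in exploration
        arm = rng.choice(K, p=p)
        r = rewards[arm]()                            # observed reward in [0, 1]
        w[arm] *= np.exp(gamma * r / (p[arm] * K))    # importance-weighted update
        w /= w.sum()                                  # renormalize for stability
        picks.append(arm)
    return picks

# two hypothetical arms: the second pays more on average
picks = exp3([lambda: 0.2, lambda: 0.8])
```

The importance-weighted update keeps the expected weight growth unbiased even though only the pulled arm's reward is observed, so the better arm dominates after a few hundred rounds.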
arXiv Detail & Related papers (2023-06-12T03:35:45Z) - Sparse high-dimensional linear regression with a partitioned empirical
Bayes ECM algorithm [62.997667081978825]
We propose a computationally efficient and powerful Bayesian approach for sparse high-dimensional linear regression.
Minimal prior assumptions on the parameters are used through the use of plug-in empirical Bayes estimates.
The proposed approach is implemented in the R package probe.
arXiv Detail & Related papers (2022-09-16T19:15:50Z) - Linear Time Kernel Matrix Approximation via Hyperspherical Harmonics [3.24890820102255]
We propose a new technique for constructing low-rank approximations of matrices that arise in kernel methods for machine learning.
Our approach pairs a novel automatically constructed analytic expansion of the underlying kernel function with a data-dependent compression step to further optimize the approximation.
Experimental results show our approach compares favorably to the commonly used Nystrom method with respect to both accuracy for a given rank and computational time for a given accuracy across a variety of kernels, dimensions, and datasets.
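As a baseline for comparison, here is a minimal sketch of the Nystrom method referenced above: approximate the kernel matrix from a random subset of landmark columns, K ≈ C W⁺ Cᵀ. The RBF kernel, data, and parameters are illustrative.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=0.1):
    """Gaussian RBF kernel matrix exp(-gamma * ||x - y||^2)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def nystrom(X, r, gamma=0.1, seed=0):
    """Rank-r Nystrom approximation K ~= C @ pinv(W) @ C.T from r landmarks."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=r, replace=False)
    C = rbf_kernel(X, X[idx], gamma)   # n x r cross-kernel to the landmarks
    W = C[idx]                         # r x r kernel among the landmarks
    return C @ np.linalg.pinv(W) @ C.T

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
K = rbf_kernel(X, X)
K50 = nystrom(X, 50)
err = np.linalg.norm(K - K50) / np.linalg.norm(K)  # relative Frobenius error
```

For a smooth kernel like this one, the spectrum decays quickly and a modest rank already gives a small relative error, which is the regime where low-rank methods like the one above pay off.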
arXiv Detail & Related papers (2022-02-08T05:19:39Z) - Fast and Robust Online Inference with Stochastic Gradient Descent via
Random Scaling [0.9806910643086042]
We develop a new method of online inference for a vector of parameters estimated by the Polyak-Ruppert averaging procedure of stochastic gradient descent algorithms.
Our approach is fully operational with online data and is rigorously underpinned by a functional central limit theorem.
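The Polyak-Ruppert average that this inference procedure is built on is simply the running mean of the SGD iterates; a minimal sketch on an assumed toy quadratic objective (the inference machinery itself is not shown):

```python
import numpy as np

def sgd_polyak_ruppert(grad, w0, T=5000, lr0=0.5, seed=0):
    """SGD with decaying step sizes lr0 * t^(-0.6); returns the Polyak-Ruppert
    average of the iterates rather than the last iterate."""
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    avg = np.zeros_like(w)
    for t in range(1, T + 1):
        g = grad(w) + rng.normal(scale=0.5, size=w.shape)  # noisy gradient
        w = w - lr0 / t ** 0.6 * g    # step-size exponent in (1/2, 1)
        avg += (w - avg) / t          # running average of the iterates
    return avg

# illustrative quadratic objective with minimum at [1, -2]
target = np.array([1.0, -2.0])
grad = lambda w: w - target
w_bar = sgd_polyak_ruppert(grad, [0.0, 0.0])
```

Averaging washes out the gradient noise in the individual iterates, which is what makes asymptotically normal inference on the estimate tractable.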
arXiv Detail & Related papers (2021-06-06T15:38:37Z) - Minibatch and Momentum Model-based Methods for Stochastic Non-smooth
Non-convex Optimization [3.4809730725241597]
We make two important extensions to model-based methods.
First, we propose a new minibatch variant that takes a set of samples to approximate the model function in each iteration.
Second, motivated by the success of momentum techniques, we propose a new convex model-based method.
arXiv Detail & Related papers (2021-06-06T05:31:57Z) - Zeroth-Order Hybrid Gradient Descent: Towards A Principled Black-Box
Optimization Framework [100.36569795440889]
This work focuses on zeroth-order (ZO) optimization, which does not require first-order information.
We show that with a graceful design in coordinate importance sampling, the proposed ZO optimization method is efficient both in terms of complexity and function query cost.
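The core primitive of ZO optimization is estimating a gradient from function values alone. A minimal sketch using two-point random-direction finite differences (a standard estimator; the paper's coordinate importance sampling refinement is not shown, and the objective is illustrative):

```python
import numpy as np

def zo_gradient(f, x, rng, mu=1e-4, n_dirs=8):
    """Two-point zeroth-order gradient estimate: average of directional finite
    differences along random Gaussian directions, using only function-value
    queries (no first-order information)."""
    g = np.zeros_like(x)
    for _ in range(n_dirs):
        u = rng.normal(size=x.shape)
        g += (f(x + mu * u) - f(x - mu * u)) / (2 * mu) * u
    return g / n_dirs

# hypothetical smooth objective with minimum at the all-ones vector
f = lambda x: np.sum((x - 1.0) ** 2)
rng = np.random.default_rng(0)
x = np.zeros(3)
for _ in range(300):
    x = x - 0.05 * zo_gradient(f, x, rng)
```

Each estimate costs 2 * n_dirs function queries, which is exactly the query-cost trade-off these methods aim to minimize.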
arXiv Detail & Related papers (2020-12-21T17:29:58Z) - SLEIPNIR: Deterministic and Provably Accurate Feature Expansion for
Gaussian Process Regression with Derivatives [86.01677297601624]
We propose a novel approach for scaling GP regression with derivatives based on quadrature Fourier features.
We prove deterministic, non-asymptotic and exponentially fast decaying error bounds which apply for both the approximated kernel as well as the approximated posterior.
arXiv Detail & Related papers (2020-03-05T14:33:20Z) - Support recovery and sup-norm convergence rates for sparse pivotal
estimation [79.13844065776928]
In high dimensional sparse regression, pivotal estimators are estimators for which the optimal regularization parameter is independent of the noise level.
We show minimax sup-norm convergence rates for non smoothed and smoothed, single task and multitask square-root Lasso-type estimators.
arXiv Detail & Related papers (2020-01-15T16:11:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.