A straightforward line search approach on the expected empirical loss
for stochastic deep learning problems
- URL: http://arxiv.org/abs/2010.00921v1
- Date: Fri, 2 Oct 2020 11:04:02 GMT
- Title: A straightforward line search approach on the expected empirical loss
for stochastic deep learning problems
- Authors: Maximus Mutschler and Andreas Zell
- Abstract summary: In deep learning, searching for good step sizes on the expected empirical loss is too costly because the measured losses are noisy.
This work shows that the expected empirical loss can be approximated cheaply on vertical cross sections for common deep learning tasks.
- Score: 20.262526694346104
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A fundamental challenge in deep learning is that the optimal step
sizes for update steps of stochastic gradient descent are unknown. In
traditional optimization, line searches are used to determine good step sizes;
in deep learning, however, searching for good step sizes on the expected
empirical loss is too costly because the measured losses are noisy. This
empirical work shows that the expected empirical loss can be approximated
cheaply on vertical cross sections for common deep learning tasks. This is
achieved by applying traditional one-dimensional function fitting to the
measured noisy losses of such cross sections. The step to a minimum of the
resulting approximation is then used as the step size for the optimization.
This approach leads to a robust and straightforward optimization method which
performs well across datasets and architectures without the need for
hyperparameter tuning.
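A minimal sketch of the idea described in the abstract, not the authors' implementation: measure noisy mini-batch losses at several step-size candidates along the update direction (a vertical cross section of the loss landscape), fit a one-dimensional polynomial to the measurements, and take the step to the minimum of the fit as the step size. The helper `loss_at` is a hypothetical placeholder for code that evaluates a trial step.

```python
import numpy as np

def line_search_step_size(loss_at, s_max=1.0, n_samples=10, degree=2):
    """Estimate a step size by fitting a 1-D polynomial to noisy losses
    measured along the current update direction.

    loss_at(s) is assumed to return a (noisy) mini-batch loss after a
    trial step of length s along the negative gradient direction.
    """
    # Measure noisy losses at several step-size candidates on the cross section.
    steps = np.linspace(0.0, s_max, n_samples)
    losses = np.array([loss_at(s) for s in steps])

    # Fit a low-degree polynomial (e.g. a parabola) to the noisy measurements.
    coeffs = np.polyfit(steps, losses, deg=degree)
    poly = np.poly1d(coeffs)

    # Take the step to the minimum of the fitted approximation,
    # restricted to the sampled interval for robustness.
    candidates = np.linspace(0.0, s_max, 1000)
    return candidates[np.argmin(poly(candidates))]

# Toy usage: a noisy parabola with its minimum at s = 0.3.
rng = np.random.default_rng(0)
noisy_loss = lambda s: (s - 0.3) ** 2 + 0.01 * rng.normal()
print(line_search_step_size(noisy_loss))  # ~0.3
```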
Related papers
- Training-set-free two-stage deep learning for spectroscopic data
de-noising [0.0]
De-noising is a prominent step in the spectra post-processing procedure.
Previous machine learning-based methods are fast but mostly based on supervised learning.
Unsupervised learning-based algorithms are slow and require a training set that is typically expensive to obtain in real experimental measurements.
arXiv Detail & Related papers (2024-02-29T03:31:41Z) - Scaling Forward Gradient With Local Losses [117.22685584919756]
Forward learning is a biologically plausible alternative to backprop for learning deep neural networks.
We show that it is possible to substantially reduce the variance of the forward gradient by applying perturbations to activations rather than weights (a sketch of the basic estimator appears after this list).
Our approach matches backprop on MNIST and CIFAR-10 and significantly outperforms previously proposed backprop-free algorithms on ImageNet.
arXiv Detail & Related papers (2022-10-07T03:52:27Z) - BOME! Bilevel Optimization Made Easy: A Simple First-Order Approach [46.457298683984924]
Bilevel optimization (BO) is useful for solving a variety of important machine learning problems.
Conventional methods need to differentiate through the low-level optimization process with implicit differentiation.
First-order BO depends only on first-order information and requires no implicit differentiation.
arXiv Detail & Related papers (2022-09-19T01:51:12Z) - Simple Stochastic and Online Gradient Descent Algorithms for Pairwise
Learning [65.54757265434465]
Pairwise learning refers to learning tasks where the loss function depends on a pair of instances.
Online gradient descent (OGD) is a popular approach to handle streaming data in pairwise learning.
In this paper, we propose simple stochastic and online gradient descent methods for pairwise learning (a minimal pairwise-update sketch appears after this list).
arXiv Detail & Related papers (2021-11-23T18:10:48Z) - Using a one dimensional parabolic model of the full-batch loss to
estimate learning rates during training [21.35522589789314]
This work introduces a line-search method that approximates the full-batch loss with a parabola estimated over several mini-batches.
In the experiments conducted, our approach mostly outperforms SGD tuned with a piece-wise constant learning rate schedule.
arXiv Detail & Related papers (2021-08-31T14:36:23Z) - Differentiable Annealed Importance Sampling and the Perils of Gradient
Noise [68.44523807580438]
Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation.
Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective.
We propose a differentiable algorithm by abandoning Metropolis-Hastings steps, which further unlocks mini-batch computation.
arXiv Detail & Related papers (2021-07-21T17:10:14Z) - Deep learning: a statistical viewpoint [120.94133818355645]
Deep learning has revealed some major surprises from a theoretical perspective.
In particular, simple gradient methods easily find near-optimal solutions to non-convex optimization problems.
We conjecture that specific principles underlie these phenomena.
arXiv Detail & Related papers (2021-03-16T16:26:36Z) - Low-Rank Robust Online Distance/Similarity Learning based on the
Rescaled Hinge Loss [0.34376560669160383]
Existing online methods usually assume that training triplets or pairwise constraints exist in advance.
We formulate the online Distance-Similarity learning problem with the robust Rescaled hinge loss function.
The proposed model is rather general and can be applied to any PA-based online Distance-Similarity algorithm.
arXiv Detail & Related papers (2020-10-07T08:38:34Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z) - Disentangling Adaptive Gradient Methods from Learning Rates [65.0397050979662]
We take a deeper look at how adaptive gradient methods interact with the learning rate schedule.
We introduce a "grafting" experiment which decouples an update's magnitude from its direction (a sketch appears after this list).
We present some empirical and theoretical retrospectives on the generalization of adaptive gradient methods.
arXiv Detail & Related papers (2020-02-26T21:42:49Z)
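Sketch referenced from "Scaling Forward Gradient With Local Losses": the basic forward-gradient estimator with a random weight-space direction, shown only for intuition. The paper's contributions (perturbing activations instead of weights and using local losses to reduce variance) are not reproduced here, and the analytic gradient below stands in for a forward-mode directional derivative.

```python
import numpy as np

def forward_gradient(grad_f, w, rng):
    """Basic forward-gradient estimator (weight perturbation).

    Samples a random direction v and returns (grad_f(w) . v) * v,
    an unbiased estimate of the gradient that only needs a
    directional derivative (computable with forward-mode autodiff).
    The analytic gradient is used here purely for this toy example.
    """
    v = rng.normal(size=w.shape)
    directional_derivative = np.dot(grad_f(w), v)
    return directional_derivative * v

# Toy quadratic f(w) = ||w||^2 with gradient 2w.
rng = np.random.default_rng(0)
w = np.array([1.0, -2.0, 0.5])
estimates = [forward_gradient(lambda w: 2 * w, w, rng) for _ in range(10000)]
print(np.mean(estimates, axis=0))  # close to the true gradient 2w
```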
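Sketch referenced from "Simple Stochastic and Online Gradient Descent Algorithms for Pairwise Learning": a generic pairwise hinge ranking loss on a linear scorer with plain online gradient steps, intended only to show what a loss on a pair of instances looks like; it is not the algorithms analyzed in that paper.

```python
import numpy as np

def pairwise_hinge_grad(w, x_pos, x_neg, margin=1.0):
    """Gradient of a hinge ranking loss on a pair of instances:
    loss = max(0, margin - (w.x_pos - w.x_neg))."""
    if margin - (w @ x_pos - w @ x_neg) > 0:
        return -(x_pos - x_neg)
    return np.zeros_like(w)

# Online-style updates: draw a pair per step and take a gradient step.
rng = np.random.default_rng(0)
w = np.zeros(5)
lr = 0.1
for _ in range(1000):
    x_pos = rng.normal(size=5) + 1.0   # instances that should score higher
    x_neg = rng.normal(size=5) - 1.0   # instances that should score lower
    w -= lr * pairwise_hinge_grad(w, x_pos, x_neg)
print(w)  # weights that rank the positive group above the negative group
```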
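Sketch referenced from "Disentangling Adaptive Gradient Methods from Learning Rates": the grafting recipe as commonly described, taking the step magnitude from one optimizer and the step direction from another. The paper's exact protocol (e.g., per-layer versus global norms, which optimizers are combined) may differ.

```python
import numpy as np

def graft(step_magnitude_from, step_direction_from, eps=1e-12):
    """Combine two optimizer updates: the norm of one, the direction
    of the other (a sketch of the 'grafting' recipe)."""
    direction = step_direction_from / (np.linalg.norm(step_direction_from) + eps)
    return np.linalg.norm(step_magnitude_from) * direction

# Example: magnitude from a plain SGD step, direction from an Adam-like step.
sgd_step = np.array([0.10, -0.20, 0.05])
adam_step = np.array([0.01, -0.03, 0.02])
print(graft(sgd_step, adam_step))
```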
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.