Using a one dimensional parabolic model of the full-batch loss to
estimate learning rates during training
- URL: http://arxiv.org/abs/2108.13880v1
- Date: Tue, 31 Aug 2021 14:36:23 GMT
- Title: Using a one dimensional parabolic model of the full-batch loss to
estimate learning rates during training
- Authors: Maximus Mutschler and Andreas Zell
- Abstract summary: This work introduces a line-search method that approximates the full-batch loss with a parabola estimated over several mini-batches.
In the experiments conducted, our approach mostly outperforms SGD tuned with a piece-wise constant learning rate schedule.
- Score: 21.35522589789314
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: A fundamental challenge in Deep Learning is to find optimal step sizes for
stochastic gradient descent. In traditional optimization, line searches are a
commonly used method to determine step sizes. One problem in Deep Learning is
that finding appropriate step sizes on the full-batch loss is unfeasible
expensive. Therefore, classical line search approaches, designed for losses
without inherent noise, are usually not applicable. Recent empirical findings
suggest that the full-batch loss behaves locally parabolically in the direction
of noisy update step directions. Furthermore, the trend of the optimal update
step size is changing slowly. By exploiting these findings, this work
introduces a line-search method that approximates the full-batch loss with a
parabola estimated over several mini-batches. Learning rates are derived from
such parabolas during training. In the experiments conducted, our approach
mostly outperforms SGD tuned with a piece-wise constant learning rate schedule
and other line search approaches for Deep Learning across models, datasets, and
batch sizes on validation and test accuracy.
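For intuition, the following is a minimal sketch of the idea (not the authors' implementation): losses are sampled at a few step sizes along the normalized negative gradient direction, averaged over several mini-batches, a parabola is fitted by least squares, and the learning rate is read off the parabola's vertex. All function and variable names below are hypothetical.

```python
import numpy as np

def parabolic_step_size(loss_fn, params, direction, step_candidates, batches):
    """Hypothetical helper: fit l(t) ~ a*t^2 + b*t + c to mini-batch losses
    measured at several step sizes t along `direction`, then return the
    vertex -b/(2a) of the fitted parabola as the update step size."""
    ts, losses = [], []
    for t in step_candidates:
        for batch in batches:               # several mini-batches approximate the full-batch loss
            ts.append(t)
            losses.append(loss_fn(params + t * direction, batch))
    a, b, _ = np.polyfit(ts, losses, deg=2) # least-squares parabola fit
    if a <= 0:                              # no convex parabola along this line: fall back
        return step_candidates[-1]
    return -b / (2.0 * a)                   # minimum of the fitted parabola

# Toy usage on a noisy 1-D quadratic "full-batch" loss.
rng = np.random.default_rng(0)
params, direction = np.array([3.0]), np.array([-1.0])        # unit negative gradient direction
loss_fn = lambda p, batch: float((p[0] - 1.0) ** 2 + 0.01 * batch)
batches = rng.standard_normal(8)                              # stand-ins for sampled mini-batches
t_star = parabolic_step_size(loss_fn, params, direction, [0.0, 0.5, 1.0], batches)
params = params + t_star * direction                          # step to the estimated minimum
```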
Related papers
- Enhancing Consistency and Mitigating Bias: A Data Replay Approach for
Incremental Learning [100.7407460674153]
Deep learning systems are prone to catastrophic forgetting when learning from a sequence of tasks.
To mitigate the problem, a line of methods proposes to replay data from previously learned tasks when learning new ones.
However, storing such data is often impractical due to memory constraints or data privacy issues.
As a replacement, data-free replay methods synthesize samples by inverting the classification model.
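A minimal sketch of the model-inversion idea, assuming a frozen PyTorch classifier; the placeholder `model`, labels, and hyperparameters are illustrative, not the cited method's actual procedure:

```python
import torch
import torch.nn.functional as F

def invert_samples(model, target_labels, input_shape, steps=200, lr=0.1):
    """Synthesize replay inputs from a frozen classifier: start from noise and
    optimize the inputs so the model assigns them the stored class labels."""
    model.eval()
    x = torch.randn(len(target_labels), *input_shape, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        F.cross_entropy(model(x), target_labels).backward()
        opt.step()
    return x.detach()                          # "replayed" samples for the next task

# Toy usage with a tiny classifier over 16-dimensional inputs and 3 classes.
model = torch.nn.Linear(16, 3)
labels = torch.tensor([0, 1, 2])
fake_replay = invert_samples(model, labels, input_shape=(16,))
```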
arXiv Detail & Related papers (2024-01-12T12:51:12Z)
- Fighting Uncertainty with Gradients: Offline Reinforcement Learning via
Diffusion Score Matching [22.461036967440723]
We study smoothed distance to data as an uncertainty metric, and claim that it has two beneficial properties.
We show these gradients can be efficiently learned with score-matching techniques.
We propose Score-Guided Planning (SGP) to enable first-order planning in high-dimensional problems.
arXiv Detail & Related papers (2023-06-24T23:40:58Z)
- Scaling Forward Gradient With Local Losses [117.22685584919756]
Forward learning is a biologically plausible alternative to backprop for learning deep neural networks.
We show that it is possible to substantially reduce the variance of the forward gradient by applying perturbations to activations rather than weights.
Our approach matches backprop on MNIST and CIFAR-10 and significantly outperforms previously proposed backprop-free algorithms on ImageNet.
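A rough sketch of an activation-perturbed forward gradient for a single output layer, using PyTorch 2.x forward-mode AD (`torch.func.jvp`); this illustrates the general idea only and omits the paper's local-loss setup:

```python
import torch

def forward_grad_wrt_activations(h, W_out, targets):
    """Estimate d(loss)/d(h) without backprop: sample a random tangent v,
    compute the directional derivative of the loss along v with forward-mode
    AD, and return (dloss/dh . v) * v as an unbiased gradient estimate."""
    def head_loss(h_):
        return ((h_ @ W_out - targets) ** 2).mean()    # simple squared-error head
    v = torch.randn_like(h)                            # perturb activations, not weights
    _, dir_deriv = torch.func.jvp(head_loss, (h,), (v,))
    return dir_deriv * v

# Toy usage: batch of 4 activations of width 32, 10 outputs.
h = torch.randn(4, 32)
W_out = torch.randn(32, 10)
targets = torch.randn(4, 10)
g_hat = forward_grad_wrt_activations(h, W_out, targets)   # same shape as h
```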
arXiv Detail & Related papers (2022-10-07T03:52:27Z)
- One-Pass Learning via Bridging Orthogonal Gradient Descent and Recursive
Least-Squares [8.443742714362521]
We develop an algorithm for one-pass learning which seeks to perfectly fit every new datapoint while changing the parameters in a direction that causes the least change to the predictions on previous datapoints.
Our algorithm uses memory efficiently by exploiting the structure of the streaming data via an incremental principal component analysis (IPCA).
Our experiments show the effectiveness of the proposed method compared to the baselines.
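As a simplified illustration of the idea for a linear model (the paper's algorithm uses incremental PCA and is more general; the basis tracking below is a hypothetical stand-in):

```python
import numpy as np

class OnePassLinear:
    """Fit each new datapoint exactly while moving the weights only within the
    subspace orthogonal to previously seen inputs, so predictions on earlier
    datapoints change as little as possible."""
    def __init__(self, dim):
        self.w = np.zeros(dim)
        self.basis = []                                # orthonormal basis of past inputs

    def update(self, x, y):
        u = x.astype(float).copy()
        for b in self.basis:                           # remove directions already seen
            u -= (u @ b) * b
        if np.linalg.norm(u) < 1e-12:                  # x lies in the old subspace: skip
            return
        residual = y - self.w @ x
        self.w += (residual / (u @ x)) * u             # exact fit of (x, y)
        self.basis.append(u / np.linalg.norm(u))

# Toy usage: a short stream of (x, y) pairs in 4-D.
model = OnePassLinear(dim=4)
rng = np.random.default_rng(0)
for _ in range(3):
    x, y = rng.standard_normal(4), rng.standard_normal()
    model.update(x, y)
    assert abs(model.w @ x - y) < 1e-8                 # the newest point is fit exactly
```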
arXiv Detail & Related papers (2022-07-28T02:01:31Z)
- Accelerating Deep Learning with Dynamic Data Pruning [0.0]
Deep learning has become prohibitively costly, requiring access to powerful computing systems to train state-of-the-art networks.
Previous work, such as forget scores and GraNd/EL2N scores, identifies important samples within a full dataset and prunes the remaining samples, thereby reducing the number of iterations per epoch.
We propose two algorithms, based on reinforcement learning techniques, to dynamically prune samples and achieve even higher accuracy than the random dynamic method.
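A minimal sketch of the score-based pruning baseline described above (not the proposed reinforcement-learning method); the scores and keep fraction are placeholders:

```python
import numpy as np

def select_training_subset(per_sample_scores, keep_fraction=0.5):
    """Keep only the highest-scoring samples (e.g. forget or EL2N-like
    scores) for the next epoch and skip the rest, reducing iterations."""
    n_keep = max(1, int(len(per_sample_scores) * keep_fraction))
    ranked = np.argsort(per_sample_scores)[::-1]      # most "important" first
    return ranked[:n_keep]

# Toy usage: 10 samples with random importance scores, keep the top 30%.
scores = np.random.default_rng(1).random(10)
subset = select_training_subset(scores, keep_fraction=0.3)
```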
arXiv Detail & Related papers (2021-11-24T16:47:34Z)
- Simple Stochastic and Online Gradient Descent Algorithms for Pairwise
Learning [65.54757265434465]
Pairwise learning refers to learning tasks where the loss function depends on a pair of instances.
Online gradient descent (OGD) is a popular approach to handle streaming data in pairwise learning.
In this paper, we propose simple stochastic and online gradient descent methods for pairwise learning.
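For illustration, a minimal sketch of online gradient descent on a pairwise hinge loss with a linear scorer; this is a generic example, not necessarily the paper's algorithm:

```python
import numpy as np

def ogd_pairwise_hinge(stream, dim, lr=0.1):
    """Online gradient descent with a linear scorer w: each incoming example
    is paired with the previous one, and the pairwise hinge loss
    max(0, 1 - (y - y')/2 * w.(x - x')) drives a single gradient step."""
    w, prev = np.zeros(dim), None
    for x, y in stream:                                # labels y in {-1, +1}
        if prev is not None:
            xp, yp = prev
            if y != yp:                                # only opposite-label pairs matter here
                diff = (y - yp) / 2.0
                if diff * (w @ (x - xp)) < 1.0:        # hinge is active
                    w += lr * diff * (x - xp)          # step along the negative gradient
        prev = (x, y)
    return w

# Toy usage: a small stream of labeled points in 2-D.
rng = np.random.default_rng(0)
stream = [(rng.standard_normal(2) + y, y) for y in rng.choice([-1, 1], size=50)]
w = ogd_pairwise_hinge(stream, dim=2)
```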
arXiv Detail & Related papers (2021-11-23T18:10:48Z)
- Attentional-Biased Stochastic Gradient Descent [74.49926199036481]
We present a provable method (named ABSGD) for addressing the data imbalance or label noise problem in deep learning.
Our method is a simple modification to momentum SGD where we assign an individual importance weight to each sample in the mini-batch.
ABSGD is flexible enough to combine with other robust losses without any additional cost.
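A hedged sketch of the per-sample weighting idea inside one optimizer step, assuming a PyTorch model and momentum-SGD optimizer; the softmax-of-losses weighting below is a heuristic and may differ in detail from ABSGD:

```python
import torch
import torch.nn.functional as F

def weighted_momentum_sgd_step(model, optimizer, x, y, lam=1.0):
    """Assign each sample in the mini-batch an individual importance weight
    (softmax of its own loss, temperature lam) and take one optimizer step
    on the weighted loss."""
    optimizer.zero_grad()
    per_sample_loss = F.cross_entropy(model(x), y, reduction="none")
    weights = torch.softmax(per_sample_loss.detach() / lam, dim=0)   # individual importance weights
    (weights * per_sample_loss).sum().backward()
    optimizer.step()
```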
arXiv Detail & Related papers (2020-12-13T03:41:52Z)
- Low-Rank Robust Online Distance/Similarity Learning based on the
Rescaled Hinge Loss [0.34376560669160383]
Existing online methods usually assume that training triplets or pairwise constraints exist in advance.
We formulate the online Distance-Similarity learning problem with the robust Rescaled hinge loss function.
The proposed model is rather general and can be applied to any PA-based online Distance-Similarity algorithm.
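A small sketch of one common form of the rescaled hinge loss, which bounds the penalty of large margin violations; the exact parameterization used in the paper may differ:

```python
import numpy as np

def rescaled_hinge(margin, eta=0.5):
    """Bounded, outlier-robust variant of the hinge loss: the usual hinge
    value is passed through 1 - exp(-eta * .), so large violations saturate
    instead of growing linearly."""
    hinge = np.maximum(0.0, 1.0 - margin)
    beta = 1.0 / (1.0 - np.exp(-eta))       # normalized so the loss equals 1 at margin 0
    return beta * (1.0 - np.exp(-eta * hinge))

# Large violations are bounded, so outliers cannot dominate the objective.
print(rescaled_hinge(np.array([2.0, 0.0, -10.0])))   # approx [0.0, 1.0, 2.53]
```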
arXiv Detail & Related papers (2020-10-07T08:38:34Z)
- A straightforward line search approach on the expected empirical loss
for stochastic deep learning problems [20.262526694346104]
It is too costly to search for good step sizes on the expected empirical loss due to noisy losses in deep learning.
This work shows that, for common deep learning tasks, the expected empirical loss can be approximated on vertical cross sections at comparatively low cost.
arXiv Detail & Related papers (2020-10-02T11:04:02Z)
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
- Regularizing Meta-Learning via Gradient Dropout [102.29924160341572]
Meta-learning models are prone to overfitting when there are not enough training tasks for the meta-learners to generalize.
We introduce a simple yet effective method to alleviate the risk of overfitting for gradient-based meta-learning.
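A minimal sketch of the gradient-dropout idea, randomly zeroing entries of inner-loop gradients; this is a simplified illustration, not the paper's exact scheme:

```python
import torch

def gradient_dropout(grads, drop_prob=0.2):
    """Randomly zero entries of the (inner-loop) gradients so that the
    meta-learner does not overfit to any single adaptation direction."""
    keep = 1.0 - drop_prob
    return [g * torch.bernoulli(torch.full_like(g, keep)) for g in grads]

# Toy usage on two gradient tensors.
grads = [torch.randn(3, 3), torch.randn(5)]
dropped = gradient_dropout(grads, drop_prob=0.5)    # roughly half the entries zeroed
```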
arXiv Detail & Related papers (2020-04-13T10:47:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.