Using a one dimensional parabolic model of the full-batch loss to
estimate learning rates during training
- URL: http://arxiv.org/abs/2108.13880v1
- Date: Tue, 31 Aug 2021 14:36:23 GMT
- Title: Using a one dimensional parabolic model of the full-batch loss to
estimate learning rates during training
- Authors: Maximus Mutschler and Andreas Zell
- Abstract summary: This work introduces a line-search method that approximates the full-batch loss with a parabola estimated over several mini-batches.
In the experiments conducted, our approach mostly outperforms SGD tuned with a piece-wise constant learning rate schedule.
- Score: 21.35522589789314
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: A fundamental challenge in Deep Learning is to find optimal step sizes for
stochastic gradient descent. In traditional optimization, line searches are a
commonly used method to determine step sizes. One problem in Deep Learning is
that finding appropriate step sizes on the full-batch loss is unfeasible
expensive. Therefore, classical line search approaches, designed for losses
without inherent noise, are usually not applicable. Recent empirical findings
suggest that the full-batch loss behaves locally parabolically in the direction
of noisy update step directions. Furthermore, the trend of the optimal update
step size is changing slowly. By exploiting these findings, this work
introduces a line-search method that approximates the full-batch loss with a
parabola estimated over several mini-batches. Learning rates are derived from
such parabolas during training. In the experiments conducted, our approach
mostly outperforms SGD tuned with a piece-wise constant learning rate schedule
and other line search approaches for Deep Learning across models, datasets, and
batch sizes on validation and test accuracy.
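For intuition, the following is a minimal sketch of the idea (not the authors' implementation): losses are sampled at a few step sizes along the normalized negative gradient direction, averaged over several mini-batches, a parabola is fitted by least squares, and the learning rate is read off the parabola's vertex. All function and variable names below are hypothetical.

```python
import numpy as np

def parabolic_step_size(loss_fn, params, direction, step_candidates, batches):
    """Hypothetical helper: fit l(t) ~ a*t^2 + b*t + c to mini-batch losses
    measured at several step sizes t along `direction`, then return the
    vertex -b/(2a) of the fitted parabola as the update step size."""
    ts, losses = [], []
    for t in step_candidates:
        for batch in batches:               # several mini-batches approximate the full-batch loss
            ts.append(t)
            losses.append(loss_fn(params + t * direction, batch))
    a, b, _ = np.polyfit(ts, losses, deg=2) # least-squares parabola fit
    if a <= 0:                              # no convex parabola along this line: fall back
        return step_candidates[-1]
    return -b / (2.0 * a)                   # minimum of the fitted parabola

# Toy usage on a noisy 1-D quadratic "full-batch" loss.
rng = np.random.default_rng(0)
params, direction = np.array([3.0]), np.array([-1.0])        # unit negative gradient direction
loss_fn = lambda p, batch: float((p[0] - 1.0) ** 2 + 0.01 * batch)
batches = rng.standard_normal(8)                              # stand-ins for sampled mini-batches
t_star = parabolic_step_size(loss_fn, params, direction, [0.0, 0.5, 1.0], batches)
params = params + t_star * direction                          # step to the estimated minimum
```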
Related papers
- Enhancing Consistency and Mitigating Bias: A Data Replay Approach for
Incremental Learning [100.7407460674153]
Deep learning systems are prone to catastrophic forgetting when learning from a sequence of tasks.
To mitigate the problem, a line of methods proposes to replay data from previously learned tasks when learning new ones.
However, storing such data is often impractical due to memory constraints or data privacy issues.
As a replacement, data-free replay methods synthesize samples by inverting the classification model.
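A minimal sketch of the model-inversion idea, assuming a frozen PyTorch classifier; the placeholder `model`, labels, and hyperparameters are illustrative, not the cited method's actual procedure:

```python
import torch
import torch.nn.functional as F

def invert_samples(model, target_labels, input_shape, steps=200, lr=0.1):
    """Synthesize replay inputs from a frozen classifier: start from noise and
    optimize the inputs so the model assigns them the stored class labels."""
    model.eval()
    x = torch.randn(len(target_labels), *input_shape, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        F.cross_entropy(model(x), target_labels).backward()
        opt.step()
    return x.detach()                          # "replayed" samples for the next task

# Toy usage with a tiny classifier over 16-dimensional inputs and 3 classes.
model = torch.nn.Linear(16, 3)
labels = torch.tensor([0, 1, 2])
fake_replay = invert_samples(model, labels, input_shape=(16,))
```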
arXiv Detail & Related papers (2024-01-12T12:51:12Z)
- Fighting Uncertainty with Gradients: Offline Reinforcement Learning via
Diffusion Score Matching [22.461036967440723]
We study smoothed distance to data as an uncertainty metric, and claim that it has two beneficial properties.
We show these gradients can be efficiently learned with score-matching techniques.
We propose Score-Guided Planning (SGP) to enable first-order planning in high-dimensional problems.
arXiv Detail & Related papers (2023-06-24T23:40:58Z)
- Scaling Forward Gradient With Local Losses [117.22685584919756]
Forward learning is a biologically plausible alternative to backprop for learning deep neural networks.
We show that it is possible to substantially reduce the variance of the forward gradient by applying perturbations to activations rather than weights.
Our approach matches backprop on MNIST and CIFAR-10 and significantly outperforms previously proposed backprop-free algorithms on ImageNet.
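A rough sketch of an activation-perturbed forward gradient for a single output layer, using PyTorch 2.x forward-mode AD (`torch.func.jvp`); this illustrates the general idea only and omits the paper's local-loss setup:

```python
import torch

def forward_grad_wrt_activations(h, W_out, targets):
    """Estimate d(loss)/d(h) without backprop: sample a random tangent v,
    compute the directional derivative of the loss along v with forward-mode
    AD, and return (dloss/dh . v) * v as an unbiased gradient estimate."""
    def head_loss(h_):
        return ((h_ @ W_out - targets) ** 2).mean()    # simple squared-error head
    v = torch.randn_like(h)                            # perturb activations, not weights
    _, dir_deriv = torch.func.jvp(head_loss, (h,), (v,))
    return dir_deriv * v

# Toy usage: batch of 4 activations of width 32, 10 outputs.
h = torch.randn(4, 32)
W_out = torch.randn(32, 10)
targets = torch.randn(4, 10)
g_hat = forward_grad_wrt_activations(h, W_out, targets)   # same shape as h
```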
arXiv Detail & Related papers (2022-10-07T03:52:27Z)
- One-Pass Learning via Bridging Orthogonal Gradient Descent and Recursive
Least-Squares [8.443742714362521]
We develop an algorithm for one-pass learning which seeks to perfectly fit every new datapoint while changing the parameters in a direction that causes the least change to the predictions on previous datapoints.
Our algorithm uses memory efficiently by exploiting the structure of the streaming data via an incremental principal component analysis (IPCA).
Our experiments show the effectiveness of the proposed method compared to the baselines.
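As a simplified illustration of the idea for a linear model (the paper's algorithm uses incremental PCA and is more general; the basis tracking below is a hypothetical stand-in):

```python
import numpy as np

class OnePassLinear:
    """Fit each new datapoint exactly while moving the weights only within the
    subspace orthogonal to previously seen inputs, so predictions on earlier
    datapoints change as little as possible."""
    def __init__(self, dim):
        self.w = np.zeros(dim)
        self.basis = []                                # orthonormal basis of past inputs

    def update(self, x, y):
        u = x.astype(float).copy()
        for b in self.basis:                           # remove directions already seen
            u -= (u @ b) * b
        if np.linalg.norm(u) < 1e-12:                  # x lies in the old subspace: skip
            return
        residual = y - self.w @ x
        self.w += (residual / (u @ x)) * u             # exact fit of (x, y)
        self.basis.append(u / np.linalg.norm(u))

# Toy usage: a short stream of (x, y) pairs in 4-D.
model = OnePassLinear(dim=4)
rng = np.random.default_rng(0)
for _ in range(3):
    x, y = rng.standard_normal(4), rng.standard_normal()
    model.update(x, y)
    assert abs(model.w @ x - y) < 1e-8                 # the newest point is fit exactly
```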
arXiv Detail & Related papers (2022-07-28T02:01:31Z)
- Accelerating Deep Learning with Dynamic Data Pruning [0.0]
Deep learning has become prohibitively costly, requiring access to powerful computing systems to train state-of-the-art networks.
Previous work, such as forget scores and GraNd/EL2N scores, identifies important samples within a full dataset and prunes the remaining samples, thereby reducing the number of iterations per epoch.
We propose two algorithms, based on reinforcement learning techniques, to dynamically prune samples and achieve even higher accuracy than the random dynamic method.
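A minimal sketch of the score-based pruning baseline described above (not the proposed reinforcement-learning method); the scores and keep fraction are placeholders:

```python
import numpy as np

def select_training_subset(per_sample_scores, keep_fraction=0.5):
    """Keep only the highest-scoring samples (e.g. forget or EL2N-like
    scores) for the next epoch and skip the rest, reducing iterations."""
    n_keep = max(1, int(len(per_sample_scores) * keep_fraction))
    ranked = np.argsort(per_sample_scores)[::-1]      # most "important" first
    return ranked[:n_keep]

# Toy usage: 10 samples with random importance scores, keep the top 30%.
scores = np.random.default_rng(1).random(10)
subset = select_training_subset(scores, keep_fraction=0.3)
```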
arXiv Detail & Related papers (2021-11-24T16:47:34Z)
- Simple Stochastic and Online Gradient Descent Algorithms for Pairwise
Learning [65.54757265434465]
Pairwise learning refers to learning tasks where the loss function depends on a pair of instances.
Online gradient descent (OGD) is a popular approach to handle streaming data in pairwise learning.
In this paper, we propose simple stochastic and online gradient descent methods for pairwise learning.
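For illustration, a minimal sketch of online gradient descent on a pairwise hinge loss with a linear scorer; this is a generic example, not necessarily the paper's algorithm:

```python
import numpy as np

def ogd_pairwise_hinge(stream, dim, lr=0.1):
    """Online gradient descent with a linear scorer w: each incoming example
    is paired with the previous one, and the pairwise hinge loss
    max(0, 1 - (y - y')/2 * w.(x - x')) drives a single gradient step."""
    w, prev = np.zeros(dim), None
    for x, y in stream:                                # labels y in {-1, +1}
        if prev is not None:
            xp, yp = prev
            if y != yp:                                # only opposite-label pairs matter here
                diff = (y - yp) / 2.0
                if diff * (w @ (x - xp)) < 1.0:        # hinge is active
                    w += lr * diff * (x - xp)          # step along the negative gradient
        prev = (x, y)
    return w

# Toy usage: a small stream of labeled points in 2-D.
rng = np.random.default_rng(0)
stream = [(rng.standard_normal(2) + y, y) for y in rng.choice([-1, 1], size=50)]
w = ogd_pairwise_hinge(stream, dim=2)
```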
arXiv Detail & Related papers (2021-11-23T18:10:48Z)
- Attentional-Biased Stochastic Gradient Descent [74.49926199036481]
We present a provable method (named ABSGD) for addressing the data imbalance or label noise problem in deep learning.
Our method is a simple modification to momentum SGD where we assign an individual importance weight to each sample in the mini-batch.
ABSGD is flexible enough to combine with other robust losses without any additional cost.
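A hedged sketch of the per-sample weighting idea inside one optimizer step, assuming a PyTorch model and momentum-SGD optimizer; the softmax-of-losses weighting below is a heuristic and may differ in detail from ABSGD:

```python
import torch
import torch.nn.functional as F

def weighted_momentum_sgd_step(model, optimizer, x, y, lam=1.0):
    """Assign each sample in the mini-batch an individual importance weight
    (softmax of its own loss, temperature lam) and take one optimizer step
    on the weighted loss."""
    optimizer.zero_grad()
    per_sample_loss = F.cross_entropy(model(x), y, reduction="none")
    weights = torch.softmax(per_sample_loss.detach() / lam, dim=0)   # individual importance weights
    (weights * per_sample_loss).sum().backward()
    optimizer.step()
```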
arXiv Detail & Related papers (2020-12-13T03:41:52Z)
- Low-Rank Robust Online Distance/Similarity Learning based on the
Rescaled Hinge Loss [0.34376560669160383]
Existing online methods usually assume that training triplets or pairwise constraints exist in advance.
We formulate the online Distance-Similarity learning problem with the robust Rescaled hinge loss function.
The proposed model is rather general and can be applied to any PA-based online Distance-Similarity algorithm.
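A small sketch of one common form of the rescaled hinge loss, which bounds the penalty of large margin violations; the exact parameterization used in the paper may differ:

```python
import numpy as np

def rescaled_hinge(margin, eta=0.5):
    """Bounded, outlier-robust variant of the hinge loss: the usual hinge
    value is passed through 1 - exp(-eta * .), so large violations saturate
    instead of growing linearly."""
    hinge = np.maximum(0.0, 1.0 - margin)
    beta = 1.0 / (1.0 - np.exp(-eta))       # normalized so the loss equals 1 at margin 0
    return beta * (1.0 - np.exp(-eta * hinge))

# Large violations are bounded, so outliers cannot dominate the objective.
print(rescaled_hinge(np.array([2.0, 0.0, -10.0])))   # approx [0.0, 1.0, 2.53]
```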
arXiv Detail & Related papers (2020-10-07T08:38:34Z)
- A straightforward line search approach on the expected empirical loss
for stochastic deep learning problems [20.262526694346104]
It is too costly to search for good step sizes on the expected empirical loss due to noisy losses in deep learning.
This work shows that, for common deep learning tasks, the expected empirical loss can be approximated on vertical cross sections at comparatively low cost.
arXiv Detail & Related papers (2020-10-02T11:04:02Z)
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
- Regularizing Meta-Learning via Gradient Dropout [102.29924160341572]
Meta-learning models are prone to overfitting when there are not enough training tasks for the meta-learners to generalize.
We introduce a simple yet effective method to alleviate the risk of overfitting for gradient-based meta-learning.
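A minimal sketch of the gradient-dropout idea, randomly zeroing entries of inner-loop gradients; this is a simplified illustration, not the paper's exact scheme:

```python
import torch

def gradient_dropout(grads, drop_prob=0.2):
    """Randomly zero entries of the (inner-loop) gradients so that the
    meta-learner does not overfit to any single adaptation direction."""
    keep = 1.0 - drop_prob
    return [g * torch.bernoulli(torch.full_like(g, keep)) for g in grads]

# Toy usage on two gradient tensors.
grads = [torch.randn(3, 3), torch.randn(5)]
dropped = gradient_dropout(grads, drop_prob=0.5)    # roughly half the entries zeroed
```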
arXiv Detail & Related papers (2020-04-13T10:47:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.