Step-size Optimization for Continual Learning
- URL: http://arxiv.org/abs/2401.17401v1
- Date: Tue, 30 Jan 2024 19:35:43 GMT
- Title: Step-size Optimization for Continual Learning
- Authors: Thomas Degris, Khurram Javed, Arsalan Sharifnassab, Yuxin Liu, Richard
Sutton
- Abstract summary: In continual learning, a learner has to keep learning from the data over its whole life time.
In a neural network, this can be implemented by using a step-size vector to scale how much gradient samples change network weights.
Common algorithms, like RMSProp and Adam, use heuristics, specifically normalization, to adapt this step-size vector.
- Score: 5.834516080130717
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In continual learning, a learner has to keep learning from the data over its
whole life time. A key issue is to decide what knowledge to keep and what
knowledge to let go. In a neural network, this can be implemented by using a
step-size vector to scale how much gradient samples change network weights.
Common algorithms, like RMSProp and Adam, use heuristics, specifically
normalization, to adapt this step-size vector. In this paper, we show that
those heuristics ignore the effect of their adaptation on the overall objective
function, for example by moving the step-size vector away from better step-size
vectors. On the other hand, stochastic meta-gradient descent algorithms, like
IDBD (Sutton, 1992), explicitly optimize the step-size vector with respect to
the overall objective function. On simple problems, we show that IDBD is able
to consistently improve step-size vectors, where RMSProp and Adam do not. We
explain the differences between the two approaches and their respective
limitations. We conclude by suggesting that combining both approaches could be
a promising future direction to improve the performance of neural networks in
continual learning.
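For concreteness, here is a minimal sketch of an IDBD-style update for a single linear unit, using the exponential step-size parameterization of Sutton (1992); the meta step size `meta_lr` and the toy drifting-target setup are illustrative assumptions, not this paper's experimental configuration.

```python
import numpy as np

def idbd_update(w, beta, h, x, target, meta_lr=0.01):
    """One IDBD step for a single linear unit (a sketch after Sutton, 1992).

    Each weight w[i] carries its own log step size beta[i]; h[i] is a decaying
    trace of recent weight updates used as the meta-gradient signal.
    """
    delta = target - w @ x                        # prediction error
    beta += meta_lr * delta * x * h               # meta-gradient step on the log step sizes
    alpha = np.exp(beta)                          # per-weight step sizes
    w += alpha * delta * x                        # LMS step, scaled per weight
    h = h * np.clip(1.0 - alpha * x * x, 0.0, None) + alpha * delta * x
    return w, beta, h

# Toy usage: track a target whose first weight drifts over time.
rng = np.random.default_rng(0)
d = 5
w, beta, h = np.zeros(d), np.full(d, np.log(0.05)), np.zeros(d)
w_true = rng.normal(size=d)
for t in range(10_000):
    if t % 1_000 == 0:
        w_true[0] = rng.normal()                  # only the first component is non-stationary
    x = rng.normal(size=d)
    w, beta, h = idbd_update(w, beta, h, x, w_true @ x)
print(np.exp(beta))                               # IDBD typically raises the step size of the drifting weight
```

The point of the sketch is that the per-weight step sizes are themselves moved by a gradient of the prediction error, rather than by a normalization heuristic.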
Related papers
- Unified Gradient-Based Machine Unlearning with Remain Geometry Enhancement [29.675650285351768]
Machine unlearning (MU) has emerged to enhance the privacy and trustworthiness of deep neural networks.
Approximate MU is a practical method for large-scale models.
We propose a fast-slow parameter update strategy to implicitly approximate the up-to-date salient unlearning direction.
arXiv Detail & Related papers (2024-09-29T15:17:33Z)
- Knowledge Composition using Task Vectors with Learned Anisotropic Scaling [51.4661186662329]
We introduce aTLAS, an algorithm that linearly combines parameter blocks with different learned coefficients, resulting in anisotropic scaling at the task vector level.
We show that such linear combinations explicitly exploit the low intrinsic dimensionality of pre-trained models, with only a few coefficients being the learnable parameters.
We demonstrate the effectiveness of our method in task arithmetic, few-shot recognition and test-time adaptation, with supervised or unsupervised objectives.
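As a rough illustration of the linear combination described above, the following sketch composes a base checkpoint with task vectors using per-block learnable coefficients; the block granularity, names, and shapes are assumptions for illustration and not the aTLAS implementation.

```python
import torch

def compose(base, task_vectors, coeffs):
    """Add a learned, per-block linear combination of task vectors to a base model.

    base: dict name -> tensor (pre-trained weights)
    task_vectors: list of dicts with the same keys (fine-tuned weights minus base)
    coeffs: tensor of shape (num_task_vectors, num_blocks), the only learnable parameters
    """
    out = {}
    for b, name in enumerate(base):
        delta = sum(coeffs[t, b] * tv[name] for t, tv in enumerate(task_vectors))
        out[name] = base[name] + delta
    return out

# Toy example: two parameter blocks, two task vectors.
base = {"layer1.weight": torch.zeros(3), "layer2.weight": torch.zeros(2)}
tvs = [{"layer1.weight": torch.ones(3), "layer2.weight": torch.ones(2)},
       {"layer1.weight": -torch.ones(3), "layer2.weight": 2 * torch.ones(2)}]
coeffs = torch.nn.Parameter(torch.zeros(2, 2))   # anisotropic: one coefficient per (task, block)
composed = compose(base, tvs, coeffs)            # differentiable with respect to coeffs
```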
arXiv Detail & Related papers (2024-07-03T07:54:08Z)
- How Two-Layer Neural Networks Learn, One (Giant) Step at a Time [24.773974771715956]
We investigate theoretically how the features of a two-layer neural network adapt to the structure of the target function.
We compare the influence of batch size and that of multiple (but finitely many) steps.
We show that a batch size of $n = \mathcal{O}(d)$ is indeed enough to learn multiple target directions satisfying a staircase property.
arXiv Detail & Related papers (2023-05-29T17:43:44Z)
- Scaling Forward Gradient With Local Losses [117.22685584919756]
Forward learning is a biologically plausible alternative to backprop for learning deep neural networks.
We show that it is possible to substantially reduce the variance of the forward gradient by applying perturbations to activations rather than weights.
Our approach matches backprop on MNIST and CIFAR-10 and significantly outperforms previously proposed backprop-free algorithms on ImageNet.
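A minimal sketch of the activity-perturbation idea: estimate the gradient of the loss with respect to the activations from a random tangent and its directional derivative, then map it back to the weights. The single linear layer, squared-error loss, and closed-form directional derivative (standing in for forward-mode AD) are simplifying assumptions, not the paper's local-losses setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 8, 4
W = rng.normal(scale=0.1, size=(d_out, d_in))
x = rng.normal(size=d_in)
target = rng.normal(size=d_out)

# Forward pass: activations of the (single) layer and a squared-error loss.
a = W @ x                                   # activations
true_grad_a = a - target                    # dL/da for L = 0.5 * ||a - target||^2

# Activity-perturbation forward gradient: sample a random tangent in activation
# space, measure the directional derivative (a JVP in general; closed-form here),
# and scale the tangent by it to get an unbiased estimate of dL/da.
estimates = []
for _ in range(20_000):
    u = rng.normal(size=d_out)              # perturb activations, not weights
    directional = true_grad_a @ u           # directional derivative of the loss
    g_a_hat = directional * u               # unbiased since E[u u^T] = I
    estimates.append(np.outer(g_a_hat, x))  # chain rule back to the weights

# Approaches zero as the number of samples grows: the estimator is unbiased.
print(np.abs(np.mean(estimates, axis=0) - np.outer(true_grad_a, x)).max())
```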
arXiv Detail & Related papers (2022-10-07T03:52:27Z)
- Simple Stochastic and Online Gradient Descent Algorithms for Pairwise Learning [65.54757265434465]
Pairwise learning refers to learning tasks where the loss function depends on a pair of instances.
Online gradient descent (OGD) is a popular approach to handle streaming data in pairwise learning.
In this paper, we propose simple stochastic and online gradient descent methods for pairwise learning.
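As a generic sketch (not the specific algorithms of this paper), online pairwise learning can be implemented by pairing each arriving example with a buffered earlier example and taking a gradient step on a pairwise loss; the hinge loss, buffer policy, and learning rate below are illustrative assumptions.

```python
import numpy as np

def pairwise_hinge_grad(w, x_new, y_new, x_old, y_old):
    """Gradient of a pairwise hinge loss for a linear scorer w (a sketch).

    The loss is defined on a *pair* of examples: it penalizes a positive
    example being scored less than 1 above a negative example.
    """
    if y_new == y_old:
        return np.zeros_like(w)              # this simple loss only uses mixed pairs
    diff = (x_new - x_old) if y_new > y_old else (x_old - x_new)
    margin = w @ diff
    return -diff if margin < 1.0 else np.zeros_like(w)

# Online gradient descent: pair each arriving example with one buffered example.
rng = np.random.default_rng(0)
w = np.zeros(5)
buffer = []                                  # previously seen (x, y) examples
lr = 0.1
for t in range(1000):
    x = rng.normal(size=5)
    y = 1 if x[0] + 0.1 * rng.normal() > 0 else -1
    if buffer:
        x_old, y_old = buffer[rng.integers(len(buffer))]
        w -= lr * pairwise_hinge_grad(w, x, y, x_old, y_old)
    buffer.append((x, y))
```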
arXiv Detail & Related papers (2021-11-23T18:10:48Z)
- Tom: Leveraging trend of the observed gradients for faster convergence [0.0]
Tom is a novel variant of Adam that takes into account the trend observed for the gradients in the loss landscape traversed by the neural network.
Tom outperforms Adagrad, Adadelta, RMSProp and Adam in terms of accuracy and also converges faster.
arXiv Detail & Related papers (2021-09-07T20:19:40Z)
- Recomposing the Reinforcement Learning Building Blocks with Hypernetworks [19.523737925041278]
We show that a primary network determines the weights of a conditional dynamic network.
This approach improves the gradient approximation and reduces the learning step variance.
We demonstrate a consistent improvement across different locomotion tasks and different algorithms, both in RL (TD3 and SAC) and in Meta-RL (MAML and PEARL).
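A minimal sketch of the hypernetwork pattern described above: a primary network, conditioned on the state, emits the weights of a small dynamic network applied to the action. The layer sizes and the state/action factorization are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class HyperQ(nn.Module):
    """A primary network emits the weights of a small dynamic network (a sketch)."""

    def __init__(self, state_dim, action_dim, hidden=32):
        super().__init__()
        self.hidden, self.action_dim = hidden, action_dim
        self.n_w = hidden * action_dim + hidden      # weights + biases of the dynamic layer
        self.primary = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, self.n_w))
        self.head = nn.Linear(hidden, 1)

    def forward(self, state, action):
        params = self.primary(state)                                  # (batch, n_w)
        W = params[:, : self.hidden * self.action_dim].view(-1, self.hidden, self.action_dim)
        b = params[:, self.hidden * self.action_dim :]
        h = torch.relu(torch.bmm(W, action.unsqueeze(-1)).squeeze(-1) + b)
        return self.head(h)                                           # Q(s, a)

q = HyperQ(state_dim=8, action_dim=2)
print(q(torch.randn(4, 8), torch.randn(4, 2)).shape)                  # torch.Size([4, 1])
```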
arXiv Detail & Related papers (2021-06-12T19:43:12Z)
- Exploiting Adam-like Optimization Algorithms to Improve the Performance of Convolutional Neural Networks [82.61182037130405]
Stochastic gradient descent (SGD) is the main approach for training deep networks.
In this work, we compare Adam-based variants that differ in how they use the difference between the present and the past gradients.
We also test ensembles of networks and their fusion with a ResNet50 trained with SGD.
arXiv Detail & Related papers (2021-03-26T18:55:08Z)
- AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights [53.8489656709356]
Normalization techniques are a boon for modern deep learning.
It is often overlooked, however, that the additional introduction of momentum results in a rapid reduction in effective step sizes for scale-invariant weights.
In this paper, we verify that the widely adopted combination of the two ingredients leads to the premature decay of effective step sizes and sub-optimal model performance.
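The effective step-size argument can be illustrated with a small simulation: for scale-invariant weights the gradient is orthogonal to the weight vector, so every update grows its norm, and momentum accelerates that growth, shrinking the effective step size lr/||w||^2. This is a sketch of the phenomenon only, not of AdamP's remedy.

```python
import numpy as np

def simulate(momentum, steps=500, lr=0.1, d=50, seed=0):
    """Norm growth of a scale-invariant weight under SGD with momentum (a sketch).

    The gradient is projected to be orthogonal to w, as it is for weights whose
    scale is absorbed by a following normalization layer; each update therefore
    only inflates ||w||, and the effective step size on the direction of w
    decays like lr / ||w||^2.
    """
    rng = np.random.default_rng(seed)
    w = np.ones(d) / np.sqrt(d)
    v = np.zeros(d)
    for _ in range(steps):
        g = rng.normal(size=d)
        g -= (g @ w) / (w @ w) * w            # project: gradient orthogonal to w
        v = momentum * v + g
        w = w - lr * v
    return np.linalg.norm(w)

print(simulate(momentum=0.0))                  # slower norm growth
print(simulate(momentum=0.9))                  # momentum inflates ||w|| faster
```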
arXiv Detail & Related papers (2020-06-15T08:35:15Z)
- AdaS: Adaptive Scheduling of Stochastic Gradients [50.80697760166045]
We introduce the notions of "knowledge gain" and "mapping condition" and propose a new algorithm called Adaptive Scheduling (AdaS).
Experimentation reveals that, using the derived metrics, AdaS exhibits: (a) faster convergence and superior generalization over existing adaptive learning methods; and (b) lack of dependence on a validation set to determine when to stop training.
arXiv Detail & Related papers (2020-06-11T16:36:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.