Implicit biases in multitask and continual learning from a backward
error analysis perspective
- URL: http://arxiv.org/abs/2311.00235v1
- Date: Wed, 1 Nov 2023 02:37:32 GMT
- Title: Implicit biases in multitask and continual learning from a backward
error analysis perspective
- Authors: Benoit Dherin
- Abstract summary: We compute implicit training biases in multitask and continual learning settings for neural networks trained with stochastic gradient descent.
We derive modified losses that are implicitly minimized during training.
- Score: 5.710971447109951
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Using backward error analysis, we compute implicit training biases in
multitask and continual learning settings for neural networks trained with
stochastic gradient descent. In particular, we derive modified losses that are
implicitly minimized during training. These losses have three terms: the
original loss, which accounts for convergence; an implicit flatness
regularization term proportional to the learning rate; and a last term, the
conflict term, which can theoretically be detrimental to both convergence and
implicit regularization. In the multitask setting, the conflict term is a
well-known quantity measuring the gradient alignment between the tasks, while
in continual learning the conflict term is a new quantity in deep learning
optimization, although a basic tool in differential geometry: the Lie bracket
between the task gradients.
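The two conflict terms can be illustrated on a toy problem. The sketch below is our own (not code from the paper) and uses two quadratic task losses L_i(w) = 0.5 wᵀA_i w, for which the gradient is A_i w and the Hessian is A_i. The multitask conflict term is the inner product of the task gradients, while the continual-learning conflict term, the Lie bracket of the two gradient vector fields, reduces for quadratics to (A2 A1 - A1 A2) w:

```python
import numpy as np

def grad(A, w):
    """Gradient of the quadratic loss L(w) = 0.5 * w @ A @ w."""
    return A @ w

def multitask_conflict(A1, A2, w):
    """Gradient alignment between the two tasks (inner product)."""
    return grad(A1, w) @ grad(A2, w)

def lie_bracket(A1, A2, w):
    """Lie bracket [grad L1, grad L2](w) = H2 @ grad L1 - H1 @ grad L2,
    which for quadratic losses equals (A2 @ A1 - A1 @ A2) @ w."""
    return A2 @ grad(A1, w) - A1 @ grad(A2, w)

A1 = np.diag([1.0, 3.0])                  # diagonal Hessians commute
A2 = np.diag([2.0, 0.5])
A3 = np.array([[2.0, 1.0], [1.0, 2.0]])   # does not commute with A1
w = np.array([1.0, -1.0])

print(multitask_conflict(A1, A2, w))  # 3.5: aligned task gradients
print(lie_bracket(A1, A2, w))         # zero: commuting Hessians, no conflict
print(lie_bracket(A1, A3, w))         # nonzero: order of the tasks matters
```

When the bracket vanishes, following the two task gradient fields in either order gives the same result to leading order, which is consistent with the abstract's reading of this term as a conflict between tasks in continual learning.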
Related papers
- A mean curvature flow arising in adversarial training [1.2289361708127877]
We connect adversarial training for binary classification to a geometric evolution equation for the decision boundary.
We prove that the scheme is monotone and consistent as the adversarial budget vanishes and the perimeter localizes.
This highlights that the efficacy of adversarial training may be due to locally minimizing the length of the decision boundary.
arXiv Detail & Related papers (2024-04-22T17:58:36Z)
- On Uniform Scalar Quantization for Learned Image Compression [17.24702997651976]
We find two factors crucial: the discrepancy between the surrogate and rounding, leading to train-test mismatch, and gradient estimation risk due to the surrogate.
Our analyses point to two subtle tricks: one is to set an appropriate lower bound for the variance of the estimated quantized latent distribution, which effectively reduces the train-test mismatch.
Our method with the tricks is verified to outperform the existing practices of quantization surrogates on a variety of representative image compression networks.
arXiv Detail & Related papers (2023-09-29T08:23:36Z)
- Online Regularized Learning Algorithm for Functional Data [2.5382095320488673]
This paper considers an online regularized learning algorithm in reproducing kernel Hilbert spaces.
It shows that convergence rates of both prediction error and estimation error with constant step-size are competitive with those in the literature.
arXiv Detail & Related papers (2022-11-24T11:56:10Z)
- What training reveals about neural network complexity [80.87515604428346]
This work explores the hypothesis that the complexity of the function a deep neural network (NN) is learning can be deduced by how fast its weights change during training.
Our results support the hypothesis that good training behavior can be a useful bias towards good generalization.
arXiv Detail & Related papers (2021-06-08T08:58:00Z)
- Contrastive learning of strong-mixing continuous-time stochastic processes [53.82893653745542]
Contrastive learning is a family of self-supervised methods where a model is trained to solve a classification task constructed from unlabeled data.
We show that a properly constructed contrastive learning task can be used to estimate the transition kernel for small-to-mid-range intervals in the diffusion case.
arXiv Detail & Related papers (2021-03-03T23:06:47Z)
- Generalized Negative Correlation Learning for Deep Ensembling [7.569288952340753]
Ensemble algorithms offer state-of-the-art performance in many machine learning applications.
We formulate a generalized bias-variance decomposition for arbitrary twice differentiable loss functions.
We derive a Generalized Negative Correlation Learning algorithm which offers explicit control over the ensemble's diversity.
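The idea of explicit diversity control can be illustrated with a classic negative-correlation-style objective for squared loss. This is a sketch of the general idea under our own simplifying assumptions, not the paper's algorithm: the ensemble objective is the average member loss minus a tunable multiple of the ensemble's prediction diversity:

```python
import numpy as np

def ncl_loss(preds, target, lam):
    """Negative-correlation-style ensemble objective for squared loss.
    preds: (M, N) array of predictions of M members on N points.
    lam trades average member accuracy against ensemble diversity."""
    mean_pred = preds.mean(axis=0)
    member_loss = ((preds - target) ** 2).mean()       # average member error
    diversity = ((preds - mean_pred) ** 2).mean()      # spread around the mean
    return member_loss - lam * diversity

preds = np.array([[1.0, 2.0],
                  [3.0, 0.0]])       # two members, two data points
target = np.array([2.0, 1.0])

print(ncl_loss(preds, target, lam=0.0))   # 1.0: plain average member loss
print(ncl_loss(preds, target, lam=0.5))   # 0.5: disagreement is rewarded
```

With lam = 0 the members are trained independently; increasing lam rewards members for disagreeing around the ensemble mean, which is the kind of explicit diversity control the abstract describes.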
arXiv Detail & Related papers (2020-11-05T16:29:22Z)
- MTAdam: Automatic Balancing of Multiple Training Loss Terms [95.99508450208813]
We generalize the Adam optimization algorithm to handle multiple loss terms.
We show that training with the new method leads to fast recovery from suboptimal initial loss weighting.
arXiv Detail & Related papers (2020-06-25T20:27:27Z)
- Optimization and Generalization of Regularization-Based Continual Learning: a Loss Approximation Viewpoint [35.5156045701898]
We provide a novel viewpoint of regularization-based continual learning by formulating it as a second-order Taylor approximation of the loss function of each task.
Based on this viewpoint, we study the optimization aspects (i.e., convergence) as well as generalization properties (i.e., finite-sample guarantees) of regularization-based continual learning.
arXiv Detail & Related papers (2020-06-19T06:08:40Z)
- Total Deep Variation: A Stable Regularizer for Inverse Problems [71.90933869570914]
We introduce the data-driven general-purpose total deep variation regularizer.
In its core, a convolutional neural network extracts local features on multiple scales and in successive blocks.
We achieve state-of-the-art results for numerous imaging tasks.
arXiv Detail & Related papers (2020-06-15T21:54:15Z)
- Towards Certified Robustness of Distance Metric Learning [53.96113074344632]
We advocate imposing an adversarial margin in the input space so as to improve the generalization and robustness of metric learning algorithms.
We show that the enlarged margin is beneficial to the generalization ability by using the theoretical technique of algorithmic robustness.
arXiv Detail & Related papers (2020-06-10T16:51:53Z)
- Meta-learning with Stochastic Linear Bandits [120.43000970418939]
We consider a class of bandit algorithms that implement a regularized version of the well-known OFUL algorithm, where the regularization is a squared Euclidean distance to a bias vector.
We show both theoretically and experimentally, that when the number of tasks grows and the variance of the task-distribution is small, our strategies have a significant advantage over learning the tasks in isolation.
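The regularizer described above has a simple closed form in the least-squares setting. The sketch below is our own illustration of a squared-Euclidean-distance-to-bias regularizer (it shows the regression estimate, not the full OFUL bandit algorithm): minimizing ||X theta - y||^2 + lam * ||theta - h||^2 gives theta = (X^T X + lam I)^{-1} (X^T y + lam h), which shrinks the estimate toward the bias vector h instead of toward zero:

```python
import numpy as np

def biased_ridge(X, y, h, lam):
    """Regularized least squares with a squared Euclidean distance
    to a bias vector h:
        argmin_theta ||X @ theta - y||^2 + lam * ||theta - h||^2
    Closed form: (X^T X + lam I)^{-1} (X^T y + lam h)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y + lam * h)

X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
theta_true = np.array([2.0, -1.0])
y = X @ theta_true                   # noiseless observations
h = np.array([2.0, -1.0])            # a well-chosen bias vector

# With a bias at the true parameter, heavy regularization is harmless:
print(biased_ridge(X, y, h, lam=100.0))
```

When the tasks are drawn from a distribution with small variance, a bias vector learned across tasks sits near each task's true parameter, so heavy shrinkage toward it helps rather than hurts, which is the advantage over learning each task in isolation that the abstract describes.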
arXiv Detail & Related papers (2020-05-18T08:41:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.