MTAdam: Automatic Balancing of Multiple Training Loss Terms
- URL: http://arxiv.org/abs/2006.14683v1
- Date: Thu, 25 Jun 2020 20:27:27 GMT
- Title: MTAdam: Automatic Balancing of Multiple Training Loss Terms
- Authors: Itzik Malkiel, Lior Wolf
- Abstract summary: We generalize the Adam optimization algorithm to handle multiple loss terms.
We show that training with the new method leads to fast recovery from suboptimal initial loss weighting.
- Score: 95.99508450208813
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: When training neural models, it is common to combine multiple loss terms. The
balancing of these terms requires considerable human effort and is
computationally demanding. Moreover, the optimal trade-off between the loss
terms can change as training progresses, especially for adversarial terms. In
this work, we generalize the Adam optimization algorithm to handle multiple
loss terms. The guiding principle is that for every layer, the gradient
magnitude of the terms should be balanced. To this end, the Multi-Term Adam
(MTAdam) computes the derivative of each loss term separately, infers the first
and second moments per parameter and loss term, and calculates a first moment
for the magnitude per layer of the gradients arising from each loss. This
magnitude is used to continuously balance the gradients across all layers, in a
manner that both varies from one layer to the next and dynamically changes over
time. Our results show that training with the new method leads to fast recovery
from suboptimal initial loss weighting and to training outcomes that match
conventional training with the prescribed hyperparameters of each method.
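The procedure described in the abstract can be sketched as follows. This is a simplified NumPy reading of the method, not the authors' released implementation: `beta3` is an assumed decay rate for the per-layer magnitude EMA, and the first loss term is taken as the anchor that the other terms are scaled to match.

```python
import numpy as np

def mtadam_step(params, grads_per_term, state, lr=1e-3,
                beta1=0.9, beta2=0.999, beta3=0.9, eps=1e-8):
    """One MTAdam-style update (a simplified sketch, not the official code).

    params:         list of per-layer parameter arrays
    grads_per_term: list over loss terms; each entry is a list of
                    per-layer gradients with the same shapes as params
    state:          dict holding moment estimates, filled on the first call
    """
    n_terms = len(grads_per_term)
    n_layers = len(params)
    if not state:
        state.update(
            t=0,
            m=[[np.zeros_like(p) for p in params] for _ in range(n_terms)],
            v=[[np.zeros_like(p) for p in params] for _ in range(n_terms)],
            mag=[[1.0] * n_layers for _ in range(n_terms)],
        )
    state["t"] += 1
    t = state["t"]

    # First moment of the per-layer gradient magnitude, per loss term.
    for i, g_layers in enumerate(grads_per_term):
        for l, g in enumerate(g_layers):
            state["mag"][i][l] = (beta3 * state["mag"][i][l]
                                  + (1 - beta3) * np.linalg.norm(g))

    for l in range(n_layers):
        combined = np.zeros_like(params[l])
        ref = state["mag"][0][l]  # anchor magnitude: loss term 0
        for i, g_layers in enumerate(grads_per_term):
            # Rescale this term's gradient so its magnitude matches the anchor.
            g = g_layers[l] * (ref / (state["mag"][i][l] + eps))
            # Per-parameter, per-term first and second moments, as in Adam.
            state["m"][i][l] = beta1 * state["m"][i][l] + (1 - beta1) * g
            state["v"][i][l] = beta2 * state["v"][i][l] + (1 - beta2) * g * g
            m_hat = state["m"][i][l] / (1 - beta1 ** t)
            v_hat = state["v"][i][l] / (1 - beta2 ** t)
            combined += m_hat / (np.sqrt(v_hat) + eps)
        params[l] = params[l] - lr * combined
    return params
```

Note that the rescaling factor is recomputed every step from the magnitude EMAs, which is what lets the balancing vary per layer and drift over time as the abstract describes.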
Related papers
- Implicit biases in multitask and continual learning from a backward
error analysis perspective [5.710971447109951]
We compute implicit training biases in multitask and continual learning settings for neural networks trained with gradient descent.
We derive modified losses that are implicitly minimized during training.
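For plain gradient descent with step size $h$, backward error analysis of this kind typically yields a modified loss of the following form (a standard single-task result; the multitask and continual settings studied in the paper add further cross-task terms):

```latex
\tilde{E}(\theta) \;=\; E(\theta) \;+\; \frac{h}{4}\,\bigl\lVert \nabla E(\theta) \bigr\rVert^{2}
```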
arXiv Detail & Related papers (2023-11-01T02:37:32Z)
- Cut your Losses with Squentropy [19.924900110707284]
We propose the "squentropy" loss, which is the sum of two terms: the cross-entropy loss and the average square loss over the incorrect classes.
We show that the squentropy loss outperforms both the pure cross entropy and rescaled square losses in terms of the classification accuracy.
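Based on that description, a minimal NumPy sketch of the squentropy loss might look as follows; the averaging convention over the incorrect classes is an assumption here and may differ from the paper's exact definition.

```python
import numpy as np

def squentropy_loss(logits, labels):
    """Squentropy sketch: softmax cross-entropy plus the average
    squared logit over the incorrect classes."""
    n, c = logits.shape
    # Numerically stable softmax cross-entropy.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    ce = -log_probs[np.arange(n), labels].mean()
    # Average square loss over the c - 1 incorrect classes.
    sq = logits ** 2
    sq[np.arange(n), labels] = 0.0
    square_term = sq.sum(axis=1).mean() / (c - 1)
    return ce + square_term
```

The square term pushes the logits of incorrect classes toward zero, on top of the usual cross-entropy pressure on the correct class.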
arXiv Detail & Related papers (2023-02-08T09:21:13Z)
- Slimmable Networks for Contrastive Self-supervised Learning [69.9454691873866]
Self-supervised learning makes significant progress in pre-training large models, but struggles with small models.
We introduce another one-stage solution to obtain pre-trained small models without the need for extra teachers.
A slimmable network consists of a full network and several weight-sharing sub-networks, which can be pre-trained once to obtain various networks.
arXiv Detail & Related papers (2022-09-30T15:15:05Z)
- Between Stochastic and Adversarial Online Convex Optimization: Improved Regret Bounds via Smoothness [2.628557920905129]
We establish novel regret bounds for online convex optimization in a setting that interpolates between i.i.d. and fully adversarial losses.
To accomplish this goal, we introduce two key quantities associated with the loss sequence, that we call the cumulative variance and the adversarial variation.
In the fully i.i.d. case, our bounds match the rates one would expect from results in acceleration, and in the fully adversarial case they gracefully deteriorate to match the minimax regret.
arXiv Detail & Related papers (2022-02-15T16:39:33Z)
- Mixing between the Cross Entropy and the Expectation Loss Terms [89.30385901335323]
Cross-entropy loss tends to focus on hard-to-classify samples during training.
We show that adding to the optimization goal the expectation loss helps the network to achieve better accuracy.
Our experiments show that the new training protocol improves performance across a diverse set of classification domains.
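A plausible NumPy sketch of such a combined objective is shown below. The expectation loss is taken here as the mean probability mass assigned to incorrect classes, and `alpha` is an assumed mixing weight; the paper's exact formulation and schedule may differ.

```python
import numpy as np

def ce_plus_expectation_loss(logits, labels, alpha=0.5):
    """Sketch of a cross-entropy objective augmented with the
    expectation loss (1 - probability of the true class)."""
    n = logits.shape[0]
    # Numerically stable softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
    p_true = probs[np.arange(n), labels]
    ce = -np.log(p_true + 1e-12).mean()
    expectation = (1.0 - p_true).mean()
    return ce + alpha * expectation
```

With `alpha=0.0` this reduces to ordinary cross-entropy, so the added term can be phased in without changing the baseline objective.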
arXiv Detail & Related papers (2021-09-12T23:14:06Z)
- Understanding the Generalization of Adam in Learning Neural Networks with Proper Regularization [118.50301177912381]
We show that Adam can converge to different solutions of the objective with provably different errors, even with weight decay regularization.
We show that if the objective is convex and weight decay regularization is employed, any optimization algorithm, including Adam, will converge to the same solution.
arXiv Detail & Related papers (2021-08-25T17:58:21Z)
- Distribution of Classification Margins: Are All Data Equal? [61.16681488656473]
We motivate theoretically and show empirically that the area under the curve of the margin distribution on the training set is in fact a good measure of generalization.
The resulting subset of "high capacity" features is not consistent across different training runs.
arXiv Detail & Related papers (2021-07-21T16:41:57Z)
- Predicting Training Time Without Training [120.92623395389255]
We tackle the problem of predicting the number of optimization steps that a pre-trained deep network needs to converge to a given value of the loss function.
We leverage the fact that the training dynamics of a deep network during fine-tuning are well approximated by those of a linearized model.
We are able to predict the time it takes to fine-tune a model to a given loss without having to perform any training.
arXiv Detail & Related papers (2020-08-28T04:29:54Z)
- The Golden Ratio of Learning and Momentum [0.5076419064097732]
This paper proposes a new information-theoretical loss function motivated by neural signal processing in a synapse.
All results taken together show that loss, learning rate, and momentum are closely connected.
arXiv Detail & Related papers (2020-06-08T17:08:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.