Heavy-Tailed Class Imbalance and Why Adam Outperforms Gradient Descent on Language Models
- URL: http://arxiv.org/abs/2402.19449v2
- Date: Fri, 12 Jul 2024 05:10:32 GMT
- Title: Heavy-Tailed Class Imbalance and Why Adam Outperforms Gradient Descent on Language Models
- Authors: Frederik Kunstner, Robin Yadav, Alan Milligan, Mark Schmidt, Alberto Bietti
- Abstract summary: Adam has been shown to outperform gradient descent on large language models by a larger margin than on other tasks.
We show that a key factor in this performance gap is the heavy-tailed class imbalance found in language tasks.
- Score: 23.520679217713685
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Adam has been shown to outperform gradient descent on large language models by a larger margin than on other tasks, but it is unclear why. We show that a key factor in this performance gap is the heavy-tailed class imbalance found in language tasks. When trained with gradient descent, the loss of infrequent words decreases more slowly than the loss of frequent ones. This leads to a slow decrease on the average loss as most samples come from infrequent words. On the other hand, Adam and sign-based methods are less sensitive to this problem. To establish that this behavior is caused by class imbalance, we show empirically that it can be reproduced across architectures and data types, on language transformers, vision CNNs, and linear models. On a linear model with cross-entropy loss, we show that class imbalance leads to imbalanced, correlated gradients and Hessians that have been hypothesized to benefit Adam. We also prove that, in continuous time, gradient descent converges slowly on low-frequency classes while sign descent does not.
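The mechanism described in the abstract can be illustrated with a small, self-contained experiment. The sketch below is not the authors' code: it trains a linear softmax classifier on synthetic data whose class frequencies follow a Zipf-like (heavy-tailed) distribution, compares full-batch gradient descent with sign descent, and reports the final average loss on head versus tail classes. Class counts, dimensions, step sizes, and the iteration budget are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, dim, n_samples = 64, 16, 4096

# Heavy-tailed label distribution: p(class k) proportional to 1/k (Zipf-like).
freq = 1.0 / np.arange(1, n_classes + 1)
freq /= freq.sum()
y = rng.choice(n_classes, size=n_samples, p=freq)

# Each class has its own mean direction; inputs are noisy copies of it.
means = rng.normal(size=(n_classes, dim))
X = means[y] + 0.5 * rng.normal(size=(n_samples, dim))

def per_sample_loss_and_grad(W):
    """Cross-entropy loss per sample and the full-batch gradient w.r.t. W."""
    logits = X @ W.T                              # (n_samples, n_classes)
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)
    losses = -np.log(p[np.arange(n_samples), y] + 1e-12)
    p[np.arange(n_samples), y] -= 1.0             # softmax minus one-hot
    grad = p.T @ X / n_samples                    # (n_classes, dim)
    return losses, grad

head = y < 8                    # samples from the most frequent classes
tail = y >= n_classes // 2      # samples from the least frequent classes

for name, step in [("gradient descent", 0.1), ("sign descent", 0.01)]:
    W = np.zeros((n_classes, dim))
    for _ in range(1000):
        _, grad = per_sample_loss_and_grad(W)
        W -= step * (np.sign(grad) if name == "sign descent" else grad)
    losses, _ = per_sample_loss_and_grad(W)
    print(f"{name:16s}  head loss {losses[head].mean():.3f}   "
          f"tail loss {losses[tail].mean():.3f}")
```

In the language setting of the paper, most samples come from infrequent words, so a per-class gap of this kind translates directly into slow progress on the overall training loss under gradient descent.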
Related papers
- The Implicit Bias of Adam on Separable Data [27.451499849532176]
We show that when the training data are linearly separable, Adam converges towards a linear classifier that achieves the maximum ℓ∞-margin, given diminishing learning rates.
Our results shed light on the difference between Adam and (stochastic) gradient descent from a theoretical perspective.
arXiv Detail & Related papers (2024-06-15T14:39:37Z) - Class Instance Balanced Learning for Long-Tailed Classification [0.0]
Long-tailed image classification task deals with large imbalances in the class frequencies of the training data.
Previous approaches have shown that combining cross-entropy and contrastive learning can improve performance on the long-tailed task.
We propose a novel class instance balanced loss (CIBL), which reweights the relative contributions of a cross-entropy and a contrastive loss as a function of the frequency of class instances in the training batch.
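A hedged sketch of what such a frequency-dependent reweighting between a cross-entropy term and a contrastive term could look like is given below. The function name, the weighting scheme, and the supervised contrastive term are illustrative assumptions, not necessarily the paper's exact formulation of CIBL.

```python
import torch
import torch.nn.functional as F

def class_instance_balanced_loss(logits, features, labels, temperature=0.1):
    """logits: (B, C), features: (B, D) L2-normalized, labels: (B,)."""
    # Per-sample weight from the frequency of its class within the batch.
    counts = torch.bincount(labels, minlength=logits.shape[1]).float()
    batch_freq = counts[labels] / labels.numel()           # in (0, 1]

    ce = F.cross_entropy(logits, labels, reduction="none")

    # A standard supervised contrastive term over the batch.
    sim = features @ features.T / temperature
    sim.fill_diagonal_(-1e9)                                # exclude self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos = (labels[:, None] == labels[None, :]).float().fill_diagonal_(0)
    con = -(pos * log_prob).sum(1) / pos.sum(1).clamp(min=1)

    # Illustrative weighting: samples from frequent classes lean on the
    # cross-entropy term, samples from rare classes on the contrastive term.
    return (batch_freq * ce + (1 - batch_freq) * con).mean()
```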
arXiv Detail & Related papers (2023-07-11T15:09:10Z) - Meta-Learning Online Adaptation of Language Models [88.8947656843812]
Large language models encode impressively broad world knowledge in their parameters.
However, the knowledge in static language models falls out of date, limiting the model's effective "shelf life".
arXiv Detail & Related papers (2023-05-24T11:56:20Z) - The Equalization Losses: Gradient-Driven Training for Long-tailed Object
Recognition [84.51875325962061]
We propose a gradient-driven training mechanism to tackle the long-tail problem.
We introduce a new family of gradient-driven loss functions, namely equalization losses.
Our method consistently outperforms the baseline models.
arXiv Detail & Related papers (2022-10-11T16:00:36Z) - A Theoretical Analysis of the Learning Dynamics under Class Imbalance [0.10231119246773925]
We show that the learning curves for minority and majority classes follow sub-optimal trajectories when training with a gradient-based optimizer.
This slowdown is related to the imbalance ratio and can be traced back to a competition between the optimization of different classes.
We find that GD is not guaranteed to decrease the loss for each class but that this problem can be addressed by performing a per-class normalization of the gradient.
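A minimal sketch of what such a per-class gradient normalization could look like for a linear softmax model is shown below; the exact normalization used in the cited paper may differ, and the function name, step size, and normalization to unit norm are illustrative choices.

```python
import numpy as np

def per_class_normalized_step(W, X, y, lr=0.1, eps=1e-12):
    """One update of W (n_classes x dim): the gradient contribution of each
    class's samples is rescaled to unit norm before the step is taken."""
    n, n_classes = X.shape[0], W.shape[0]
    logits = X @ W.T
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)
    p[np.arange(n), y] -= 1.0                       # softmax minus one-hot
    update = np.zeros_like(W)
    for k in range(n_classes):
        mask = y == k
        if not mask.any():
            continue
        g_k = p[mask].T @ X[mask] / mask.sum()      # gradient from class k's samples
        update += g_k / (np.linalg.norm(g_k) + eps)
    return W - lr * update
```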
arXiv Detail & Related papers (2022-07-01T12:54:38Z) - Neural Collapse Inspired Attraction-Repulsion-Balanced Loss for
Imbalanced Learning [97.81549071978789]
We propose Attraction-Repulsion-Balanced Loss (ARB-Loss) to balance the different components of the gradients.
We perform experiments on the large-scale classification and segmentation datasets and our ARB-Loss can achieve state-of-the-art performance.
arXiv Detail & Related papers (2022-04-19T08:23:23Z) - Rebalanced Siamese Contrastive Mining for Long-Tailed Recognition [120.80038161330623]
We show that supervised contrastive learning suffers a dual class-imbalance problem at both the original batch and Siamese batch levels.
We propose supervised hard positive and negative pairs mining to pick up informative pairs for contrastive computation and improve representation learning.
arXiv Detail & Related papers (2022-03-22T07:30:38Z) - Understanding the Generalization of Adam in Learning Neural Networks
with Proper Regularization [118.50301177912381]
We show that Adam can converge to different solutions of the objective with provably different errors, even with weight decay regularization.
We show that if the objective is convex and weight decay regularization is employed, any optimization algorithm, including Adam, will converge to the same solution.
arXiv Detail & Related papers (2021-08-25T17:58:21Z) - Distributional Robustness Loss for Long-tail Learning [20.800627115140465]
Real-world data is often unbalanced and long-tailed, and deep models struggle to recognize rare classes in the presence of frequent ones.
We show that the feature extractor part of deep networks suffers greatly from this bias.
We propose a new loss based on robustness theory, which encourages the model to learn high-quality representations for both head and tail classes.
arXiv Detail & Related papers (2021-04-07T11:34:04Z) - Understanding self-supervised Learning Dynamics without Contrastive
Pairs [72.1743263777693]
Contrastive approaches to self-supervised learning (SSL) learn representations by minimizing the distance between two augmented views of the same data point.
Non-contrastive methods such as BYOL and SimSiam show remarkable performance without negative pairs.
We study the nonlinear learning dynamics of non-contrastive SSL in simple linear networks.
arXiv Detail & Related papers (2021-02-12T22:57:28Z)