The Fair Language Model Paradox
- URL: http://arxiv.org/abs/2410.11985v1
- Date: Tue, 15 Oct 2024 18:47:12 GMT
- Title: The Fair Language Model Paradox
- Authors: Andrea Pinto, Tomer Galanti, Randall Balestriero,
- Abstract summary: Large Language Models (LLMs) are widely deployed in real-world applications, yet little is known about their training dynamics at the token level.
We show that as weight decay increases, low-frequency tokens are disproportionately depreciated.
This is particularly concerning, as these neglected low-frequency tokens represent the vast majority of the token distribution in most languages.
- Score: 19.439996884827448
- License:
- Abstract: Large Language Models (LLMs) are widely deployed in real-world applications, yet little is known about their training dynamics at the token level. Evaluation typically relies on aggregated training loss, measured at the batch level, which overlooks subtle per-token biases arising from (i) varying token-level dynamics and (ii) structural biases introduced by hyperparameters. While weight decay is commonly used to stabilize training, we reveal that it silently introduces performance biases detectable only at the token level. In fact, we empirically show across different dataset sizes, model architectures and sizes ranging from 270M to 3B parameters that as weight decay increases, low-frequency tokens are disproportionately depreciated. This is particularly concerning, as these neglected low-frequency tokens represent the vast majority of the token distribution in most languages, calling for novel regularization techniques that ensure fairness across all available tokens.
Related papers
- Mitigating Frequency Bias and Anisotropy in Language Model Pre-Training with Syntactic Smoothing [6.726629754291751]
We introduce a method for quantifying the frequency bias of a language model.
We then present a method for reducing the frequency bias of a language model by inducing a syntactic prior over token representations during pre-training.
This approach results in better performance on infrequent English tokens and a decrease in anisotropy.
arXiv Detail & Related papers (2024-10-15T10:09:57Z) - Exact Byte-Level Probabilities from Tokenized Language Models for FIM-Tasks and Model Ensembles [23.134664392314264]
Tokenization is associated with many poorly understood shortcomings in language models (LM)
This work studies how tokenization impacts model performance by analyzing and comparing models with their byte-level counterparts.
We develop a next-byte sampling algorithm that eliminates tokenization bias without requiring further training or optimization.
arXiv Detail & Related papers (2024-10-11T23:30:42Z) - Rethinking Classifier Re-Training in Long-Tailed Recognition: A Simple
Logits Retargeting Approach [102.0769560460338]
We develop a simple logits approach (LORT) without the requirement of prior knowledge of the number of samples per class.
Our method achieves state-of-the-art performance on various imbalanced datasets, including CIFAR100-LT, ImageNet-LT, and iNaturalist 2018.
arXiv Detail & Related papers (2024-03-01T03:27:08Z) - Token-Level Fitting Issues of Seq2seq Models [15.81037035729968]
Sequence-to-sequence (seq2seq) models have been widely used for natural language processing, computer vision, and other deep learning tasks.
We find that seq2seq models trained with early-stopping suffer from issues at the token level.
arXiv Detail & Related papers (2023-05-08T06:40:24Z) - A Theoretical Understanding of Shallow Vision Transformers: Learning,
Generalization, and Sample Complexity [71.11795737362459]
ViTs with self-attention modules have recently achieved great empirical success in many tasks.
However, theoretical learning generalization analysis is mostly noisy and elusive.
This paper provides the first theoretical analysis of a shallow ViT for a classification task.
arXiv Detail & Related papers (2023-02-12T22:12:35Z) - Learning to Generalize to More: Continuous Semantic Augmentation for
Neural Machine Translation [50.54059385277964]
We present a novel data augmentation paradigm termed Continuous Semantic Augmentation (CsaNMT)
CsaNMT augments each training instance with an adjacency region that could cover adequate variants of literal expression under the same meaning.
arXiv Detail & Related papers (2022-04-14T08:16:28Z) - CMW-Net: Learning a Class-Aware Sample Weighting Mapping for Robust Deep
Learning [55.733193075728096]
Modern deep neural networks can easily overfit to biased training data containing corrupted labels or class imbalance.
Sample re-weighting methods are popularly used to alleviate this data bias issue.
We propose a meta-model capable of adaptively learning an explicit weighting scheme directly from data.
arXiv Detail & Related papers (2022-02-11T13:49:51Z) - Prototypical Classifier for Robust Class-Imbalanced Learning [64.96088324684683]
We propose textitPrototypical, which does not require fitting additional parameters given the embedding network.
Prototypical produces balanced and comparable predictions for all classes even though the training set is class-imbalanced.
We test our method on CIFAR-10LT, CIFAR-100LT and Webvision datasets, observing that Prototypical obtains substaintial improvements compared with state of the arts.
arXiv Detail & Related papers (2021-10-22T01:55:01Z) - Token-level Adaptive Training for Neural Machine Translation [84.69646428587548]
There exists a token imbalance phenomenon in natural language as different tokens appear with different frequencies.
vanilla NMT model usually adopts trivial equal-weighted objectives for target tokens with different frequencies.
Low-frequency tokens may carry critical semantic information that will affect the translation quality once they are neglected.
arXiv Detail & Related papers (2020-10-09T05:55:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.