MiLe Loss: a New Loss for Mitigating the Bias of Learning Difficulties in Generative Language Models
- URL: http://arxiv.org/abs/2310.19531v6
- Date: Mon, 25 Mar 2024 08:46:58 GMT
- Title: MiLe Loss: a New Loss for Mitigating the Bias of Learning Difficulties in Generative Language Models
- Authors: Zhenpeng Su, Xing Wu, Xue Bai, Zijia Lin, Hui Chen, Guiguang Ding, Wei Zhou, Songlin Hu
- Abstract summary: We propose a MiLe Loss function for mitigating the bias of learning difficulties with tokens.
We train generative language models at different scales of 468M, 1.2B, and 6.7B parameters.
Experiments reveal that models incorporating the proposed MiLe Loss can gain consistent performance improvement on downstream benchmarks.
- Score: 40.992566245706996
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generative language models are usually pretrained on a large text corpus by predicting the next token (i.e., sub-word/word/phrase) given the previous ones. Recent works have demonstrated the impressive performance of large generative language models on downstream tasks. However, existing generative language models generally neglect an inherent challenge of text corpora during training, i.e., the imbalance between frequent tokens and infrequent ones. This imbalance can lead a language model to be dominated by common, easy-to-learn tokens and thereby to overlook the infrequent, difficult-to-learn ones. To alleviate that, we propose a MiLe Loss function for mitigating the bias of learning difficulties with tokens. During training, it dynamically assesses the learning difficulty of a to-be-learned token according to the information entropy of the corresponding predicted probability distribution over the vocabulary. It then scales the training loss adaptively, trying to lead the model to focus more on the difficult-to-learn tokens. On the Pile dataset, we train generative language models at different scales of 468M, 1.2B, and 6.7B parameters. Experiments reveal that models incorporating the proposed MiLe Loss can gain consistent performance improvement on downstream benchmarks.
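The entropy-based scaling described in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact formula, so the weight `(1 + H) ** gamma` (with `H` the entropy of the predicted distribution), the parameter name `gamma`, and all function names below are assumptions made for illustration, not the authors' implementation.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def difficulty_weight(logits, gamma=1.0):
    """Assumed difficulty weight: grows with the entropy of the prediction."""
    return (1.0 + entropy(softmax(logits))) ** gamma


def entropy_scaled_loss(logits, target, gamma=1.0):
    """Cross-entropy for one token, scaled by the assumed entropy-based weight.

    With gamma = 0 the weight is 1 and this reduces to plain cross-entropy.
    """
    probs = softmax(logits)
    cross_entropy = -math.log(probs[target])
    return difficulty_weight(logits, gamma) * cross_entropy


# Hypothetical logits over a 4-token vocabulary.
confident = [8.0, 0.0, 0.0, 0.0]   # peaked distribution: low entropy
uncertain = [0.5, 0.4, 0.3, 0.2]   # flat distribution: high entropy

print(difficulty_weight(confident), difficulty_weight(uncertain))
print(entropy_scaled_loss(uncertain, target=0, gamma=1.0))
```

Under these assumptions, a token whose predicted distribution is flat (high entropy) receives a larger weight than one predicted confidently, so its loss contributes more to the training signal.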
Related papers
- Mitigating Reversal Curse in Large Language Models via Semantic-aware Permutation Training [57.771940716189114]
We show that large language models (LLMs) suffer from the "reversal curse".
The root cause of the reversal curse lies in the different word order between the training and inference stages.
We propose Semantic-aware Permutation Training (SPT) to address this issue.
arXiv Detail & Related papers (2024-03-01T18:55:20Z) - Robustifying Language Models with Test-Time Adaptation [17.96043752001886]
Large-scale language models have achieved state-of-the-art performance on a number of language tasks.
However, they fail on adversarial language examples, which are sentences optimized to fool the language models while keeping similar semantic meanings for humans.
We show that we can reverse many language adversarial attacks by adapting the input sentence with predictions from masked words.
arXiv Detail & Related papers (2023-10-29T22:37:54Z) - Characterizing Learning Curves During Language Model Pre-Training: Learning, Forgetting, and Stability [25.52470575274251]
We observe that language models generate short repetitive phrases before learning to generate longer and more coherent text.
Individual tokens often exhibit sudden increases or decreases in loss that are surprisingly consistent across pre-training runs.
More frequent tokens reach lower final surprisals, exhibit less variability within and across pre-training runs, are learned earlier, and are less likely to be "forgotten" during pre-training.
arXiv Detail & Related papers (2023-08-29T16:24:09Z) - Training Trajectories of Language Models Across Scales [99.38721327771208]
Scaling up language models has led to unprecedented performance gains.
How do language models of different sizes learn during pre-training?
Why do larger language models demonstrate more desirable behaviors?
arXiv Detail & Related papers (2022-12-19T19:16:29Z) - A Natural Bias for Language Generation Models [31.44752136404971]
We show that we can endow standard neural language generation models with a separate module that reflects unigram frequency statistics as prior knowledge.
We use neural machine translation as a test bed for this simple technique and observe that it: (i) improves learning efficiency; (ii) achieves better overall performance; and, perhaps most importantly, (iii) appears to disentangle strong frequency effects.
arXiv Detail & Related papers (2022-12-19T18:14:36Z) - Language Model Pre-Training with Sparse Latent Typing [66.75786739499604]
We propose a new pre-training objective, Sparse Latent Typing, which enables the model to sparsely extract sentence-level keywords with diverse latent types.
Experimental results show that our model is able to learn interpretable latent type categories in a self-supervised manner without using any external knowledge.
arXiv Detail & Related papers (2022-10-23T00:37:08Z) - Quark: Controllable Text Generation with Reinforced Unlearning [68.07749519374089]
Large-scale language models often learn behaviors that are misaligned with user expectations.
We introduce Quantized Reward Konditioning (Quark), an algorithm for optimizing a reward function that quantifies an (un)wanted property.
For unlearning toxicity, negative sentiment, and repetition, our experiments show that Quark outperforms both strong baselines and state-of-the-art reinforcement learning methods.
arXiv Detail & Related papers (2022-05-26T21:11:51Z) - Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low-resource languages is a challenging task because the patterns are hard to predict.
This work shows a comparison of a neural model and character language models with varying amounts of target language data.
Our usage scenario is interactive correction with nearly zero amounts of training examples, improving models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.