Token-Level Fitting Issues of Seq2seq Models
- URL: http://arxiv.org/abs/2305.04493v2
- Date: Thu, 22 Jun 2023 07:42:08 GMT
- Title: Token-Level Fitting Issues of Seq2seq Models
- Authors: Guangsheng Bao, Zhiyang Teng, Yue Zhang
- Abstract summary: Sequence-to-sequence (seq2seq) models have been widely used for natural language processing, computer vision, and other deep learning tasks.
We find that seq2seq models trained with early-stopping suffer from issues at the token level.
- Score: 15.81037035729968
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sequence-to-sequence (seq2seq) models have been widely used for natural
language processing, computer vision, and other deep learning tasks. We find
that seq2seq models trained with early-stopping suffer from issues at the token
level. In particular, while some tokens in the vocabulary demonstrate
overfitting, others underfit when training is stopped. Experiments show that
the phenomena are pervasive in different models, even in fine-tuned large
pretrained models. We identify three major factors that influence token-level
fitting, which include token frequency, parts-of-speech, and prediction
discrepancy. Further, we find that external factors such as language, model
size, domain, data scale, and pretraining can also influence the fitting of
tokens.
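The token-level diagnostic described above can be approximated directly: compute each token type's average cross-entropy on the training and validation sets at the early-stopping checkpoint, then compare the two. A large positive gap (validation loss well above training loss) for a token type points toward overfitting, while high loss on both splits points toward underfitting. The sketch below is a minimal illustration under assumed inputs (an HF-style seq2seq model and dataloaders yielding `input_ids`/`attention_mask`/`labels` batches); it is not the authors' exact measurement procedure.

```python
# Minimal sketch: per-token-type loss on a data split, plus the train/dev gap.
# Assumptions (not from the paper): HF-style seq2seq model whose forward call
# returns .logits aligned with `labels`, and dataloaders of padded batches.
from collections import defaultdict

import torch
import torch.nn.functional as F


@torch.no_grad()
def per_token_type_loss(model, dataloader, pad_id, device="cpu"):
    """Return {token_id: mean cross-entropy} aggregated over a data split."""
    model.eval().to(device)
    loss_sum, count = defaultdict(float), defaultdict(int)
    for batch in dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        logits = model(**batch).logits                      # (B, T, V)
        labels = batch["labels"]                            # (B, T)
        loss = F.cross_entropy(logits.transpose(1, 2), labels,
                               reduction="none", ignore_index=pad_id)  # (B, T)
        mask = labels != pad_id
        for tok, l in zip(labels[mask].tolist(), loss[mask].tolist()):
            loss_sum[tok] += l
            count[tok] += 1
    return {tok: loss_sum[tok] / count[tok] for tok in count}


def fitting_gap(train_loss, dev_loss):
    """dev minus train loss per token type: a large positive gap looks overfit-like."""
    return {tok: dev_loss[tok] - train_loss[tok]
            for tok in dev_loss if tok in train_loss}
```

Sorting the resulting gaps and cross-referencing each token's corpus frequency or part-of-speech tag is one straightforward way to probe the factors the abstract lists.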
Related papers
- The Fair Language Model Paradox [19.439996884827448]
Large Language Models (LLMs) are widely deployed in real-world applications, yet little is known about their training dynamics at the token level.
We show that as weight decay increases, low-frequency tokens are disproportionately depreciated.
This is particularly concerning, as these neglected low-frequency tokens represent the vast majority of the token distribution in most languages (a frequency-bucketed view of the per-token fitting gap is sketched after this list).
arXiv Detail & Related papers (2024-10-15T18:47:12Z)
- On the Proper Treatment of Tokenization in Psycholinguistics [53.960910019072436]
The paper argues that token-level language models should be marginalized into character-level language models before they are used in psycholinguistic studies.
We find various focal areas whose surprisal is a better psychometric predictor than the surprisal of the region of interest itself.
arXiv Detail & Related papers (2024-10-03T17:18:03Z)
- Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models [4.165536532090932]
The disconnect between tokenizer creation and model training in language models allows for specific inputs, such as the infamous SolidGoldMagikarp token, to induce unwanted model behaviour.
We present a comprehensive analysis of Large Language Model tokenizers, specifically targeting this issue of detecting under-trained tokens.
Through a combination of tokenizer analysis, model weight-based indicators, and prompting techniques, we develop novel and effective methods for automatically detecting these problematic tokens.
arXiv Detail & Related papers (2024-05-08T20:37:56Z)
- IMO: Greedy Layer-Wise Sparse Representation Learning for Out-of-Distribution Text Classification with Pre-trained Models [56.10157988449818]
This study focuses on a specific problem of domain generalization, where a model is trained on one source domain and tested on multiple target domains that are unseen during training.
We propose IMO: Invariant features Masks for Out-of-Distribution text classification, to achieve OOD generalization by learning invariant features.
arXiv Detail & Related papers (2024-04-21T02:15:59Z)
- In-Context Language Learning: Architectures and Algorithms [73.93205821154605]
We study ICL through the lens of a new family of model problems we term in-context language learning (ICLL).
We evaluate a diverse set of neural sequence models on regular ICLL tasks.
arXiv Detail & Related papers (2024-01-23T18:59:21Z)
- MiLe Loss: a New Loss for Mitigating the Bias of Learning Difficulties in Generative Language Models [40.992566245706996]
We propose a MiLe Loss function for mitigating the bias of learning difficulties with tokens.
We train generative language models at different scales of 468M, 1.2B, and 6.7B parameters.
Experiments reveal that models incorporating the proposed MiLe Loss achieve consistent performance improvements on downstream benchmarks.
arXiv Detail & Related papers (2023-10-30T13:33:21Z)
- Training Trajectories of Language Models Across Scales [99.38721327771208]
Scaling up language models has led to unprecedented performance gains.
How do language models of different sizes learn during pre-training?
Why do larger language models demonstrate more desirable behaviors?
arXiv Detail & Related papers (2022-12-19T19:16:29Z)
- Robustness of Demonstration-based Learning Under Limited Data Scenario [54.912936555876826]
Demonstration-based learning has shown great potential in stimulating pretrained language models' ability under limited-data scenarios.
Why such demonstrations are beneficial for the learning process remains unclear since there is no explicit alignment between the demonstrations and the predictions.
In this paper, we design pathological demonstrations by gradually removing intuitively useful information from the standard ones to take a deep dive into the robustness of demonstration-based sequence labeling.
arXiv Detail & Related papers (2022-10-19T16:15:04Z)
- Quark: Controllable Text Generation with Reinforced Unlearning [68.07749519374089]
Large-scale language models often learn behaviors that are misaligned with user expectations.
We introduce Quantized Reward Konditioning (Quark), an algorithm for optimizing a reward function that quantifies an (un)wanted property.
For unlearning toxicity, negative sentiment, and repetition, our experiments show that Quark outperforms both strong baselines and state-of-the-art reinforcement learning methods.
arXiv Detail & Related papers (2022-05-26T21:11:51Z)
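Token frequency recurs as a driving variable both in the abstract above and in The Fair Language Model Paradox entry (low-frequency tokens being disproportionately affected). As a hedged follow-up to the earlier sketch, the per-token gaps can be aggregated by training-corpus frequency; the bucket edges and variable names below are illustrative assumptions, not the protocol of either paper.

```python
# Sketch: average fitting gap (dev - train loss) per token-frequency bucket.
# `fit_gap` comes from fitting_gap(...) in the earlier sketch; `token_freq`
# is a collections.Counter mapping token_id -> occurrences in the training corpus.
def frequency_buckets(fit_gap, token_freq, edges=(10, 100, 1_000, 10_000)):
    labels = ([f"<{edges[0]}"]
              + [f"{lo}-{hi}" for lo, hi in zip(edges, edges[1:])]
              + [f">={edges[-1]}"])
    sums, counts = [0.0] * len(labels), [0] * len(labels)
    for tok, gap in fit_gap.items():
        bucket = sum(token_freq.get(tok, 0) >= e for e in edges)  # bucket index
        sums[bucket] += gap
        counts[bucket] += 1
    return {lab: (sums[i] / counts[i] if counts[i] else None)
            for i, lab in enumerate(labels)}
```

If the systematic low-frequency effects reported in these papers are present, the low-frequency buckets should show markedly different average gaps than the high-frequency ones.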
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.