Overtrained Language Models Are Harder to Fine-Tune
- URL: http://arxiv.org/abs/2503.19206v2
- Date: Fri, 28 Mar 2025 02:10:05 GMT
- Title: Overtrained Language Models Are Harder to Fine-Tune
- Authors: Jacob Mitchell Springer, Sachin Goyal, Kaiyue Wen, Tanishq Kumar, Xiang Yue, Sadhika Malladi, Graham Neubig, Aditi Raghunathan
- Abstract summary: Large language models are pre-trained on ever-growing token budgets. We show that extended pre-training can make models harder to fine-tune, leading to degraded final performance.
- Score: 64.44743256512237
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models are pre-trained on ever-growing token budgets under the assumption that better pre-training performance translates to improved downstream models. In this work, we challenge this assumption and show that extended pre-training can make models harder to fine-tune, leading to degraded final performance. We term this phenomenon catastrophic overtraining. For example, the instruction-tuned OLMo-1B model pre-trained on 3T tokens leads to over 2% worse performance on multiple standard LLM benchmarks than its 2.3T token counterpart. Through controlled experiments and theoretical analysis, we show that catastrophic overtraining arises from a systematic increase in the broad sensitivity of pre-trained parameters to modifications, including but not limited to fine-tuning. Our findings call for a critical reassessment of pre-training design that considers the downstream adaptability of the model.
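A minimal sketch (not the authors' code) of how this effect can be probed: fine-tune several intermediate pre-training checkpoints with one identical recipe and compare their post-fine-tuning benchmark scores. The callables `load_checkpoint`, `fine_tune`, and `evaluate` are hypothetical stand-ins for whatever training and evaluation stack you already use.

```python
from typing import Callable, Dict, List

def probe_overtraining(
    checkpoint_ids: List[str],                 # e.g. checkpoints at 1.5T, 2.3T, and 3T tokens
    load_checkpoint: Callable[[str], object],  # hypothetical: returns the model for a checkpoint id
    fine_tune: Callable[[object], object],     # same instruction-tuning recipe for every checkpoint
    evaluate: Callable[[object], float],       # mean score over the downstream benchmarks of interest
) -> Dict[str, float]:
    """Return post-fine-tuning scores keyed by checkpoint id.

    If catastrophic overtraining is present, scores plateau or degrade for
    checkpoints pre-trained past some token budget, even though their
    pre-training loss kept improving.
    """
    scores: Dict[str, float] = {}
    for ckpt_id in checkpoint_ids:
        model = load_checkpoint(ckpt_id)
        tuned = fine_tune(model)          # identical hyperparameters across checkpoints
        scores[ckpt_id] = evaluate(tuned)
    return scores
```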
Related papers
- Amuro and Char: Analyzing the Relationship between Pre-Training and Fine-Tuning of Large Language Models [17.288865972774587]
We investigate the relationship between pre-training and fine-tuning by fine-tuning multiple intermediate pre-trained model checkpoints.
Our results on 18 datasets suggest that pre-training improves the model in a latent way that is unveiled after fine-tuning.
arXiv Detail & Related papers (2024-08-13T06:28:43Z)
- The Fine Line: Navigating Large Language Model Pretraining with Down-streaming Capability Analysis [27.310894780313618]
This paper undertakes a comprehensive comparison of model capabilities at various pretraining intermediate checkpoints.
We confirm that specific downstream metrics exhibit similar training dynamics across models of different sizes.
In addition to our core findings, we've reproduced Amber and OpenLLaMA, releasing their intermediate checkpoints.
arXiv Detail & Related papers (2024-04-01T16:00:01Z)
- An Emulator for Fine-Tuning Large Language Models using Small Language Models [91.02498576056057]
We introduce emulated fine-tuning (EFT), a principled and practical method for sampling from a distribution that approximates the result of pre-training and fine-tuning at different scales.
We show that EFT enables test-time adjustment of competing behavioral traits like helpfulness and harmlessness without additional training.
Finally, a special case of emulated fine-tuning, which we call LM up-scaling, avoids resource-intensive fine-tuning of large pre-trained models by ensembling them with small fine-tuned models (a sketch of the underlying logit arithmetic appears after this list).
arXiv Detail & Related papers (2023-10-19T17:57:16Z)
- Bi-Drop: Enhancing Fine-tuning Generalization via Synchronous sub-net Estimation and Optimization [58.90989478049686]
Bi-Drop is a fine-tuning strategy that selectively updates model parameters using gradients from various sub-nets.
Experiments on the GLUE benchmark demonstrate that Bi-Drop consistently outperforms previous fine-tuning methods.
arXiv Detail & Related papers (2023-05-24T06:09:26Z)
- Training Trajectories of Language Models Across Scales [99.38721327771208]
Scaling up language models has led to unprecedented performance gains.
How do language models of different sizes learn during pre-training?
Why do larger language models demonstrate more desirable behaviors?
arXiv Detail & Related papers (2022-12-19T19:16:29Z)
- Same Pre-training Loss, Better Downstream: Implicit Bias Matters for Language Models [46.24479693469042]
This paper shows that 1) pre-training loss cannot fully explain downstream performance and 2) flatness of the model is well-correlated with downstream performance where pre-training loss is not.
arXiv Detail & Related papers (2022-10-25T17:45:36Z)
- Adversarial Self-Attention for Language Understanding [89.265747130584]
This paper proposes an Adversarial Self-Attention mechanism (ASA).
ASA adversarially reconstructs the Transformer attentions and facilitates model training from contaminated model structures.
For fine-tuning, ASA-empowered models consistently outperform naive models by a large margin in both generalization and robustness.
arXiv Detail & Related papers (2022-06-25T09:18:10Z)
- Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping [24.547833264405355]
The proposed method achieves a 24% time reduction on average per sample and allows the pre-training to be 2.5 times faster than the baseline.
While being faster, our pre-trained models retain strong knowledge transferability, achieving comparable and sometimes higher GLUE scores than the baseline.
arXiv Detail & Related papers (2020-10-26T06:50:07Z)
- Adversarial Robustness: From Self-Supervised Pre-Training to Fine-Tuning [134.15174177472807]
We introduce adversarial training into self-supervision to provide general-purpose robust pre-trained models for the first time.
We conduct extensive experiments to demonstrate that the proposed framework achieves large performance margins.
arXiv Detail & Related papers (2020-03-28T18:28:33Z)
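As referenced in the emulated fine-tuning (EFT) entry above, the following is a minimal sketch, based only on that summary's description of LM up-scaling, of the kind of logit arithmetic such an ensemble can use: steer a large pre-trained model with the behavioral delta between a small fine-tuned model and its small pre-trained counterpart. The `beta` scale and the assumption of a shared tokenizer are illustrative additions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def eft_next_token_logprobs(
    logits_large_base: torch.Tensor,  # [vocab] logits from the large pre-trained model
    logits_small_base: torch.Tensor,  # [vocab] logits from the small pre-trained model
    logits_small_ft: torch.Tensor,    # [vocab] logits from the small fine-tuned model
    beta: float = 1.0,                # illustrative scale on the fine-tuning "delta" (assumption)
) -> torch.Tensor:
    """Combine per-token log-probabilities so that the large base model is
    steered by the change the small model underwent during fine-tuning:

        log p_eft  ~  log p_large_base + beta * (log p_small_ft - log p_small_base)
    """
    lp_large = F.log_softmax(logits_large_base, dim=-1)
    lp_small_base = F.log_softmax(logits_small_base, dim=-1)
    lp_small_ft = F.log_softmax(logits_small_ft, dim=-1)
    combined = lp_large + beta * (lp_small_ft - lp_small_base)
    return F.log_softmax(combined, dim=-1)  # renormalize before sampling the next token
```

At decode time one would sample the next token from these log-probabilities (or take their argmax), repeating per step with all three models conditioned on the same prefix.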
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.