Same Pre-training Loss, Better Downstream: Implicit Bias Matters for
Language Models
- URL: http://arxiv.org/abs/2210.14199v1
- Date: Tue, 25 Oct 2022 17:45:36 GMT
- Title: Same Pre-training Loss, Better Downstream: Implicit Bias Matters for
Language Models
- Authors: Hong Liu, Sang Michael Xie, Zhiyuan Li, Tengyu Ma
- Abstract summary: This paper shows that 1) pre-training loss cannot fully explain downstream performance and 2) flatness of the model is well-correlated with downstream performance where pre-training loss is not.
- Score: 46.24479693469042
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language modeling on large-scale datasets leads to impressive performance
gains on various downstream language tasks. The validation pre-training loss
(or perplexity in autoregressive language modeling) is often used as the
evaluation metric when developing language models since the pre-training loss
tends to be well-correlated with downstream performance (which is itself
difficult to evaluate comprehensively). Contrary to this conventional wisdom,
this paper shows that 1) pre-training loss cannot fully explain downstream
performance and 2) flatness of the model is well-correlated with downstream
performance where pre-training loss is not. On simplified datasets, we identify
three ways to produce models with the same (statistically optimal) pre-training
loss but different downstream performance: continuing pre-training after
convergence, increasing the model size, and changing the training algorithm.
These experiments demonstrate the existence of implicit bias of pre-training
algorithms/optimizers -- among models with the same minimal pre-training loss,
they implicitly prefer more transferable ones. Toward understanding this
implicit bias, we prove that SGD with standard mini-batch noise implicitly
prefers flatter minima in language models, and empirically observe a strong
correlation between flatness and downstream performance among models with the
same minimal pre-training loss. We also prove in a synthetic language setting
that among the models with the minimal pre-training loss, the flattest model
transfers to downstream tasks.
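The abstract centers on two measurable quantities: validation perplexity (the exponential of the average per-token cross-entropy) and the flatness of the converged model. As a rough illustration only, the sketch below computes perplexity from a validation loss and a simple perturbation-based flatness proxy (average loss increase under Gaussian weight noise). The `model`, `val_loader`, `sigma`, and `n_samples` interfaces are assumptions made for this sketch, and the proxy is a generic stand-in rather than the flatness measure analyzed in the paper.
```python
# Illustrative sketch, not the paper's protocol: validation perplexity from
# average cross-entropy, plus a crude perturbation-based flatness proxy.
# Assumes `model` maps input_ids -> next-token logits of shape
# (batch, seq, vocab) and `val_loader` yields already-aligned (input_ids, labels).
import math
import torch
import torch.nn.functional as F


@torch.no_grad()
def validation_loss(model, val_loader, device="cpu"):
    """Average per-token cross-entropy on the validation set."""
    model.eval()
    total_loss, total_tokens = 0.0, 0
    for input_ids, labels in val_loader:
        input_ids, labels = input_ids.to(device), labels.to(device)
        logits = model(input_ids)
        loss = F.cross_entropy(
            logits.view(-1, logits.size(-1)), labels.view(-1), reduction="sum"
        )
        total_loss += loss.item()
        total_tokens += labels.numel()
    return total_loss / total_tokens


def perplexity(avg_loss):
    """Perplexity is the exponential of the average cross-entropy."""
    return math.exp(avg_loss)


@torch.no_grad()
def sharpness_proxy(model, val_loader, sigma=1e-3, n_samples=5, device="cpu"):
    """Mean loss increase under isotropic Gaussian weight noise.

    A simple stand-in for flatness: for the same noise scale `sigma`,
    a flatter minimum should show a smaller average loss increase.
    """
    base = validation_loss(model, val_loader, device)
    params = [p for p in model.parameters() if p.requires_grad]
    originals = [p.detach().clone() for p in params]
    increases = []
    for _ in range(n_samples):
        for p, p0 in zip(params, originals):
            p.copy_(p0 + sigma * torch.randn_like(p0))
        increases.append(validation_loss(model, val_loader, device) - base)
    for p, p0 in zip(params, originals):  # restore the original weights
        p.copy_(p0)
    return sum(increases) / len(increases)
```
Under this proxy, two models with the same validation perplexity can still differ in sharpness, which is the kind of comparison the paper uses to argue that flatness, not pre-training loss alone, tracks downstream performance.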
Related papers
- Learning with Noisy Foundation Models [95.50968225050012]
This paper is the first work to comprehensively understand and analyze the nature of noise in pre-training datasets.
We propose a tuning method (NMTune) to affine the feature space to mitigate the malignant effect of noise and improve generalization.
arXiv Detail & Related papers (2024-03-11T16:22:41Z)
- Understanding and Mitigating the Label Noise in Pre-training on Downstream Tasks [91.15120211190519]
This paper aims to understand the nature of noise in pre-training datasets and to mitigate its impact on downstream tasks.
We propose a light-weight black-box tuning method (NMTune) to affine the feature space to mitigate the malignant effect of noise.
arXiv Detail & Related papers (2023-09-29T06:18:15Z)
- Task-Robust Pre-Training for Worst-Case Downstream Adaptation [62.05108162160981]
Pre-training has achieved remarkable success when transferred to downstream tasks.
This paper considers pre-training a model that guarantees uniformly good performance over the downstream tasks.
arXiv Detail & Related papers (2023-06-21T07:43:23Z)
- Dynamic Scheduled Sampling with Imitation Loss for Neural Text Generation [10.306522595622651]
We introduce Dynamic Scheduled Sampling with Imitation Loss (DySI), which maintains the schedule based solely on the training time accuracy.
DySI achieves notable improvements on standard machine translation benchmarks, and significantly improves the robustness of other text generation models.
arXiv Detail & Related papers (2023-01-31T16:41:06Z)
- Training Trajectories of Language Models Across Scales [99.38721327771208]
Scaling up language models has led to unprecedented performance gains.
How do language models of different sizes learn during pre-training?
Why do larger language models demonstrate more desirable behaviors?
arXiv Detail & Related papers (2022-12-19T19:16:29Z)
- On the Transferability of Pre-trained Language Models: A Study from Artificial Datasets [74.11825654535895]
Pre-training language models (LMs) on large-scale unlabeled text data makes it much easier for the model to achieve exceptional downstream performance.
We study what specific traits in the pre-training data, other than the semantics, make a pre-trained LM superior to its counterparts trained from scratch on downstream tasks.
arXiv Detail & Related papers (2021-09-08T10:39:57Z)
- How much pretraining data do language models need to learn syntax? [12.668478784932878]
Transformer-based pretrained language models achieve outstanding results in many well-known NLU benchmarks.
We study the impact of pretraining data size on the knowledge of the models using RoBERTa.
arXiv Detail & Related papers (2021-09-07T15:51:39Z)
- Cold-start Active Learning through Self-supervised Language Modeling [15.551710499866239]
Active learning aims to reduce annotation costs by choosing the most critical examples to label.
With BERT, we develop a simple strategy based on the masked language modeling loss.
Compared to other baselines, our approach reaches higher accuracy within fewer sampling iterations and less time.
arXiv Detail & Related papers (2020-10-19T14:09:17Z)