Related papers: Same Pre-training Loss, Better Downstream: Implicit Bias Matters for Language Models

Same Pre-training Loss, Better Downstream: Implicit Bias Matters for Language Models

URL: http://arxiv.org/abs/2210.14199v1
Date: Tue, 25 Oct 2022 17:45:36 GMT
Title: Same Pre-training Loss, Better Downstream: Implicit Bias Matters for Language Models
Authors: Hong Liu, Sang Michael Xie, Zhiyuan Li, Tengyu Ma
Abstract summary: This paper shows that 1) pre-training loss cannot fully explain downstream performance and 2) flatness of the model is well-correlated with downstream performance where pre-training loss is not.
Score: 46.24479693469042
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Language modeling on large-scale datasets leads to impressive performance gains on various downstream language tasks. The validation pre-training loss (or perplexity in autoregressive language modeling) is often used as the evaluation metric when developing language models since the pre-training loss tends to be well-correlated with downstream performance (which is itself difficult to evaluate comprehensively). Contrary to this conventional wisdom, this paper shows that 1) pre-training loss cannot fully explain downstream performance and 2) flatness of the model is well-correlated with downstream performance where pre-training loss is not. On simplified datasets, we identify three ways to produce models with the same (statistically optimal) pre-training loss but different downstream performance: continue pre-training after convergence, increasing the model size, and changing the training algorithm. These experiments demonstrate the existence of implicit bias of pre-training algorithms/optimizers -- among models with the same minimal pre-training loss, they implicitly prefer more transferable ones. Toward understanding this implicit bias, we prove that SGD with standard mini-batch noise implicitly prefers flatter minima in language models, and empirically observe a strong correlation between flatness and downstream performance among models with the same minimal pre-training loss. We also prove in a synthetic language setting that among the models with the minimal pre-training loss, the flattest model transfers to downstream tasks.

Related papers

Overtrained Language Models Are Harder to Fine-Tune [64.44743256512237]
Large language models are pre-trained on ever-growing token budgets. We show that extended pre-training can make models harder to fine-tune, leading to degraded final performance.
arXiv Detail & Related papers (2025-03-24T23:11:56Z)
Learning with Noisy Foundation Models [95.50968225050012]
This paper is the first work to comprehensively understand and analyze the nature of noise in pre-training datasets. We propose a tuning method (NMTune) to affine the feature space to mitigate the malignant effect of noise and improve generalization.
arXiv Detail & Related papers (2024-03-11T16:22:41Z)
Understanding and Mitigating the Label Noise in Pre-training on Downstream Tasks [91.15120211190519]
This paper aims to understand the nature of noise in pre-training datasets and to mitigate its impact on downstream tasks. We propose a light-weight black-box tuning method (NMTune) to affine the feature space to mitigate the malignant effect of noise.
arXiv Detail & Related papers (2023-09-29T06:18:15Z)
Task-Robust Pre-Training for Worst-Case Downstream Adaptation [62.05108162160981]
Pre-training has achieved remarkable success when transferred to downstream tasks. This paper considers pre-training a model that guarantees a uniformly good performance over the downstream tasks.
arXiv Detail & Related papers (2023-06-21T07:43:23Z)
Dynamic Scheduled Sampling with Imitation Loss for Neural Text Generation [10.306522595622651]
We introduce Dynamic Scheduled Sampling with Imitation Loss (DySI), which maintains the schedule based solely on the training time accuracy. DySI achieves notable improvements on standard machine translation benchmarks, and significantly improves the robustness of other text generation models.
arXiv Detail & Related papers (2023-01-31T16:41:06Z)
Training Trajectories of Language Models Across Scales [99.38721327771208]
Scaling up language models has led to unprecedented performance gains. How do language models of different sizes learn during pre-training? Why do larger language models demonstrate more desirable behaviors?
arXiv Detail & Related papers (2022-12-19T19:16:29Z)
On the Transferability of Pre-trained Language Models: A Study from Artificial Datasets [74.11825654535895]
Pre-training language models (LMs) on large-scale unlabeled text data makes the model much easier to achieve exceptional downstream performance. We study what specific traits in the pre-training data, other than the semantics, make a pre-trained LM superior to their counterparts trained from scratch on downstream tasks.
arXiv Detail & Related papers (2021-09-08T10:39:57Z)
How much pretraining data do language models need to learn syntax? [12.668478784932878]
Transformers-based pretrained language models achieve outstanding results in many well-known NLU benchmarks. We study the impact of pretraining data size on the knowledge of the models using RoBERTa.
arXiv Detail & Related papers (2021-09-07T15:51:39Z)
Cold-start Active Learning through Self-supervised Language Modeling [15.551710499866239]
Active learning aims to reduce annotation costs by choosing the most critical examples to label. With BERT, we develop a simple strategy based on the masked language modeling loss. Compared to other baselines, our approach reaches higher accuracy within less sampling iterations and time.
arXiv Detail & Related papers (2020-10-19T14:09:17Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.