Parallel Structures in Pre-training Data Yield In-Context Learning
- URL: http://arxiv.org/abs/2402.12530v1
- Date: Mon, 19 Feb 2024 20:40:48 GMT
- Title: Parallel Structures in Pre-training Data Yield In-Context Learning
- Authors: Yanda Chen, Chen Zhao, Zhou Yu, Kathleen McKeown, He He
- Abstract summary: We study what patterns of the pre-training data contribute to in-context learning (ICL).
We find that LMs' ICL ability depends on $\textit{parallel structures}$ in the pre-training data.
- Score: 41.27837171531926
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained language models (LMs) are capable of in-context learning (ICL):
they can adapt to a task with only a few examples given in the prompt without
any parameter update. However, it is unclear where this capability comes from
as there is a stark distribution shift between pre-training text and ICL
prompts. In this work, we study what patterns of the pre-training data
contribute to ICL. We find that LMs' ICL ability depends on $\textit{parallel
structures}$ in the pre-training data -- pairs of phrases following similar
templates in the same context window. Specifically, we detect parallel
structures by checking whether training on one phrase improves prediction of
the other, and conduct ablation experiments to study their effect on ICL. We
show that removing parallel structures in the pre-training data reduces LMs'
ICL accuracy by 51% (vs 2% from random ablation). This drop persists even when
excluding common patterns such as n-gram repetitions and long-range dependencies,
showing the diversity and generality of parallel structures. A closer look at
the detected parallel structures indicates that they cover diverse linguistic
tasks and span long distances in the data.
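As a concrete illustration of the detection idea stated in the abstract (training on one phrase improves prediction of the other), the sketch below takes a single gradient step on the first phrase of a candidate pair and checks whether the loss on the second phrase drops. The model choice (gpt2), learning rate, single-step recipe, and example phrases are assumptions made for illustration, not details taken from the paper.

```python
# Sketch: flag a phrase pair as a "parallel structure" if a small amount of
# training on phrase_a improves the LM's prediction of phrase_b.
# All hyperparameters here are illustrative assumptions.
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def lm_loss(m, text):
    """Average next-token loss of model m on a phrase."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        return m(ids, labels=ids).loss.item()

def gain_from_training(phrase_a, phrase_b, lr=1e-4):
    """How much does one SGD step on phrase_a reduce the loss on phrase_b?"""
    m = copy.deepcopy(model)              # keep the original weights untouched
    before = lm_loss(m, phrase_b)
    ids = tokenizer(phrase_a, return_tensors="pt").input_ids
    loss = m(ids, labels=ids).loss
    loss.backward()
    with torch.no_grad():
        for p in m.parameters():
            if p.grad is not None:
                p -= lr * p.grad          # single gradient step on phrase_a
    after = lm_loss(m, phrase_b)
    return before - after                 # positive gain suggests a parallel structure

# Hypothetical phrase pair following a similar template in one context window;
# comparing against a random control phrase would give a baseline for the gain.
print(gain_from_training("The capital of France is Paris.",
                         "The capital of Japan is Tokyo."))
```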
Related papers
- How In-Context Learning Emerges from Training on Unstructured Data: On the Role of Co-Occurrence, Positional Information, and Noise Structures [19.841163050181194]
Large language models (LLMs) like transformers have impressive in-context learning (ICL) capabilities.
We investigate how ICL emerges from unsupervised training on unstructured data.
We establish the necessity of positional information and noise structure to generalize ICL to unseen data.
arXiv Detail & Related papers (2024-05-31T18:46:06Z) - Towards a theory of how the structure of language is acquired by deep neural networks [6.363756171493383]
We use a hierarchical generative model that captures the tree-like structure of natural languages.
We show that token-token correlations can be used to build a representation of the grammar's hidden variables.
We conjecture that the relationship between training set size and effective range of correlations holds beyond our datasets.
arXiv Detail & Related papers (2024-05-28T17:01:22Z) - Instruction Position Matters in Sequence Generation with Large Language Models [67.87516654892343]
Large language models (LLMs) are capable of performing conditional sequence generation tasks, such as translation or summarization.
We propose enhancing the instruction-following capability of LLMs by shifting the position of task instructions to after the input sentences (see the prompt-format sketch after this list).
arXiv Detail & Related papers (2023-08-23T12:36:57Z) - Understanding In-Context Learning via Supportive Pretraining Data [55.648777340129364]
In-context learning (ICL) improves language models' performance on a variety of NLP tasks by simply demonstrating a handful of examples at inference time.
It is not well understood why ICL ability emerges, as the model has never been specifically trained on such demonstrations.
Our work takes a first step towards understanding ICL via analyzing instance-level pretraining data.
arXiv Detail & Related papers (2023-06-26T22:14:04Z) - Explaining Emergent In-Context Learning as Kernel Regression [61.57151500616111]
Large language models (LLMs) have initiated a paradigm shift in transfer learning.
In this paper, we investigate the reason why a transformer-based language model can accomplish in-context learning after pre-training.
We find that during ICL, the attention and hidden features in LLMs match the behaviors of a kernel regression.
arXiv Detail & Related papers (2023-05-22T06:45:02Z) - Data Curation Alone Can Stabilize In-context Learning [20.874674130060388]
In-context learning (ICL) enables large language models to perform new tasks by prompting them with a sequence of training examples.
However, randomly sampling examples from a training set leads to high variance in performance.
We show that carefully curating a subset of training data greatly stabilizes ICL performance without any other changes to the ICL algorithm.
arXiv Detail & Related papers (2022-12-20T15:58:54Z) - An Explanation of In-context Learning as Implicit Bayesian Inference [117.19809377740188]
We study the role of the pretraining distribution on the emergence of in-context learning.
We prove that in-context learning occurs implicitly via Bayesian inference of the latent concept.
We empirically find that scaling model size improves in-context accuracy even when the pretraining loss is the same.
arXiv Detail & Related papers (2021-11-03T09:12:33Z) - On the Transferability of Pre-trained Language Models: A Study from Artificial Datasets [74.11825654535895]
Pre-training language models (LMs) on large-scale unlabeled text data makes it much easier for the model to achieve exceptional downstream performance.
We study what specific traits in the pre-training data, other than the semantics, make a pre-trained LM superior to its counterparts trained from scratch on downstream tasks.
arXiv Detail & Related papers (2021-09-08T10:39:57Z)
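The instruction-position idea summarized above ("Instruction Position Matters in Sequence Generation with Large Language Models") can be illustrated with a minimal prompt-construction sketch. The task, instruction wording, and template below are assumptions for illustration, not the paper's exact prompts.

```python
# Illustrative only: two placements of the task instruction relative to the input.
instruction = "Translate the following sentence into German."
source = "The weather is nice today."

prompt_pre = f"{instruction}\n{source}\nTranslation:"    # conventional: instruction before the input
prompt_post = f"{source}\n{instruction}\nTranslation:"   # proposed: instruction after the input

print(prompt_pre)
print(prompt_post)
```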