Parallel Structures in Pre-training Data Yield In-Context Learning
- URL: http://arxiv.org/abs/2402.12530v1
- Date: Mon, 19 Feb 2024 20:40:48 GMT
- Title: Parallel Structures in Pre-training Data Yield In-Context Learning
- Authors: Yanda Chen, Chen Zhao, Zhou Yu, Kathleen McKeown, He He
- Abstract summary: We study what patterns of the pre-training data contribute to in-context learning (ICL).
We find that LMs' ICL ability depends on $\textit{parallel structures}$ in the pre-training data.
- Score: 41.27837171531926
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained language models (LMs) are capable of in-context learning (ICL):
they can adapt to a task with only a few examples given in the prompt without
any parameter update. However, it is unclear where this capability comes from
as there is a stark distribution shift between pre-training text and ICL
prompts. In this work, we study what patterns of the pre-training data
contribute to ICL. We find that LMs' ICL ability depends on $\textit{parallel
structures}$ in the pre-training data -- pairs of phrases following similar
templates in the same context window. Specifically, we detect parallel
structures by checking whether training on one phrase improves prediction of
the other, and conduct ablation experiments to study their effect on ICL. We
show that removing parallel structures in the pre-training data reduces LMs'
ICL accuracy by 51% (vs 2% from random ablation). This drop persists even when
excluding common patterns such as n-gram repetitions and long-range dependencies,
showing the diversity and generality of parallel structures. A closer look at
the detected parallel structures indicates that they cover diverse linguistic
tasks and span long distances in the data.
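As a concrete illustration of the detection idea stated in the abstract (training on one phrase improves prediction of the other), the sketch below takes a single gradient step on the first phrase of a candidate pair and checks whether the loss on the second phrase drops. The model choice (gpt2), learning rate, single-step recipe, and example phrases are assumptions made for illustration, not details taken from the paper.

```python
# Sketch: flag a phrase pair as a "parallel structure" if a small amount of
# training on phrase_a improves the LM's prediction of phrase_b.
# All hyperparameters here are illustrative assumptions.
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def lm_loss(m, text):
    """Average next-token loss of model m on a phrase."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        return m(ids, labels=ids).loss.item()

def gain_from_training(phrase_a, phrase_b, lr=1e-4):
    """How much does one SGD step on phrase_a reduce the loss on phrase_b?"""
    m = copy.deepcopy(model)              # keep the original weights untouched
    before = lm_loss(m, phrase_b)
    ids = tokenizer(phrase_a, return_tensors="pt").input_ids
    loss = m(ids, labels=ids).loss
    loss.backward()
    with torch.no_grad():
        for p in m.parameters():
            if p.grad is not None:
                p -= lr * p.grad          # single gradient step on phrase_a
    after = lm_loss(m, phrase_b)
    return before - after                 # positive gain suggests a parallel structure

# Hypothetical phrase pair following a similar template in one context window;
# comparing against a random control phrase would give a baseline for the gain.
print(gain_from_training("The capital of France is Paris.",
                         "The capital of Japan is Tokyo."))
```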
Related papers
- How In-Context Learning Emerges from Training on Unstructured Data: On the Role of Co-Occurrence, Positional Information, and Noise Structures [19.841163050181194]
Large language models (LLMs) like transformers have impressive in-context learning (ICL) capabilities.
We investigate how ICL emerges from unsupervised training on unstructured data.
We establish the necessity of positional information and noise structure to generalize ICL to unseen data.
arXiv Detail & Related papers (2024-05-31T18:46:06Z) - Towards a theory of how the structure of language is acquired by deep neural networks [6.363756171493383]
We use a hierarchical generative model that captures the tree-like structure of natural languages.
We show that token-token correlations can be used to build a representation of the grammar's hidden variables.
We conjecture that the relationship between training set size and effective range of correlations holds beyond our datasets.
arXiv Detail & Related papers (2024-05-28T17:01:22Z) - Instruction Position Matters in Sequence Generation with Large Language Models [67.87516654892343]
Large language models (LLMs) are capable of performing conditional sequence generation tasks, such as translation or summarization.
We propose enhancing the instruction-following capability of LLMs by shifting the position of task instructions to after the input sentences (see the prompt-format sketch after this list).
arXiv Detail & Related papers (2023-08-23T12:36:57Z) - Understanding In-Context Learning via Supportive Pretraining Data [55.648777340129364]
In-context learning (ICL) improves language models' performance on a variety of NLP tasks by simply demonstrating a handful of examples at inference time.
It is not well understood why ICL ability emerges, as the model has never been specifically trained on such demonstrations.
Our work takes a first step towards understanding ICL via analyzing instance-level pretraining data.
arXiv Detail & Related papers (2023-06-26T22:14:04Z) - Explaining Emergent In-Context Learning as Kernel Regression [61.57151500616111]
Large language models (LLMs) have initiated a paradigm shift in transfer learning.
In this paper, we investigate the reason why a transformer-based language model can accomplish in-context learning after pre-training.
We find that during ICL, the attention and hidden features in LLMs match the behaviors of a kernel regression.
arXiv Detail & Related papers (2023-05-22T06:45:02Z) - Data Curation Alone Can Stabilize In-context Learning [20.874674130060388]
In-context learning (ICL) enables large language models to perform new tasks by prompting them with a sequence of training examples.
However, randomly sampling examples from a training set leads to high variance in performance.
We show that carefully curating a subset of training data greatly stabilizes ICL performance without any other changes to the ICL algorithm.
arXiv Detail & Related papers (2022-12-20T15:58:54Z) - An Explanation of In-context Learning as Implicit Bayesian Inference [117.19809377740188]
We study the role of the pretraining distribution on the emergence of in-context learning.
We prove that in-context learning occurs implicitly via Bayesian inference of the latent concept.
We empirically find that scaling model size improves in-context accuracy even when the pretraining loss is the same.
arXiv Detail & Related papers (2021-11-03T09:12:33Z) - On the Transferability of Pre-trained Language Models: A Study from Artificial Datasets [74.11825654535895]
Pre-training language models (LMs) on large-scale unlabeled text data makes it much easier for the model to achieve exceptional downstream performance.
We study what specific traits in the pre-training data, other than the semantics, make a pre-trained LM superior to its counterparts trained from scratch on downstream tasks.
arXiv Detail & Related papers (2021-09-08T10:39:57Z)
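The instruction-position idea summarized above ("Instruction Position Matters in Sequence Generation with Large Language Models") can be illustrated with a minimal prompt-construction sketch. The task, instruction wording, and template below are assumptions for illustration, not the paper's exact prompts.

```python
# Illustrative only: two placements of the task instruction relative to the input.
instruction = "Translate the following sentence into German."
source = "The weather is nice today."

prompt_pre = f"{instruction}\n{source}\nTranslation:"    # conventional: instruction before the input
prompt_post = f"{source}\n{instruction}\nTranslation:"   # proposed: instruction after the input

print(prompt_pre)
print(prompt_post)
```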