Towards Auto-Regressive Next-Token Prediction: In-Context Learning Emerges from Generalization
- URL: http://arxiv.org/abs/2502.17024v1
- Date: Mon, 24 Feb 2025 10:26:29 GMT
- Title: Towards Auto-Regressive Next-Token Prediction: In-Context Learning Emerges from Generalization
- Authors: Zixuan Gong, Xiaolin Hu, Huayi Tang, Yong Liu,
- Abstract summary: Large language models (LLMs) have demonstrated remarkable in-context learning abilities.<n>This paper investigates how ICL emerges and the impact of pre-training phase on ICL.<n>Our theory is supported by experiments on numerical linear dynamic systems, synthetic GINC and real-world language datasets.
- Score: 26.9153121765435
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have demonstrated remarkable in-context learning (ICL) abilities. However, existing theoretical analysis of ICL primarily exhibits two limitations: (a) Limited i.i.d. Setting. Most studies focus on supervised function learning tasks where prompts are constructed with i.i.d. input-label pairs. This i.i.d. assumption diverges significantly from real language learning scenarios where prompt tokens are interdependent. (b) Lack of Emergence Explanation. Most literature answers what ICL does from an implicit optimization perspective but falls short in elucidating how ICL emerges and the impact of pre-training phase on ICL. In our paper, to extend (a), we adopt a more practical paradigm, auto-regressive next-token prediction (AR-NTP), which closely aligns with the actual training of language models. Specifically, within AR-NTP, we emphasize prompt token-dependency, which involves predicting each subsequent token based on the preceding sequence. To address (b), we formalize a systematic pre-training and ICL framework, highlighting the layer-wise structure of sequences and topics, alongside a two-level expectation. In conclusion, we present data-dependent, topic-dependent and optimization-dependent PAC-Bayesian generalization bounds for pre-trained LLMs, investigating that ICL emerges from the generalization of sequences and topics. Our theory is supported by experiments on numerical linear dynamic systems, synthetic GINC and real-world language datasets.
Related papers
- Relation-R1: Cognitive Chain-of-Thought Guided Reinforcement Learning for Unified Relational Comprehension [12.563060744760651]
Relation-R1 is the first unified relational comprehension framework.
It integrates cognitive chain-of-thought (CoT)-guided Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO)
Experiments on widely-used PSG and SWiG datasets demonstrate that Relation-R1 achieves state-of-the-art performance in both binary and textitN-ary relation understanding.
arXiv Detail & Related papers (2025-04-20T14:50:49Z) - Parallel Structures in Pre-training Data Yield In-Context Learning [41.27837171531926]
We study what patterns of the pre-training data contribute to in-context learning (ICL)
We find that LMs' ICL ability depends on $textitparallel structures$ in the pre-training data.
arXiv Detail & Related papers (2024-02-19T20:40:48Z) - In-context Learning Generalizes, But Not Always Robustly: The Case of Syntax [36.98247762224868]
In-context learning (ICL) is now a common method for teaching large language models (LLMs) new tasks.
Do models infer the underlying structure of the task defined by the context, or do they rely on superficial generalizations that only generalize to identically distributed examples?
In experiments with models from the GPT, PaLM, and Llama 2 families, we find large variance across LMs.
The variance is explained more by the composition of the pre-training corpus and supervision methods than by model size.
arXiv Detail & Related papers (2023-11-13T23:52:43Z) - In-Context Learning Learns Label Relationships but Is Not Conventional
Learning [60.891931501449726]
There is currently no consensus about how in-context learning (ICL) ability of Large Language Models works.
We provide novel insights into how ICL leverages label information, revealing both capabilities and limitations.
Our experiments show that ICL predictions almost always depend on in-context labels and that ICL can learn truly novel tasks in-context.
arXiv Detail & Related papers (2023-07-23T16:54:41Z) - Understanding In-Context Learning via Supportive Pretraining Data [55.648777340129364]
In-context learning (ICL) improves language models' performance on a variety of NLP tasks by simply demonstrating a handful of examples at inference time.
It is not well understood why ICL ability emerges, as the model has never been specifically trained on such demonstrations.
Our work takes a first step towards understanding ICL via analyzing instance-level pretraining data.
arXiv Detail & Related papers (2023-06-26T22:14:04Z) - What and How does In-Context Learning Learn? Bayesian Model Averaging,
Parameterization, and Generalization [111.55277952086155]
We study In-Context Learning (ICL) by addressing several open questions.
We show that, without updating the neural network parameters, ICL implicitly implements the Bayesian model averaging algorithm.
We prove that the error of pretrained model is bounded by a sum of an approximation error and a generalization error.
arXiv Detail & Related papers (2023-05-30T21:23:47Z) - Explaining Emergent In-Context Learning as Kernel Regression [61.57151500616111]
Large language models (LLMs) have initiated a paradigm shift in transfer learning.
In this paper, we investigate the reason why a transformer-based language model can accomplish in-context learning after pre-training.
We find that during ICL, the attention and hidden features in LLMs match the behaviors of a kernel regression.
arXiv Detail & Related papers (2023-05-22T06:45:02Z) - A Theory of Emergent In-Context Learning as Implicit Structure Induction [8.17811111226145]
Scaling large language models leads to an emergent capacity to learn in-context from example demonstrations.
We argue that in-context learning relies on recombination of compositional operations found in natural language data.
We show how in-context learning is supported by a representation of the input's compositional structure.
arXiv Detail & Related papers (2023-03-14T15:24:05Z) - A Survey on In-context Learning [77.78614055956365]
In-context learning (ICL) has emerged as a new paradigm for natural language processing (NLP)
We first present a formal definition of ICL and clarify its correlation to related studies.
We then organize and discuss advanced techniques, including training strategies, prompt designing strategies, and related analysis.
arXiv Detail & Related papers (2022-12-31T15:57:09Z) - A Multi-level Supervised Contrastive Learning Framework for Low-Resource
Natural Language Inference [54.678516076366506]
Natural Language Inference (NLI) is a growingly essential task in natural language understanding.
Here we propose a multi-level supervised contrastive learning framework named MultiSCL for low-resource natural language inference.
arXiv Detail & Related papers (2022-05-31T05:54:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.