Theoretical Understanding of In-Context Learning in Shallow Transformers with Unstructured Data
- URL: http://arxiv.org/abs/2402.00743v2
- Date: Tue, 18 Jun 2024 13:11:32 GMT
- Title: Theoretical Understanding of In-Context Learning in Shallow Transformers with Unstructured Data
- Authors: Yue Xing, Xiaofeng Lin, Chenheng Xu, Namjoon Suh, Qifan Song, Guang Cheng
- Abstract summary: Large language models (LLMs) are powerful models that can learn concepts at the inference stage via in-context learning (ICL).
This paper studies the role of each component in the transformer architecture and provides a theoretical explanation of the architecture's success.
- Score: 21.242708937367865
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Large language models (LLMs) are powerful models that can learn concepts at the inference stage via in-context learning (ICL). While theoretical studies, e.g., \cite{zhang2023trained}, attempt to explain the mechanism of ICL, they assume the input $x_i$ and the output $y_i$ of each demonstration example are in the same token (i.e., structured data). However, in practice, the examples are usually text input, and all words, regardless of their logical relationship, are stored in different tokens (i.e., unstructured data \cite{wibisono2023role}). To understand how LLMs learn from unstructured data in ICL, this paper studies the role of each component in the transformer architecture and provides a theoretical understanding of the architecture's success. In particular, we consider a simple transformer with one or two attention layers and linear regression tasks for the ICL prediction. We observe that (1) a transformer with two layers of (self-)attention and a look-ahead attention mask can learn from the prompt in the unstructured data, and (2) positional encoding can match the $x_i$ and $y_i$ tokens to achieve better ICL performance.
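To make the setting concrete, here is a minimal numpy sketch (ours, not the paper's construction) contrasting the two prompt formats: structured prompts put $x_i$ and $y_i$ in the same token, unstructured prompts split them across tokens, and a one-hot positional channel plus two attention layers with a look-ahead (causal) mask give the shape of the computation the paper analyzes. The weights are random placeholders rather than trained parameters.

```python
# Illustrative sketch of the ICL setting from the abstract (not the paper's
# trained construction): linear-regression demonstrations laid out as tokens,
# structured vs. unstructured, plus two causally masked attention layers.
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 8                       # input dimension, number of demonstrations
w = rng.normal(size=d)            # one linear-regression task vector
X = rng.normal(size=(n, d))
y = X @ w

# Structured prompt: x_i and y_i share a token (one row per example).
structured = np.hstack([X, y[:, None]])             # shape (n, d+1)

# Unstructured prompt: x_i and y_i occupy separate tokens, as in real text.
tokens = []
for i in range(n):
    tokens.append(np.append(X[i], 0.0))             # x-token, label slot empty
    tokens.append(np.append(np.zeros(d), y[i]))     # y-token, input slots empty
unstructured = np.stack(tokens)                     # shape (2n, d+1)

# Positional encoding: a one-hot position channel, enough for attention to
# match each x_i with the y_i token that follows it.
Z = np.hstack([unstructured, np.eye(2 * n)])

def causal_attention(H, Wq, Wk, Wv):
    """One (self-)attention layer with a look-ahead mask: token t can only
    attend to tokens at positions <= t."""
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])
    mask = np.tril(np.ones((len(H), len(H)), dtype=bool))
    scores = np.where(mask, scores, -np.inf)        # look-ahead mask
    A = np.exp(scores - scores.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)               # row-wise softmax
    return A @ V

# Two stacked masked-attention layers with random (untrained) weights, showing
# the shape of the computation the paper analyzes.
m = Z.shape[1]
layer = lambda H: causal_attention(H, *(0.1 * rng.normal(size=(m, m)) for _ in range(3)))
print(layer(layer(Z)).shape)      # (2n, d+1+2n): per-token features
```

With trained weights, the paper's result is that exactly this two-layer masked architecture can learn the linear-regression answer from the unstructured prompt.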
Related papers
- Can Transformers Learn $n$-gram Language Models? [77.35809823602307]
We study transformers' ability to learn random $n$-gram LMs of two kinds.
We find that classic estimation techniques for $n$-gram LMs such as add-$\lambda$ smoothing outperform transformers.
arXiv Detail & Related papers (2024-10-03T21:21:02Z)
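For reference, a short sketch of the add-$\lambda$ (Lidstone) smoothing baseline mentioned above; the toy corpus and the value of $\lambda$ are placeholders, not the paper's experimental setup.

```python
# Add-lambda (Lidstone) smoothing for a bigram LM, the classic baseline the
# summary above compares transformers against.
from collections import Counter

corpus = "a b a b b a a b a b".split()   # placeholder toy corpus
vocab = sorted(set(corpus))
lam = 0.5                                # placeholder smoothing constant

unigram = Counter(corpus[:-1])           # counts of contexts
bigram = Counter(zip(corpus, corpus[1:]))  # counts of (prev, next) pairs

def p_next(prev, nxt):
    """Smoothed bigram probability P(nxt | prev)."""
    return (bigram[(prev, nxt)] + lam) / (unigram[prev] + lam * len(vocab))

print(p_next("a", "b"))                          # 0.75 on this corpus
print(sum(p_next("a", v) for v in vocab))        # sums to 1.0
```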
- Unveiling Induction Heads: Provable Training Dynamics and Feature Learning in Transformers [54.20763128054692]
We study how a two-attention-layer transformer is trained to perform ICL on $n$-gram Markov chain data.
We prove that the gradient flow with respect to a cross-entropy ICL loss converges to a limiting model.
arXiv Detail & Related papers (2024-09-09T18:10:26Z)
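The induction-head mechanism that such two-layer transformers are known to implement can be stated as a plain algorithm. The sketch below is that standard description (find the most recent earlier occurrence of the current token and copy its successor), written directly rather than as a trained model.

```python
# The induction-head mechanism in its usual algorithmic form: to predict the
# next token, look back for the most recent earlier occurrence of the current
# token and copy the token that followed it.
def induction_head_predict(seq):
    cur = seq[-1]
    for i in range(len(seq) - 2, -1, -1):   # scan backwards over the history
        if seq[i] == cur:
            return seq[i + 1]               # copy the successor
    return None                             # current token never seen before

print(induction_head_predict(list("abcab")))  # -> 'c' (pattern "ab...ab" -> c)
```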
- From Unstructured Data to In-Context Learning: Exploring What Tasks Can Be Learned and When [19.841163050181194]
Large language models (LLMs) like transformers demonstrate impressive in-context learning (ICL) capabilities.
We examine what enables ICL in models trained on unstructured data, focusing on critical sequence model requirements and training data structure.
We find that many ICL capabilities can emerge simply from co-occurrence of semantically related word pairs in unstructured data.
We identify two cases where ICL fails: one in logic reasoning tasks that require generalizing to new, unseen patterns, and another in analogy completion where relevant word pairs appear only in fixed training positions.
arXiv Detail & Related papers (2024-05-31T18:46:06Z)
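A toy illustration of that co-occurrence finding, with an invented corpus: a PMI-style co-occurrence score alone completes a "paris : france :: tokyo : ?" query, with no learned in-context relation involved, mirroring the paper's point that co-occurrence statistics can already account for such completions.

```python
# Co-occurrence of semantically related word pairs in unstructured sentences
# can support an ICL-style completion on its own (corpus invented here).
from collections import Counter
from itertools import combinations

corpus = [
    "paris is the capital of france",
    "france is proud of paris",
    "tokyo is the capital of japan",
    "japan is proud of tokyo",
    "berlin is the capital of germany",
    "germany is proud of berlin",
]
cooc, freq = Counter(), Counter()
for sent in corpus:
    words = set(sent.split())
    freq.update(words)
    for a, b in combinations(words, 2):     # sentence-level co-occurrence
        cooc[(a, b)] += 1
        cooc[(b, a)] += 1

def associate(query):
    """Strongest associate of `query` by a PMI-style score, so frequent
    function words ("is", "of") do not dominate."""
    cands = freq.keys() - {query}
    return max(cands, key=lambda w: cooc[(query, w)] / freq[w])

# "paris : france :: tokyo : ?" answered purely from co-occurrence counts.
print(associate("tokyo"))   # -> 'japan'
```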
- Transformers Can Represent $n$-gram Language Models [56.06361029539347]
We focus on the relationship between transformer LMs and $n$-gram LMs, a simple and historically relevant class of language models.
We show that transformer LMs using hard or sparse attention can exactly represent any $n$-gram LM.
arXiv Detail & Related papers (2024-04-23T12:51:37Z)
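The flavor of that representational result can be sketched directly: a hard (one-hot) attention head copies the token $k$ positions back, so a few such heads recover the $n$-gram context, and a lookup over that context emits the next-token distribution. The probabilities below are placeholders, and the sketch illustrates the idea rather than reproducing the paper's formal construction.

```python
# Hard attention puts all its weight on a single position, so a head that
# attends exactly k steps back copies the k-th previous token.
import numpy as np

def hard_attention_copy(tokens, k):
    """One-hot ("hard") attention: position t attends only to position t-k
    (clamped to 0 at the start of the sequence)."""
    T = len(tokens)
    A = np.zeros((T, T))
    for t in range(T):
        A[t, max(t - k, 0)] = 1.0            # one-hot attention weights
    return A @ np.asarray(tokens, dtype=float)

# Trigram LM: the context for the next token is (token[t-1], token[t]); one
# hard head (k=1) plus the current token recover it, and a lookup table maps
# the context to next-token probabilities (placeholder numbers).
tokens = [0, 1, 0, 1, 1]
prev1 = hard_attention_copy(tokens, 1)
context = (int(prev1[-1]), tokens[-1])       # = (1, 1) for this sequence
table = {(0, 0): [0.5, 0.5], (0, 1): [0.9, 0.1],
         (1, 0): [0.2, 0.8], (1, 1): [0.4, 0.6]}
print(context, table[context])               # next-token distribution
```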
- Parallel Structures in Pre-training Data Yield In-Context Learning [41.27837171531926]
We study what patterns of the pre-training data contribute to in-context learning (ICL).
We find that LMs' ICL ability depends on \textit{parallel structures} in the pre-training data.
arXiv Detail & Related papers (2024-02-19T20:40:48Z)
- How Do Transformers Learn In-Context Beyond Simple Functions? A Case Study on Learning with Representations [98.7450564309923]
This paper takes initial steps on understanding in-context learning (ICL) in more complex scenarios, by studying learning with representations.
We construct synthetic in-context learning problems with a compositional structure, where the label depends on the input through a possibly complex but fixed representation function.
We show theoretically the existence of transformers with mild depth and size that approximately implement in-context algorithms for these problems.
arXiv Detail & Related papers (2023-10-16T17:40:49Z)
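The synthetic construction reads roughly as follows in code: a fixed random MLP plays the role of the shared representation $\Phi$, each task draws its own linear head $w$, and labels are $\langle w, \Phi(x)\rangle$. Dimensions, the form of $\Phi$, and the ridge solver used to recover $w$ from demonstrations are illustrative choices, not the paper's.

```python
# Compositional ICL setup: one fixed representation Phi shared by all tasks,
# with a task-specific linear map on top of it.
import numpy as np

rng = np.random.default_rng(1)
d, r, n_tasks, n_examples = 5, 3, 4, 6     # placeholder dimensions

# Fixed representation: one random ReLU layer, shared across all tasks.
W1 = rng.normal(size=(d, r))
phi = lambda x: np.maximum(x @ W1, 0.0)

prompts = []
for _ in range(n_tasks):
    w = rng.normal(size=r)                 # task-specific linear head
    X = rng.normal(size=(n_examples, d))
    y = phi(X) @ w                         # label = <w, Phi(x)>
    prompts.append((X, y))                 # one ICL prompt per task

# An in-context learner must infer w from the prompt while Phi stays fixed;
# here ridge regression on Phi(X) recovers the head from the demonstrations.
X, y = prompts[0]
feats = phi(X)
w_hat = np.linalg.solve(feats.T @ feats + 1e-3 * np.eye(r), feats.T @ y)
x_query = rng.normal(size=d)
print(phi(x_query) @ w_hat)                # prediction for a query input
```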
- Explaining Emergent In-Context Learning as Kernel Regression [61.57151500616111]
Large language models (LLMs) have initiated a paradigm shift in transfer learning.
In this paper, we investigate the reason why a transformer-based language model can accomplish in-context learning after pre-training.
We find that during ICL, the attention and hidden features in LLMs match the behaviors of a kernel regression.
arXiv Detail & Related papers (2023-05-22T06:45:02Z)
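That correspondence is easy to state concretely: softmax attention from a query onto demonstration $(x_i, y_i)$ pairs is exactly Nadaraya-Watson kernel regression with an exponential kernel, as in the sketch below (synthetic data, for illustration only).

```python
# Softmax attention over demonstrations as Nadaraya-Watson kernel regression
# with kernel K(q, x) = exp(q.x / sqrt(d)).
import numpy as np

rng = np.random.default_rng(2)
d, n = 4, 32
w = rng.normal(size=d)
X = rng.normal(size=(n, d))                 # demonstration inputs
y = X @ w + 0.1 * rng.normal(size=n)        # noisy demonstration labels
q = rng.normal(size=d)                      # query input

scores = X @ q / np.sqrt(d)                 # attention logits
attn = np.exp(scores - scores.max())
attn /= attn.sum()                          # softmax over demonstrations

y_hat = attn @ y                            # attention output = kernel estimate
print(y_hat, q @ w)                         # estimate vs. true label
```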
- Structural Biases for Improving Transformers on Translation into Morphologically Rich Languages [120.74406230847904]
The first approach, TP-Transformer, augments the traditional Transformer architecture with an additional component that represents structure.
The second imbues structure at the data level by segmenting the data with morphological tokenization.
We find that each approach lets the network achieve better performance, but the improvement depends on the size of the dataset.
arXiv Detail & Related papers (2022-08-11T22:42:24Z)