Related papers: Context-Scaling versus Task-Scaling in In-Context Learning

Context-Scaling versus Task-Scaling in In-Context Learning

URL: http://arxiv.org/abs/2410.12783v1
Date: Wed, 16 Oct 2024 17:58:08 GMT
Title: Context-Scaling versus Task-Scaling in In-Context Learning
Authors: Amirhesam Abedsoltan, Adityanarayanan Radhakrishnan, Jingfeng Wu, Mikhail Belkin,
Abstract summary: We analyze two key components of In-Context Learning (ICL): context-scaling and task-scaling. While transformers are capable of both context-scaling and task-scaling, we empirically show that standard Multi-Layer Perceptrons (MLPs) with vectorized input are only capable of task-scaling.
Score: 17.36757113301424
License: http://creativecommons.org/publicdomain/zero/1.0/
Abstract: Transformers exhibit In-Context Learning (ICL), where these models solve new tasks by using examples in the prompt without additional training. In our work, we identify and analyze two key components of ICL: (1) context-scaling, where model performance improves as the number of in-context examples increases and (2) task-scaling, where model performance improves as the number of pre-training tasks increases. While transformers are capable of both context-scaling and task-scaling, we empirically show that standard Multi-Layer Perceptrons (MLPs) with vectorized input are only capable of task-scaling. To understand how transformers are capable of context-scaling, we first propose a significantly simplified transformer architecture without key, query, value weights. We show that it performs ICL comparably to the original GPT-2 model in various statistical learning tasks including linear regression, teacher-student settings. Furthermore, a single block of our simplified transformer can be viewed as data dependent feature map followed by an MLP. This feature map on its own is a powerful predictor that is capable of context-scaling but is not capable of task-scaling. We show empirically that concatenating the output of this feature map with vectorized data as an input to MLPs enables both context-scaling and task-scaling. This finding provides a simple setting to study context and task-scaling for ICL.

Related papers

In-Context Occam's Razor: How Transformers Prefer Simpler Hypotheses on the Fly [25.47694115798524]
In-context learning (ICL) enables transformers to adapt to new tasks through contextual examples without parameter updates.<n>This paper investigates how transformers navigate hierarchical task structures where higher-complexity categories can perfectly represent any pattern generated by simpler ones.
arXiv Detail & Related papers (2025-06-24T06:33:00Z)
Learning Task Representations from In-Context Learning [73.72066284711462]
Large language models (LLMs) have demonstrated remarkable proficiency in in-context learning. We introduce an automated formulation for encoding task information in ICL prompts as a function of attention heads. We show that our method's effectiveness stems from aligning the distribution of the last hidden state with that of an optimally performing in-context-learned model.
arXiv Detail & Related papers (2025-02-08T00:16:44Z)
In-Context Learning with Representations: Contextual Generalization of Trained Transformers [66.78052387054593]
In-context learning (ICL) refers to a capability of pretrained large language models, which can learn a new task given a few examples during inference. This paper investigates the training dynamics of transformers by gradient descent through the lens of non-linear regression tasks.
arXiv Detail & Related papers (2024-08-19T16:47:46Z)
Parameter-Efficient and Memory-Efficient Tuning for Vision Transformer: A Disentangled Approach [87.8330887605381]
We show how to adapt a pre-trained Vision Transformer to downstream recognition tasks with only a few learnable parameters. We synthesize a task-specific query with a learnable and lightweight module, which is independent of the pre-trained model. Our method achieves state-of-the-art performance under memory constraints, showcasing its applicability in real-world situations.
arXiv Detail & Related papers (2024-07-09T15:45:04Z)
Knowledge Composition using Task Vectors with Learned Anisotropic Scaling [51.4661186662329]
We introduce aTLAS, an algorithm that linearly combines parameter blocks with different learned coefficients, resulting in anisotropic scaling at the task vector level. We show that such linear combinations explicitly exploit the low intrinsicity of pre-trained models, with only a few coefficients being the learnable parameters. We demonstrate the effectiveness of our method in task arithmetic, few-shot recognition and test-time adaptation, with supervised or unsupervised objectives.
arXiv Detail & Related papers (2024-07-03T07:54:08Z)
Does learning the right latent variables necessarily improve in-context learning? [13.828665019247444]
Large autoregressive models like Transformers can solve tasks through in-context learning (ICL) without learning new weights. In this paper, we investigate the effect of explicitly inferring task latents. We find little discernible difference between the two; biasing towards task-relevant latent variables does not lead to better out-of-distribution performance.
arXiv Detail & Related papers (2024-05-29T15:06:10Z)
Is Mamba Capable of In-Context Learning? [63.682741783013306]
State of the art foundation models such as GPT-4 perform surprisingly well at in-context learning (ICL) This work provides empirical evidence that Mamba, a newly proposed state space model, has similar ICL capabilities.
arXiv Detail & Related papers (2024-02-05T16:39:12Z)
In-Context Learning for MIMO Equalization Using Transformer-Based Sequence Models [44.161789477821536]
Large pre-trained sequence models have the capacity to carry out in-context learning (ICL) In ICL, a decision on a new input is made via a direct mapping of the input and of a few examples from the given task. We demonstrate via numerical results that transformer-based ICL has a threshold behavior.
arXiv Detail & Related papers (2023-11-10T15:09:04Z)
Pretraining Data Mixtures Enable Narrow Model Selection Capabilities in Transformer Models [9.340409961107955]
Transformer models have the remarkable ability to perform in-context learning (ICL) We study how effectively transformers can bridge between their pretraining data mixture. Our results highlight that the impressive ICL abilities of high-capacity sequence models may be more closely tied to the coverage of their pretraining data mixtures than inductive biases.
arXiv Detail & Related papers (2023-11-01T21:41:08Z)
How Do Transformers Learn In-Context Beyond Simple Functions? A Case Study on Learning with Representations [98.7450564309923]
This paper takes initial steps on understanding in-context learning (ICL) in more complex scenarios, by studying learning with representations. We construct synthetic in-context learning problems with a compositional structure, where the label depends on the input through a possibly complex but fixed representation function. We show theoretically the existence of transformers that approximately implement such algorithms with mild depth and size.
arXiv Detail & Related papers (2023-10-16T17:40:49Z)
Transformers as Statisticians: Provable In-Context Learning with In-Context Algorithm Selection [88.23337313766353]
This work first provides a comprehensive statistical theory for transformers to perform ICL. We show that transformers can implement a broad class of standard machine learning algorithms in context. A emphsingle transformer can adaptively select different base ICL algorithms.
arXiv Detail & Related papers (2023-06-07T17:59:31Z)

This list is automatically generated from the titles and abstracts of the papers in this site.