In-Context Learning Creates Task Vectors
- URL: http://arxiv.org/abs/2310.15916v1
- Date: Tue, 24 Oct 2023 15:17:14 GMT
- Title: In-Context Learning Creates Task Vectors
- Authors: Roee Hendel, Mor Geva, Amir Globerson
- Abstract summary: In-context learning (ICL) in Large Language Models (LLMs) has emerged as a powerful new learning paradigm.
Here we show that the functions learned by ICL often have a very simple structure.
We support the above claim via comprehensive experiments across a range of models and tasks.
- Score: 40.990432572831885
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In-context learning (ICL) in Large Language Models (LLMs) has emerged as a
powerful new learning paradigm. However, its underlying mechanism is still not
well understood. In particular, it is challenging to map it to the "standard"
machine learning framework, where one uses a training set $S$ to find a
best-fitting function $f(x)$ in some hypothesis class. Here we make progress on
this problem by showing that the functions learned by ICL often have a very
simple structure: they correspond to the transformer LLM whose only inputs are
the query $x$ and a single "task vector" calculated from the training set.
Thus, ICL can be seen as compressing $S$ into a single task vector
$\boldsymbol{\theta}(S)$ and then using this task vector to modulate the
transformer to produce the output. We support the above claim via comprehensive
experiments across a range of models and tasks.
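The mechanism described above can be pictured with a short sketch: read the hidden state of the final prompt token at an intermediate layer as the task vector $\boldsymbol{\theta}(S)$, then patch that vector into a forward pass on the query alone. This is a minimal illustration assuming a Hugging Face causal LM with GPT-2-style layer naming; the layer index, prompt format, and hook placement are illustrative assumptions, not the paper's exact protocol.
```python
# Hedged sketch of task-vector extraction and patching (not the authors' code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # any causal LM; the layer naming below is GPT-2-specific
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()
L = 6  # intermediate layer to read and patch; chosen arbitrarily here

def task_vector(demos: str) -> torch.Tensor:
    """theta(S): hidden state of the last prompt token at layer L."""
    ids = tok(demos, return_tensors="pt").input_ids
    with torch.no_grad():
        hs = model(ids, output_hidden_states=True).hidden_states
    return hs[L][0, -1]  # shape: (hidden_dim,)

def patched_predict(query: str, theta: torch.Tensor) -> str:
    """Run the query alone, overwriting the last token's layer-L state."""
    ids = tok(query, return_tensors="pt").input_ids

    def hook(module, inputs, output):
        # GPT-2 blocks return a tuple whose first element is the hidden states;
        # mutate it in place so downstream layers see the injected task vector.
        output[0][0, -1] = theta

    handle = model.transformer.h[L - 1].register_forward_hook(hook)
    try:
        with torch.no_grad():
            logits = model(ids).logits
    finally:
        handle.remove()
    return tok.decode([logits[0, -1].argmax().item()])

demos = "apple -> red\nbanana -> yellow\nlime ->"  # ICL prompt ending at the query
theta = task_vector(demos)
print(patched_predict("sky ->", theta))  # zero-shot query, modulated by theta(S)
```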
Related papers
- Learning to grok: Emergence of in-context learning and skill composition in modular arithmetic tasks [5.358878931933351]
We study the emergence of in-context learning and skill composition in a collection of modular arithmetic tasks.
We show that a GPT-style transformer exhibits a transition from in-distribution to out-of-distribution generalization as the number of pre-training tasks increases.
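As a concrete picture of that setup, each in-context task can be a modular arithmetic map with demonstrations presented as input-output pairs in the prompt; the sketch below is a hedged illustration, and the task family and formatting are assumptions rather than the paper's exact data pipeline.
```python
# Illustrative generator for modular-arithmetic ICL prompts (assumed format).
import random

def make_prompt(p: int = 7, n_demos: int = 4) -> str:
    """Build an in-context prompt for the task (a, b) -> (a + b) mod p."""
    lines = []
    for _ in range(n_demos):
        a, b = random.randrange(p), random.randrange(p)
        lines.append(f"{a} + {b} = {(a + b) % p}")
    a, b = random.randrange(p), random.randrange(p)
    lines.append(f"{a} + {b} =")  # query: the model must complete the answer
    return "\n".join(lines)

print(make_prompt())
```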
arXiv Detail & Related papers (2024-06-04T17:59:36Z)
- Theoretical Understanding of In-Context Learning in Shallow Transformers with Unstructured Data
Large language models (LLMs) are powerful models that can learn concepts at the inference stage via in-context learning (ICL).
This paper studies the role of each component in the transformer architecture and provides a theoretical understanding to explain the success of the architecture.
arXiv Detail & Related papers (2024-02-01T16:39:45Z)
- Metalearning with Very Few Samples Per Task [19.78398372660794]
We consider a binary classification setting where tasks are related by a shared representation.
Here, the amount of data is measured in terms of the number of tasks $t$ that we need to see and the number of samples $n$ per task.
Our work also yields a characterization of distribution-free multitask learning and reductions between meta and multitask learning.
arXiv Detail & Related papers (2023-12-21T16:06:44Z)
- Efficiently Learning One-Hidden-Layer ReLU Networks via Schur Polynomials [50.90125395570797]
We study the problem of PAC learning a linear combination of $k$ ReLU activations under the standard Gaussian distribution on $\mathbb{R}^d$ with respect to the square loss.
Our main result is an efficient algorithm for this learning task with sample and computational complexity $(dk/\epsilon)^{O(k)}$, where $\epsilon>0$ is the target accuracy.
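To make the learning problem concrete, here is a small sketch of the target class: labels generated by a linear combination of $k$ ReLU activations on standard-Gaussian inputs. This is an assumption-laden illustration of the setup, not the paper's algorithm.
```python
# Sketch of the target class: f(x) = sum_i c_i * relu(<w_i, x>), x ~ N(0, I_d).
import torch

d, k, n = 10, 3, 1000
w = torch.randn(k, d)        # hidden-unit weight vectors (unknown to the learner)
c = torch.randn(k)           # combination coefficients (unknown to the learner)
x = torch.randn(n, d)        # standard Gaussian inputs on R^d
y = torch.relu(x @ w.T) @ c  # square-loss target values
# The learner sees (x, y) and must output f_hat with squared error <= epsilon,
# which the paper shows is achievable in (dk/epsilon)^{O(k)} samples and time.
```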
arXiv Detail & Related papers (2023-07-24T14:37:22Z)
- Blessing of Class Diversity in Pre-training [54.335530406959435]
We prove that when the classes of the pre-training task are sufficiently diverse, pre-training can significantly improve the sample efficiency of downstream tasks.
Our proof relies on a vector-form Rademacher complexity chain rule for composite function classes and a modified self-concordance condition.
arXiv Detail & Related papers (2022-09-07T20:10:12Z)
- On the Power of Multitask Representation Learning in Linear MDP [61.58929164172968]
This paper presents analyses for the statistical benefit of multitask representation learning in linear Markov Decision Processes (MDPs).
We first discover a Least-Activated-Feature-Abundance (LAFA) criterion, denoted as $\kappa$, with which we prove that a straightforward least-squares algorithm learns a policy which is $\tilde{O}\left(H^2\sqrt{\frac{\kappa\,\mathcal{C}(\Phi)^2\,\kappa d}{NT}+\frac{\kappa d}{n}}\right)$ suboptimal.
arXiv Detail & Related papers (2021-06-15T11:21:06Z)
- On the Theory of Transfer Learning: The Importance of Task Diversity [114.656572506859]
We consider $t+1$ tasks parameterized by functions of the form $f_j \circ h$ in a general function class $\mathcal{F} \circ \mathcal{H}$.
We show that for diverse training tasks the sample complexity needed to learn the shared representation across the first $t$ training tasks scales as $C(\mathcal{H}) + t\,C(\mathcal{F})$.
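The composite structure $f_j \circ h$ can be sketched as one shared representation with per-task heads; the snippet below is a schematic under assumed linear classes, with dimensions chosen purely for illustration.
```python
# Schematic of t+1 tasks sharing a representation h, each with its own head f_j.
import torch
import torch.nn as nn

d, r, t = 32, 8, 5                 # input dim, representation dim, # training tasks
h = nn.Linear(d, r)                # shared representation, drawn from class H
f = nn.ModuleList(nn.Linear(r, 1) for _ in range(t + 1))  # heads f_0..f_t from F

def predict(j: int, x: torch.Tensor) -> torch.Tensor:
    """Prediction for task j is the composition f_j(h(x))."""
    return f[j](h(x))

print(predict(0, torch.randn(4, d)).shape)  # torch.Size([4, 1])
```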
arXiv Detail & Related papers (2020-06-20T20:33:59Z)
- On the Modularity of Hypernetworks [103.1147622394852]
We show that for a structured target function, the overall number of trainable parameters in a hypernetwork is smaller by orders of magnitude than that of a standard neural network or an embedding method.
arXiv Detail & Related papers (2020-02-23T22:51:52Z)