Is In-Context Learning a Type of Gradient-Based Learning? Evidence from the Inverse Frequency Effect in Structural Priming
- URL: http://arxiv.org/abs/2406.18501v1
- Date: Wed, 26 Jun 2024 17:06:41 GMT
- Title: Is In-Context Learning a Type of Gradient-Based Learning? Evidence from the Inverse Frequency Effect in Structural Priming
- Authors: Zhenghao Zhou, Robert Frank, R. Thomas McCoy
- Abstract summary: Large language models (LLMs) have shown the emergent capability of in-context learning (ICL).
We introduce a new way of diagnosing whether ICL is functionally equivalent to gradient-based learning.
- Score: 6.408190458163885
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large language models (LLMs) have shown the emergent capability of in-context learning (ICL). One line of research has explained ICL as functionally performing gradient descent. In this paper, we introduce a new way of diagnosing whether ICL is functionally equivalent to gradient-based learning. Our approach is based on the inverse frequency effect (IFE) -- a phenomenon in which an error-driven learner is expected to show larger updates when trained on infrequent examples than on frequent ones. The IFE has previously been studied in psycholinguistics because humans show this effect in the context of structural priming (the tendency for people to produce sentence structures they have encountered recently); the IFE has been used as evidence that human structural priming must involve error-driven learning mechanisms. In our experiments, we simulated structural priming within ICL and found that LLMs display the IFE, with the effect being stronger in larger models. We conclude that ICL is indeed a type of gradient-based learning, supporting the hypothesis that a gradient component is implicitly computed in the forward pass during ICL. Our results suggest that both humans and LLMs make use of gradient-based, error-driven processing mechanisms.
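The diagnostic is simple enough to sketch. Below is a minimal, hypothetical Python probe (using Hugging Face transformers) for the IFE within ICL: place a prime sentence in the context, then compare the log-probabilities the model assigns to two alternative target structures. The model choice, the sentences, and which dative construction counts as "frequent" (in reality this is verb-dependent and estimated from corpora) are illustrative assumptions, not the authors' materials.

```python
# A sketch (not the authors' code) of probing for the inverse frequency
# effect within in-context learning via structural priming.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in; the paper evaluates LLMs of varying sizes
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def continuation_logprob(context: str, continuation: str) -> float:
    """Sum of token log-probabilities of `continuation` given `context`."""
    ctx_len = tok(context, return_tensors="pt").input_ids.shape[1]
    full = tok(context + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logprobs = model(full).logits.log_softmax(-1)
    cont_ids = full[0, ctx_len:]           # continuation tokens
    preds = logprobs[0, ctx_len - 1 : -1]  # positions that predict them
    return preds.gather(-1, cont_ids.unsqueeze(-1)).sum().item()

# Illustrative dative alternation; the frequency labels are assumptions.
primes = {
    "DO (assumed frequent)": "The teacher gave the student a book. ",
    "PO (assumed infrequent)": "The teacher gave a book to the student. ",
}
target_context = "The chef sent"
do_target = " the critic a letter."
po_target = " a letter to the critic."

for label, prime in primes.items():
    pref = (continuation_logprob(prime + target_context, do_target)
            - continuation_logprob(prime + target_context, po_target))
    print(f"after {label} prime: DO-over-PO log-odds = {pref:.3f}")
# IFE signature: the preference shifts more toward the primed structure
# when the prime uses the less frequent construction.
```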
Related papers
- Can In-context Learning Really Generalize to Out-of-distribution Tasks? [36.11431280689549]
We investigate the mechanism of in-context learning (ICL) on out-of-distribution (OOD) tasks that were not encountered during training.
We reveal that Transformers may struggle to learn OOD task functions through ICL.
arXiv Detail & Related papers (2024-10-13T02:10:26Z)
- Enhancing In-Context Learning Performance with just SVD-Based Weight Pruning: A Theoretical Perspective [21.361946399192195]
In this paper, we show an exciting phenomenon: SVD-based weight pruning can enhance ICL performance.
We propose a simple, derivative-free model-compression algorithm for enhancing ICL inference on downstream tasks (a minimal sketch of generic truncated-SVD pruning appears after this list).
arXiv Detail & Related papers (2024-06-06T06:15:35Z)
- Quantifying Emergence in Large Language Models [31.608080868988825]
We propose a quantifiable solution for estimating the emergence of LLMs.
Inspired by emergentism in dynamics, we quantify the strength of emergence by comparing the entropy reduction at the macroscopic (semantic) level with that at the microscopic (token) level (see the entropy sketch after this list).
Our method demonstrates consistent behavior across a suite of LMs under both in-context learning (ICL) and natural sentences.
arXiv Detail & Related papers (2024-05-21T09:12:20Z)
- The Strong Pull of Prior Knowledge in Large Language Models and Its Impact on Emotion Recognition [74.04775677110179]
In-context learning (ICL) has emerged as a powerful paradigm for performing natural language tasks with large language models (LLMs).
We show that LLMs have strong yet inconsistent priors in emotion recognition that ossify their predictions.
Our results suggest that caution is needed when using ICL with larger LLMs for affect-centered tasks outside their pre-training domain.
arXiv Detail & Related papers (2024-03-25T19:07:32Z)
- In-context Learning and Gradient Descent Revisited [3.085927389171139]
We show that even untrained models achieve comparable ICL-GD similarity scores despite not exhibiting ICL.
Next, we explore a major discrepancy in the flow of information throughout the model between ICL and GD, which we term Layer Causality.
We propose a simple GD-based optimization procedure that respects layer causality, and show it improves similarity scores significantly.
arXiv Detail & Related papers (2023-11-13T21:42:38Z)
- How Do Transformers Learn In-Context Beyond Simple Functions? A Case Study on Learning with Representations [98.7450564309923]
This paper takes initial steps on understanding in-context learning (ICL) in more complex scenarios, by studying learning with representations.
We construct synthetic in-context learning problems with a compositional structure, where the label depends on the input through a possibly complex but fixed representation function.
We show theoretically that transformers of mild depth and size exist that approximately implement such in-context learning algorithms on these representations.
arXiv Detail & Related papers (2023-10-16T17:40:49Z)
- Do pretrained Transformers Learn In-Context by Gradient Descent? [21.23795112800977]
In this paper, we investigate the emergence of In-Context Learning (ICL) in language models pre-trained on natural data (LLaMa-7B).
We find that ICL and Gradient Descent (GD) modify the output distribution of language models differently.
These results indicate that the equivalence between ICL and GD remains an open hypothesis and calls for further studies.
arXiv Detail & Related papers (2023-10-12T17:32:09Z)
- An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning [70.48605869773814]
Catastrophic forgetting (CF) is a phenomenon that occurs in machine learning when a model forgets previously learned information while acquiring new knowledge.
This study empirically evaluates the forgetting phenomenon in large language models (LLMs) during continual instruction tuning.
arXiv Detail & Related papers (2023-08-17T02:53:23Z)
- What and How does In-Context Learning Learn? Bayesian Model Averaging, Parameterization, and Generalization [111.55277952086155]
We study In-Context Learning (ICL) by addressing several open questions.
We show that, without updating the neural network parameters, ICL implicitly implements the Bayesian model averaging algorithm (the averaging step is written out schematically after this list).
We prove that the error of the pretrained model is bounded by the sum of an approximation error and a generalization error.
arXiv Detail & Related papers (2023-05-30T21:23:47Z)
- Explaining Emergent In-Context Learning as Kernel Regression [61.57151500616111]
Large language models (LLMs) have initiated a paradigm shift in transfer learning.
In this paper, we investigate the reason why a transformer-based language model can accomplish in-context learning after pre-training.
We find that during ICL, the attention and hidden features in LLMs match the behavior of kernel regression (a minimal Nadaraya-Watson sketch appears after this list).
arXiv Detail & Related papers (2023-05-22T06:45:02Z)
- Systematic Evaluation of Causal Discovery in Visual Model Based Reinforcement Learning [76.00395335702572]
A central goal for AI and causality is the joint discovery of abstract representations and causal structure.
Existing environments for studying causal induction are poorly suited for this objective because they have complicated task-specific causal graphs.
In this work, our goal is to facilitate research in learning representations of high-level variables as well as causal structures among them.
arXiv Detail & Related papers (2021-07-02T05:44:56Z)
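Illustrative sketches
For the SVD-based weight pruning entry, here is a minimal sketch of generic truncated-SVD pruning. The matrix size, rank, and where such pruning would be applied are assumptions; the paper's actual algorithm and theory are more involved.

```python
# Generic truncated-SVD weight pruning (a sketch, not the paper's algorithm):
# keep only the top-k singular directions of a weight matrix.
import numpy as np

def svd_prune(W: np.ndarray, k: int) -> np.ndarray:
    """Best rank-k approximation of W (Eckart-Young): drop small singular values."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]

rng = np.random.default_rng(0)
W = rng.standard_normal((768, 768))  # toy stand-in for a transformer weight matrix
W_pruned = svd_prune(W, k=64)
rel_err = np.linalg.norm(W - W_pruned) / np.linalg.norm(W)
print(f"relative reconstruction error at rank 64: {rel_err:.3f}")
```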
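For the emergence-quantification entry, a toy sketch of comparing entropy reduction at the token level versus a coarse-grained semantic level. The distributions and the token-to-class grouping are invented for illustration; the paper's estimator operates on real model outputs.

```python
# Compare entropy reduction at micro (token) vs. macro (semantic) level.
import numpy as np

def entropy(p: np.ndarray) -> float:
    """Shannon entropy in bits, ignoring zero-probability outcomes."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def coarse_grain(p: np.ndarray, groups: list, n_groups: int) -> np.ndarray:
    """Collapse a token-level distribution onto semantic classes."""
    q = np.zeros(n_groups)
    for i, g in enumerate(groups):
        q[g] += p[i]
    return q

# Hypothetical next-token distributions before and after in-context examples.
p_prior = np.array([0.25, 0.25, 0.25, 0.25])
p_post = np.array([0.55, 0.25, 0.15, 0.05])
groups = [0, 0, 1, 1]  # assumed token -> semantic-class mapping

micro = entropy(p_prior) - entropy(p_post)                 # token level
macro = (entropy(coarse_grain(p_prior, groups, 2))
         - entropy(coarse_grain(p_post, groups, 2)))       # semantic level
print(f"token-level entropy reduction:    {micro:.3f} bits")
print(f"semantic-level entropy reduction: {macro:.3f} bits")
# The paper's emergence strength compares reductions like these two.
```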
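For the Bayesian model averaging entry, the claimed averaging step has the schematic form below; the notation here is assumed, not taken from the paper.

```latex
\[
P\big(y \mid x,\, D_{\text{prompt}}\big)
  \;=\; \sum_{\theta \in \Theta} P\big(y \mid x, \theta\big)\,
        P\big(\theta \mid D_{\text{prompt}}\big)
\]
```

Here \(D_{\text{prompt}}\) is the set of in-context demonstrations and \(\Theta\) a family of candidate predictors: the model behaves as if it weights each predictor by its posterior given the prompt, without any parameter update.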
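For the kernel regression entry, a minimal Nadaraya-Watson sketch showing how softmax attention over in-context (key, value) pairs acts as a kernel-weighted average. The dimensions and data are arbitrary; the paper's claim concerns learned attention and hidden features, not random vectors.

```python
# Softmax attention as Nadaraya-Watson kernel regression (a schematic).
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(1)
d = 8
keys = rng.standard_normal((16, d))  # representations of in-context examples
values = rng.standard_normal(16)     # their labels
query = rng.standard_normal(d)       # representation of the test input

# Attention weights double as kernel weights k(query, key_i).
weights = softmax(keys @ query / np.sqrt(d))
prediction = weights @ values        # kernel-weighted average of labels
print(f"kernel-regression prediction: {prediction:.3f}")
```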