Why Can GPT Learn In-Context? Language Models Implicitly Perform Gradient Descent as Meta-Optimizers
- URL: http://arxiv.org/abs/2212.10559v3
- Date: Mon, 15 May 2023 11:45:12 GMT
- Title: Why Can GPT Learn In-Context? Language Models Implicitly Perform Gradient Descent as Meta-Optimizers
- Authors: Damai Dai, Yutao Sun, Li Dong, Yaru Hao, Shuming Ma, Zhifang Sui, Furu Wei
- Abstract summary: We explain language models as meta-optimizers and understand in-context learning as implicit finetuning.
We show that in-context learning behaves similarly to explicit finetuning from multiple perspectives.
The improved performance over vanilla attention further supports our understanding from another perspective.
- Score: 93.9369467909176
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large pretrained language models have shown surprising in-context learning (ICL) ability. With a few demonstration input-label pairs, they can predict the label for an unseen input without parameter updates. Despite its great success in performance, the working mechanism of ICL remains an open question. In this paper, we explain language models as meta-optimizers and understand in-context learning as implicit finetuning. Theoretically, we show that Transformer attention has a dual form of gradient descent. Building on this, we understand ICL as follows: GPT first produces meta-gradients according to the demonstration examples, and then these meta-gradients are applied to the original GPT to build an ICL model. We comprehensively compare the behaviors of in-context learning and explicit finetuning on real tasks to provide empirical evidence that supports our understanding. Experimental results show that in-context learning behaves similarly to explicit finetuning from multiple perspectives. Inspired by the dual form between Transformer attention and gradient descent, we design a momentum-based attention by analogy with gradient descent with momentum. Its improved performance over vanilla attention further supports our understanding from another perspective and, more importantly, shows the potential to use our understanding for future model design. The code is available at https://aka.ms/icl.
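To make the dual form concrete: after the paper's relaxation of softmax attention to linear attention, attending over the demonstration tokens is exactly equivalent to applying a sum of outer-product updates, the shape of a gradient-descent step on a linear layer, to the query. A minimal numpy sketch of this identity (variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8          # head dimension
n = 5          # number of demonstration tokens

# Keys and values computed from the demonstration examples.
K = rng.normal(size=(d, n))   # each column k_i = W_K x_i
V = rng.normal(size=(d, n))   # each column v_i = W_V x_i
q = rng.normal(size=(d,))     # query vector for the test input

# View 1: linear (unnormalized) attention over the demonstrations.
attn_out = V @ (K.T @ q)

# View 2: the same computation as an implicit weight update.
# Each demonstration contributes an outer-product "meta-gradient"
# v_i k_i^T, exactly the shape of a gradient-descent update to a
# linear layer; summing them gives delta_W, which is applied to q.
delta_W = sum(np.outer(V[:, i], K[:, i]) for i in range(n))
dual_out = delta_W @ q

assert np.allclose(attn_out, dual_out)
print("linear attention == implicit gradient-descent update applied to q")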
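By the same analogy, the momentum-based attention can be sketched as standard attention plus a term accumulated from past value vectors, mirroring the velocity term in momentum SGD. The EMA formulation and the coefficients eta and beta below are illustrative assumptions, not the paper's exact design:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def momentum_attention(q, K, V, eta=0.1, beta=0.9):
    """Single-query attention plus a momentum-style term.

    By analogy with momentum SGD, past value vectors (the
    "meta-gradients" in the dual view) are accumulated into an
    exponential moving average and added to the attention output.
    eta and beta are illustrative hyperparameters, not from the paper.
    """
    d, n = K.shape
    # Standard scaled dot-product attention over n demonstration tokens.
    attn = V @ softmax(K.T @ q / np.sqrt(d))

    # Momentum term: EMA over the value vectors in sequence order.
    m = np.zeros(d)
    for i in range(n):
        m = beta * m + (1.0 - beta) * V[:, i]
    return attn + eta * m

rng = np.random.default_rng(0)
d, n = 8, 5
K, V = rng.normal(size=(d, n)), rng.normal(size=(d, n))
q = rng.normal(size=(d,))
print(momentum_attention(q, K, V))
```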
Related papers
- Deeper Insights Without Updates: The Power of In-Context Learning Over Fine-Tuning [22.341935761925892] (2024-10-07)
Fine-tuning and in-context learning (ICL) are two prevalent methods for imbuing large language models with task-specific knowledge.
This paper presents a counterintuitive finding: For tasks with implicit patterns, ICL captures these patterns significantly better than fine-tuning.
- Enhancing In-Context Learning Performance with just SVD-Based Weight Pruning: A Theoretical Perspective [21.361946399192195] (2024-06-06)
In this paper, we show a surprising phenomenon: SVD-based weight pruning can enhance ICL performance.
We propose a simple, derivative-free, model-compression algorithm for enhancing ICL inference on downstream tasks.
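To give a concrete picture of what SVD-based weight pruning means, here is a minimal numpy sketch that replaces a weight matrix with a low-rank truncated-SVD approximation; the rank, matrix, and function name are illustrative, and the paper's actual layer-selection algorithm is not reproduced:

```python
import numpy as np

def svd_prune(W, rank):
    """Replace W with its best rank-`rank` approximation via truncated SVD."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank, :]

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 16))     # stand-in for a transformer weight matrix
W_pruned = svd_prune(W, rank=4)   # keep only the top 4 singular directions
err = np.linalg.norm(W - W_pruned) / np.linalg.norm(W)
print(f"relative reconstruction error: {err:.3f}")
```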
- Roles of Scaling and Instruction Tuning in Language Perception: Model vs. Human Attention [58.817405319722596] (2023-10-29)
This work compares the self-attention of several large language models (LLMs) of different sizes to assess the effect of scaling and instruction tuning on language perception.
Results show that scaling makes attention more human-like and improves effective attention by reducing reliance on trivial patterns, whereas instruction tuning does not.
We also find that current LLMs are consistently closer to non-native than to native speakers in attention, suggesting sub-optimal language perception in all models.
- Context-Aware Meta-Learning [52.09326317432577] (2023-10-17)
We propose a meta-learning algorithm that emulates Large Language Models by learning new visual concepts during inference without fine-tuning.
Our approach exceeds or matches the state-of-the-art algorithm, P>M>F, on 8 out of 11 meta-learning benchmarks.
- Towards Foundation Models for Knowledge Graph Reasoning [18.77355708537997] (2023-10-06)
Knowledge graphs (KGs) have different entity and relation vocabularies that generally do not overlap.
We present ULTRA, an approach for learning universal and transferable graph representations.
We find that the zero-shot inductive inference performance of a single pre-trained ULTRA model on unseen graphs of various sizes is often on par with or better than strong baselines trained on specific graphs.
- Explaining Emergent In-Context Learning as Kernel Regression [61.57151500616111] (2023-05-22)
Large language models (LLMs) have initiated a paradigm shift in transfer learning.
In this paper, we investigate the reason why a transformer-based language model can accomplish in-context learning after pre-training.
We find that during ICL, the attention and hidden features in LLMs match the behavior of kernel regression.
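The correspondence can be made concrete: a single softmax-attention readout has the same form as a Nadaraya-Watson kernel regression estimate with an exponential kernel over the demonstrations. A minimal sketch under that reading (the kernel choice and variable names are mine, not from the paper):

```python
import numpy as np

def kernel_regression(q, keys, values, temperature=1.0):
    """Nadaraya-Watson estimator with an exponential (softmax) kernel.

    Prediction = sum_i w_i * values_i, where the weights w_i are the
    normalized kernel similarities k(q, keys_i) -- exactly the form
    of a single softmax-attention readout.
    """
    scores = keys @ q / temperature      # kernel similarities
    w = np.exp(scores - scores.max())
    w /= w.sum()                         # normalized weights
    return w @ values

rng = np.random.default_rng(0)
keys = rng.normal(size=(5, 8))     # one row per demonstration
values = rng.normal(size=(5,))     # labels of the demonstrations
q = rng.normal(size=(8,))          # representation of the test input
print(kernel_regression(q, keys, values))
```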
- A Message Passing Perspective on Learning Dynamics of Contrastive Learning [60.217972614379065] (2023-03-08)
We show that if we cast a contrastive objective equivalently into the feature space, then its learning dynamics admits an interpretable form.
This perspective also establishes an intriguing connection between contrastive learning and Message Passing Graph Neural Networks (MP-GNNs).
- Larger language models do in-context learning differently [93.90674531127559] (2023-03-07)
In-context learning (ICL) in language models is affected by semantic priors versus input-label mappings.
We investigate two setups: ICL with flipped labels and ICL with semantically unrelated labels.
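The flipped-label setup is easy to picture: demonstrations are presented with intentionally inverted labels, and the question is whether the model follows the in-context input-label mapping or its semantic priors. A minimal prompt-construction sketch with hypothetical sentiment examples (not from the paper):

```python
# Hypothetical sentiment examples; labels flipped to test whether the
# model follows in-context input-label mappings over semantic priors.
demos = [
    ("the movie was wonderful", "negative"),  # true label: positive
    ("the plot was a mess", "positive"),      # true label: negative
]
test_input = "a thoroughly enjoyable film"

prompt = "".join(f"Review: {x}\nSentiment: {y}\n\n" for x, y in demos)
prompt += f"Review: {test_input}\nSentiment:"
print(prompt)
```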
This list is automatically generated from the titles and abstracts of the papers on this site.