In-context Learning and Gradient Descent Revisited
- URL: http://arxiv.org/abs/2311.07772v4
- Date: Sun, 31 Mar 2024 19:33:50 GMT
- Title: In-context Learning and Gradient Descent Revisited
- Authors: Gilad Deutch, Nadav Magar, Tomer Bar Natan, Guy Dar
- Abstract summary: We show that even untrained models achieve comparable ICL-GD similarity scores despite not exhibiting ICL.
Next, we explore a major discrepancy in the flow of information throughout the model between ICL and GD, which we term Layer Causality.
We propose a simple GD-based optimization procedure that respects layer causality, and show it improves similarity scores significantly.
- Score: 3.085927389171139
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In-context learning (ICL) has shown impressive results in few-shot learning tasks, yet its underlying mechanism is still not fully understood. A recent line of work suggests that ICL performs gradient descent (GD)-based optimization implicitly. While appealing, much of the research focuses on simplified settings, where the parameters of a shallow model are optimized. In this work, we revisit evidence for ICL-GD correspondence on realistic NLP tasks and models. We find gaps in evaluation, both in terms of problematic metrics and insufficient baselines. We show that surprisingly, even untrained models achieve comparable ICL-GD similarity scores despite not exhibiting ICL. Next, we explore a major discrepancy in the flow of information throughout the model between ICL and GD, which we term Layer Causality. We propose a simple GD-based optimization procedure that respects layer causality, and show it improves similarity scores significantly.
Related papers
- Technical Debt in In-Context Learning: Diminishing Efficiency in Long Context [13.796664304274643]
We introduce a new framework for quantifying optimality of ICL as a learning algorithm in stylized settings.
Our findings reveal a striking dichotomy: while ICL initially matches the efficiency of a Bayes optimal estimator, its efficiency significantly deteriorates in long context.
These results clarify the trade-offs in adopting ICL as a universal problem solver, motivating a new generation of on-the-fly adaptive methods.
arXiv Detail & Related papers (2025-02-07T00:26:45Z)
- S-LoRA: Scalable Low-Rank Adaptation for Class Incremental Learning [73.93639228235622]
Continual Learning with foundation models has emerged as a promising approach to harnessing the power of pre-trained models for sequential tasks.
We propose a Scalable Low-Rank Adaptation (S-LoRA) method for CL (in particular class incremental learning), which incrementally decouples the learning of the direction and magnitude of LoRA parameters.
Our theoretical and empirical analysis demonstrates that S-LoRA tends to follow a low-loss trajectory that converges to an overlapped low-loss region, resulting in an excellent stability-plasticity trade-off in CL.
arXiv Detail & Related papers (2025-01-22T20:00:41Z)
- Graph Structure Refinement with Energy-based Contrastive Learning [56.957793274727514]
We introduce an unsupervised method based on joint generative and discriminative training to learn graph structure and representation.
We propose an Energy-based Contrastive Learning (ECL) guided Graph Structure Refinement (GSR) framework, denoted as ECL-GSR.
ECL-GSR achieves faster training with fewer samples and less memory than the leading baseline, highlighting its simplicity and efficiency in downstream tasks.
arXiv Detail & Related papers (2024-12-20T04:05:09Z)
- Deeper Insights Without Updates: The Power of In-Context Learning Over Fine-Tuning [22.341935761925892]
Fine-tuning and in-context learning (ICL) are two prevalent methods for imbuing large language models with task-specific knowledge.
This paper presents a counterintuitive finding: For tasks with implicit patterns, ICL captures these patterns significantly better than fine-tuning.
arXiv Detail & Related papers (2024-10-07T02:12:22Z)
- Surgical Feature-Space Decomposition of LLMs: Why, When and How? [8.826164604720738]
We empirically study the efficacy of weight and feature space decomposition in transformer-based language models.
We show that surgical decomposition provides critical insights into the trade-off between compression and language modelling performance.
We extend our investigation to the implications of low-rank approximations on model bias.
arXiv Detail & Related papers (2024-05-17T07:34:03Z)
- On Task Performance and Model Calibration with Supervised and Self-Ensembled In-Context Learning [71.44986275228747]
In-context learning (ICL) has become an efficient approach propelled by the recent advancements in large language models (LLMs).
However, both paradigms are prone to suffer from the critical problem of overconfidence (i.e., miscalibration).
arXiv Detail & Related papers (2023-12-21T11:55:10Z)
- Learning Deep Representations via Contrastive Learning for Instance Retrieval [11.736450745549792]
This paper makes the first attempt to tackle the problem using instance-discrimination based contrastive learning (CL).
In this work, we approach this problem by exploring the capability of deriving discriminative representations from pre-trained and fine-tuned CL models.
arXiv Detail & Related papers (2022-09-28T04:36:34Z)
- Zero-Shot Temporal Action Detection via Vision-Language Prompting [134.26292288193298]
We propose a novel zero-Shot Temporal Action detection model via Vision-LanguagE prompting (STALE).
Our model significantly outperforms state-of-the-art alternatives.
Our model also yields superior results on supervised TAD over recent strong competitors.
arXiv Detail & Related papers (2022-07-17T13:59:46Z)
- Interventional Contrastive Learning with Meta Semantic Regularizer [28.708395209321846]
Contrastive learning (CL)-based self-supervised learning models learn visual representations in a pairwise manner.
When the CL model is trained with full images, the performance tested in full images is better than that in foreground areas.
When the CL model is trained with foreground areas, the performance tested in full images is worse than that in foreground areas.
arXiv Detail & Related papers (2022-06-29T15:02:38Z)
- Using Representation Expressiveness and Learnability to Evaluate Self-Supervised Learning Methods [61.49061000562676]
We introduce Cluster Learnability (CL) to assess learnability.
CL is measured in terms of the performance of a KNN trained to predict labels obtained by clustering the representations with K-means.
We find that CL better correlates with in-distribution model performance than other competing recent evaluation schemes.
arXiv Detail & Related papers (2022-06-02T19:05:13Z)
- Toward Fast, Flexible, and Robust Low-Light Image Enhancement [87.27326390675155]
We develop a new Self-Calibrated Illumination (SCI) learning framework for fast, flexible, and robust brightening of images in real-world low-light scenarios.
Considering the computational burden of the cascaded pattern, we construct the self-calibrated module which realizes the convergence between results of each stage.
We comprehensively explore SCI's inherent properties, including operation-insensitive adaptability and model-irrelevant generality.
arXiv Detail & Related papers (2022-04-21T14:40:32Z)