In-context Learning and Gradient Descent Revisited
- URL: http://arxiv.org/abs/2311.07772v4
- Date: Sun, 31 Mar 2024 19:33:50 GMT
- Title: In-context Learning and Gradient Descent Revisited
- Authors: Gilad Deutch, Nadav Magar, Tomer Bar Natan, Guy Dar
- Abstract summary: We show that even untrained models achieve comparable ICL-GD similarity scores despite not exhibiting ICL.
Next, we explore a major discrepancy in the flow of information throughout the model between ICL and GD, which we term Layer Causality.
We propose a simple GD-based optimization procedure that respects layer causality, and show it improves similarity scores significantly.
- Score: 3.085927389171139
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In-context learning (ICL) has shown impressive results in few-shot learning tasks, yet its underlying mechanism is still not fully understood. A recent line of work suggests that ICL performs gradient descent (GD)-based optimization implicitly. While appealing, much of the research focuses on simplified settings, where the parameters of a shallow model are optimized. In this work, we revisit evidence for ICL-GD correspondence on realistic NLP tasks and models. We find gaps in evaluation, both in terms of problematic metrics and insufficient baselines. We show that surprisingly, even untrained models achieve comparable ICL-GD similarity scores despite not exhibiting ICL. Next, we explore a major discrepancy in the flow of information throughout the model between ICL and GD, which we term Layer Causality. We propose a simple GD-based optimization procedure that respects layer causality, and show it improves similarity scores significantly.
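As a rough illustration of what such a layer-causal procedure could look like (the paper's exact algorithm may differ), the following minimal PyTorch sketch updates transformer blocks one at a time from the bottom up, so that each block's gradient is computed from hidden states that already reflect the updates applied to earlier blocks. The `model.blocks`, `batch`, and `loss_fn` interfaces here are hypothetical.

```python
import torch

def layer_causal_gd_step(model, batch, loss_fn, lr=1e-4):
    """Update transformer blocks one at a time, bottom-up, so each block's
    gradient is computed from hidden states that already reflect the updates
    applied to earlier blocks (hypothetical `model.blocks` interface)."""
    for block in model.blocks:  # assumed iterable of blocks, bottom to top
        # Only the current block receives gradients.
        for p in model.parameters():
            p.requires_grad_(False)
        for p in block.parameters():
            p.requires_grad_(True)

        logits = model(batch["input_ids"])       # fresh forward pass per block
        loss = loss_fn(logits, batch["labels"])
        grads = torch.autograd.grad(loss, list(block.parameters()))

        with torch.no_grad():                    # plain SGD step on this block only
            for p, g in zip(block.parameters(), grads):
                p -= lr * g
```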
Related papers
- Deeper Insights Without Updates: The Power of In-Context Learning Over Fine-Tuning [22.341935761925892]
Fine-tuning and in-context learning (ICL) are two prevalent methods in imbuing large language models with task-specific knowledge.
This paper presents a counterintuitive finding: For tasks with implicit patterns, ICL captures these patterns significantly better than fine-tuning.
arXiv Detail & Related papers (2024-10-07T02:12:22Z)
- L^2CL: Embarrassingly Simple Layer-to-Layer Contrastive Learning for Graph Collaborative Filtering [33.165094795515785]
Graph neural networks (GNNs) have recently emerged as an effective approach to model neighborhood signals in collaborative filtering.
We propose L2CL, a principled Layer-to-Layer Contrastive Learning framework that contrasts representations from different layers.
We find that L2CL, using only one-hop contrastive learning paradigm, is able to capture intrinsic semantic structures and improve the quality of node representation.
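For illustration only (not necessarily the paper's exact objective), a layer-to-layer contrastive term can be written as an InfoNCE loss that treats each node's embeddings at two adjacent GNN layers as a positive pair; `z0` and `z1` below stand for hypothetical layer-0 and layer-1 node embedding matrices.

```python
import torch
import torch.nn.functional as F

def layer_to_layer_infonce(z0, z1, temperature=0.2):
    """Treat each node's layer-0 / layer-1 embeddings as a positive pair and
    all other nodes' layer-1 embeddings as negatives (illustrative sketch)."""
    z0 = F.normalize(z0, dim=-1)
    z1 = F.normalize(z1, dim=-1)
    logits = z0 @ z1.t() / temperature                    # [N, N] cosine similarities
    targets = torch.arange(z0.size(0), device=z0.device)  # positives on the diagonal
    return F.cross_entropy(logits, targets)
```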
arXiv Detail & Related papers (2024-07-19T12:45:21Z)
- Surgical Feature-Space Decomposition of LLMs: Why, When and How? [8.826164604720738]
We empirically study the efficacy of weight and feature space decomposition in transformer-based language models.
We show that surgical decomposition provides critical insights into the trade-off between compression and language modelling performance.
We extend our investigation to the implications of low-rank approximations on model bias.
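For context, the kind of weight-space decomposition studied here can be illustrated with a generic truncated-SVD low-rank approximation of a single weight matrix; this is a sketch of the underlying linear algebra, not the paper's specific surgical procedure.

```python
import torch

def low_rank_approx(weight: torch.Tensor, rank: int) -> torch.Tensor:
    """Best rank-`rank` approximation (in Frobenius norm) via truncated SVD."""
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    return U[:, :rank] @ torch.diag(S[:rank]) @ Vh[:rank, :]
```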
arXiv Detail & Related papers (2024-05-17T07:34:03Z)
- On Task Performance and Model Calibration with Supervised and Self-Ensembled In-Context Learning [71.44986275228747]
In-context learning (ICL) has become an efficient approach propelled by recent advancements in large language models (LLMs).
However, both paradigms are prone to the critical problem of overconfidence (i.e., miscalibration).
arXiv Detail & Related papers (2023-12-21T11:55:10Z)
- Gradient constrained sharpness-aware prompt learning for vision-language models [99.74832984957025]
This paper targets a novel trade-off problem in generalizable prompt learning for vision-language models (VLMs).
By analyzing the loss landscapes of the state-of-the-art method and a vanilla Sharpness-aware Minimization (SAM) based method, we conclude that trade-off performance correlates with both loss value and loss sharpness.
We propose a novel SAM-based method for prompt learning, denoted Gradient Constrained Sharpness-aware Context Optimization (GCSCoOp).
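For reference, one step of vanilla SAM (the baseline analyzed here, not the proposed GCSCoOp variant) can be sketched as follows; `loss_fn` is assumed to be a closure that recomputes the loss on the current mini-batch, and `params` are the trainable parameters registered with `optimizer`.

```python
import torch

def sam_step(params, loss_fn, optimizer, rho=0.05):
    """One vanilla SAM update: perturb weights toward the locally sharpest
    point, take the gradient there, then apply it at the original weights."""
    params = [p for p in params if p.requires_grad]
    loss_fn().backward()                                    # gradients at the current weights
    grad_norm = torch.norm(torch.stack(
        [p.grad.norm(2) for p in params if p.grad is not None]), 2)
    eps = []
    with torch.no_grad():
        for p in params:
            e = (rho * p.grad / (grad_norm + 1e-12)
                 if p.grad is not None else torch.zeros_like(p))
            p.add_(e)                                       # climb to w + e
            eps.append(e)
    optimizer.zero_grad()
    loss_fn().backward()                                    # sharpness-aware gradient at w + e
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.sub_(e)                                       # restore the original weights
    optimizer.step()                                        # descend with the gradient from w + e
```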
arXiv Detail & Related papers (2023-09-14T17:13:54Z)
- Learning Deep Representations via Contrastive Learning for Instance Retrieval [11.736450745549792]
This paper makes the first attempt to tackle the problem using instance-discrimination-based contrastive learning (CL).
In this work, we approach this problem by exploring the capability of deriving discriminative representations from pre-trained and fine-tuned CL models.
arXiv Detail & Related papers (2022-09-28T04:36:34Z)
- Zero-Shot Temporal Action Detection via Vision-Language Prompting [134.26292288193298]
We propose a novel zero-Shot Temporal Action detection model via Vision-LanguagE prompting (STALE).
Our model significantly outperforms state-of-the-art alternatives.
Our model also yields superior results on supervised TAD over recent strong competitors.
arXiv Detail & Related papers (2022-07-17T13:59:46Z)
- Interventional Contrastive Learning with Meta Semantic Regularizer [28.708395209321846]
Contrastive learning (CL)-based self-supervised learning models learn visual representations in a pairwise manner.
When the CL model is trained on full images, performance tested on full images is better than on foreground areas.
When the CL model is trained on foreground areas, performance tested on full images is worse than on foreground areas.
arXiv Detail & Related papers (2022-06-29T15:02:38Z)
- Using Representation Expressiveness and Learnability to Evaluate Self-Supervised Learning Methods [61.49061000562676]
We introduce Cluster Learnability (CL) to assess learnability.
CL is measured in terms of the performance of a KNN trained to predict labels obtained by clustering the representations with K-means.
We find that CL better correlates with in-distribution model performance than other competing recent evaluation schemes.
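Concretely, that metric can be sketched as follows; the clustering granularity, train/test split, and KNN settings below are placeholder choices, not necessarily those used in the paper.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def cluster_learnability(reps: np.ndarray, n_clusters: int = 10,
                         n_neighbors: int = 5, seed: int = 0) -> float:
    """Pseudo-label the representations with K-means, then score a held-out
    KNN on recovering those labels; higher accuracy = more learnable representation."""
    pseudo_labels = KMeans(n_clusters=n_clusters, n_init=10,
                           random_state=seed).fit_predict(reps)
    x_tr, x_te, y_tr, y_te = train_test_split(reps, pseudo_labels,
                                              test_size=0.2, random_state=seed)
    knn = KNeighborsClassifier(n_neighbors=n_neighbors).fit(x_tr, y_tr)
    return knn.score(x_te, y_te)
```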
arXiv Detail & Related papers (2022-06-02T19:05:13Z)
- Toward Fast, Flexible, and Robust Low-Light Image Enhancement [87.27326390675155]
We develop a new Self-Calibrated Illumination (SCI) learning framework for fast, flexible, and robust brightening of images in real-world low-light scenarios.
Considering the computational burden of the cascaded pattern, we construct a self-calibrated module that realizes convergence between the results of each stage.
We make comprehensive explorations into SCI's inherent properties, including operation-insensitive adaptability and model-irrelevant generality.
arXiv Detail & Related papers (2022-04-21T14:40:32Z)
- Reparameterized Variational Divergence Minimization for Stable Imitation [57.06909373038396]
We study the extent to which variations in the choice of probabilistic divergence may yield more performant imitation learning from observation (ILO) algorithms.
We contribute a reparameterization trick for adversarial imitation learning to alleviate the challenges of the promising $f$-divergence minimization framework.
Empirically, we demonstrate that our design choices allow for ILO algorithms that outperform baseline approaches and more closely match expert performance in low-dimensional continuous-control tasks.
arXiv Detail & Related papers (2020-06-18T19:04:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.