Can Gradient Descent Simulate Prompting?
- URL: http://arxiv.org/abs/2506.20989v1
- Date: Thu, 26 Jun 2025 04:06:20 GMT
- Title: Can Gradient Descent Simulate Prompting?
- Authors: Eric Zhang, Leshem Choshen, Jacob Andreas
- Abstract summary: A meta-training method makes gradient updates emulate the effects of conditioning on new information. Subsequent gradient descent training recovers some (and occasionally all) of prompted model performance. The results suggest new avenues for long-context modeling.
- Score: 56.60154660021178
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: There are two primary ways of incorporating new information into a language model (LM): changing its prompt or changing its parameters, e.g. via fine-tuning. Parameter updates incur no long-term storage cost for model changes. However, for many model updates, prompting is significantly more effective: prompted models can generalize robustly from single examples and draw logical inferences that do not occur under standard fine-tuning. Can models be modified so that fine-tuning does emulate prompting? This paper describes a method for meta-training LMs such that gradient updates emulate the effects of conditioning on new information. Our approach uses tools from gradient-based meta-learning but uses an LM's own prompted predictions as targets, eliminating the need for ground-truth labels. Subsequent gradient descent training recovers some (and occasionally all) of prompted model performance -- showing improvement on "reversal curse" tasks, and answering questions about text passages after a single gradient update. These results suggest that, with appropriate initialization, gradient descent can be surprisingly expressive. Our results suggest new avenues for long-context modeling and offer insight into the generalization capabilities of gradient-based learning.
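The meta-objective sketched in the abstract (inner-loop gradient step on raw new information, outer loss matching the prompted model's predictions) can be illustrated with a deliberately simplified, first-order (MAML-style) toy. Everything here is an assumption for illustration, not the paper's implementation: the model is a linear softmax head over learned entity embeddings, the "prompted teacher" is simulated with one-hot targets, the reversed query stands in for the reversal-curse evaluation, and all hyperparameters are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

V, D = 8, 16                      # toy vocabulary (entities) and embedding dim
E = rng.normal(0.0, 0.1, (V, D))  # entity embeddings (meta-learned)
W = rng.normal(0.0, 0.1, (V, D))  # output head (meta-learned and inner-updated)

facts = [(0, 1), (2, 3), (4, 5)]  # each fact: entity a "maps to" entity b

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

inner_lr, meta_lr = 0.5, 0.2

def inner_update(W, a, b):
    """One gradient step of cross-entropy on the raw fact (a -> b)."""
    p = softmax(W @ E[a])
    grad = np.outer(p - np.eye(V)[b], E[a])  # dCE/dW for a softmax head
    return W - inner_lr * grad

losses = []
for step in range(300):
    gW, gE, total = np.zeros_like(W), np.zeros_like(E), 0.0
    for a, b in facts:
        W1 = inner_update(W, a, b)
        # Outer loss: after the inner step, the model should match the
        # "prompted" (conditioned) behaviour on both the forward query
        # (a -> b) and the reversed query (b -> a); the prompted teacher
        # is simulated here by one-hot targets.
        for q, t in [(a, b), (b, a)]:
            p1 = softmax(W1 @ E[q])
            total += -np.log(p1[t] + 1e-12)
            d = p1 - np.eye(V)[t]      # dCE/dlogits at the updated weights
            gW += np.outer(d, E[q])    # first-order: treat dW1/dW as identity
            gE[q] += W1.T @ d
    W -= meta_lr * gW / len(facts)
    E -= meta_lr * gE / len(facts)
    losses.append(total / (2 * len(facts)))

print(f"meta-loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

The first-order approximation (ignoring how the inner gradient itself depends on the meta-parameters) keeps the sketch short; the paper's setting would instead backpropagate through the inner update of a full LM and use the LM's own prompted distribution, rather than one-hot targets, as the teacher.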
Related papers
- A mean teacher algorithm for unlearning of language models [5.384630221560811]
We show that the mean teacher algorithm can approximate a trajectory of a slow natural gradient descent.
While slow NGD can suffer from vanishing gradients, we introduce a new unlearning loss called "negative log-unlikelihood" (NLUL) that avoids this problem.
arXiv Detail & Related papers (2025-04-18T00:34:19Z) - ProSG: Using Prompt Synthetic Gradients to Alleviate Prompt Forgetting of RNN-like Language Models [0.0]
We propose an architecture to teach the model memorizing prompt during generation by synthetic gradient.
We construct a dataset for experiments, and the results have demonstrated the effectiveness of our method.
arXiv Detail & Related papers (2023-11-03T15:34:02Z) - Meta-Learning Online Adaptation of Language Models [88.8947656843812]
Large language models encode impressively broad world knowledge in their parameters.
However, the knowledge in static language models falls out of date, limiting the model's effective "shelf life".
arXiv Detail & Related papers (2023-05-24T11:56:20Z) - Gradient-Regulated Meta-Prompt Learning for Generalizable Vision-Language Models [137.74524357614285]
We introduce a novel Gradient-RegulAted Meta-prompt learning framework.
It helps pre-trained models adapt to downstream tasks in a parameter- and data-efficient way.
GRAM can be easily incorporated into various prompt tuning methods in a model-agnostic way.
arXiv Detail & Related papers (2023-03-12T05:03:37Z) - Why Can GPT Learn In-Context? Language Models Implicitly Perform Gradient Descent as Meta-Optimizers [93.9369467909176]
We explain language models as meta-optimizers and understand in-context learning as implicit finetuning.
We show that in-context learning behaves similarly to explicit finetuning from multiple perspectives.
The improved performance over vanilla attention further supports our understanding from another perspective.
arXiv Detail & Related papers (2022-12-20T18:58:48Z) - Meta-Learning Fast Weight Language Models [105.66999854213724]
We present Fast Weight Layers (FWLs), a neural component that provides the benefits of dynamic evaluation much more efficiently.
FWLs can be applied at training time so the model learns to make good use of gradient updates.
arXiv Detail & Related papers (2022-12-05T18:37:09Z) - Adapting Stepsizes by Momentumized Gradients Improves Optimization and
Generalization [89.66571637204012]
The proposed optimizer, AdaMomentum, performs well on vision tasks and consistently achieves state-of-the-art results on other tasks including language processing.
arXiv Detail & Related papers (2021-06-22T03:13:23Z) - Feed-Forward On-Edge Fine-tuning Using Static Synthetic Gradient Modules [35.92284329679786]
Training deep learning models on embedded devices is typically avoided since this requires more memory, computation and power over inference.
In this work, we focus on lowering the amount of memory needed for storing all activations, which are required during the backward pass to compute the gradients.
We show that our method achieves results comparable to standard backpropagation.
arXiv Detail & Related papers (2020-09-21T08:27:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.