Linear-Time Demonstration Selection for In-Context Learning via Gradient Estimation
- URL: http://arxiv.org/abs/2508.19999v1
- Date: Wed, 27 Aug 2025 15:59:47 GMT
- Title: Linear-Time Demonstration Selection for In-Context Learning via Gradient Estimation
- Authors: Ziniu Zhang, Zhenshuo Zhang, Dongyue Li, Lu Wang, Jennifer Dy, Hongyang R. Zhang,
- Abstract summary: Given a set of $n$ examples, how can we quickly select $k$ out of $n$ to best serve as the conditioning for downstream inference? This problem has broad applications in prompt tuning and chain-of-thought reasoning. We show that the gradient estimation procedure yields approximations of full inference with less than $\mathbf{1}\%$ error across six datasets.
- Score: 19.158395403281734
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper introduces an algorithm to select demonstration examples for in-context learning of a query set. Given a set of $n$ examples, how can we quickly select $k$ out of $n$ to best serve as the conditioning for downstream inference? This problem has broad applications in prompt tuning and chain-of-thought reasoning. Since model weights remain fixed during in-context learning, previous work has sought to design methods based on the similarity of token embeddings. This work proposes a new approach based on gradients of the output taken in the input embedding space. Our approach estimates model outputs through a first-order approximation using the gradients. Then, we apply this estimation to multiple randomly sampled subsets. Finally, we aggregate the sampled subset outcomes to form an influence score for each demonstration, and select $k$ most relevant examples. This procedure only requires pre-computing model outputs and gradients once, resulting in a linear-time algorithm relative to model and training set sizes. Extensive experiments across various models and datasets validate the efficiency of our approach. We show that the gradient estimation procedure yields approximations of full inference with less than $\mathbf{1}\%$ error across six datasets. This allows us to scale up subset selection that would otherwise run full inference by up to $\mathbf{37.7}\times$ on models with up to $34$ billion parameters, and outperform existing selection methods based on input embeddings by $\mathbf{11}\%$ on average.
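The abstract's pipeline (precompute outputs and gradients once, estimate subset outcomes via a first-order Taylor expansion in the input embedding space, then aggregate over random subsets into per-example influence scores) can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the scalar output, random stand-in embeddings, and the mean-embedding perturbation are all simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d, k = 20, 8, 4                 # pool size, embedding dim, demos to select
num_subsets, subset_size = 200, 5  # randomly sampled candidate subsets

# Precomputed once (stand-ins here): per-example input embeddings, a base
# model output f0, and the gradient g of the output w.r.t. the input
# embedding space.
emb = rng.normal(size=(n, d))
f0 = 0.0
g = rng.normal(size=d)

def estimate_output(subset):
    """First-order Taylor estimate of the model output when conditioning
    on `subset` (mean embedding used as the perturbation direction)."""
    delta = emb[subset].mean(axis=0)
    return f0 + g @ delta

# Score each sampled subset with the cheap estimate and credit every
# member example with the subset's estimated outcome.
scores = np.zeros(n)
counts = np.zeros(n)
for _ in range(num_subsets):
    subset = rng.choice(n, size=subset_size, replace=False)
    est = estimate_output(subset)
    scores[subset] += est
    counts[subset] += 1

influence = scores / np.maximum(counts, 1)
selected = np.argsort(influence)[-k:]  # k most influential demonstrations
```

Because the model is queried only once for outputs and gradients, the per-subset cost is a dot product rather than a forward pass, which is where the claimed linear-time scaling comes from.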
Related papers
- MPRU: Modular Projection-Redistribution Unlearning as Output Filter for Classification Pipelines [23.370444162993707]
We propose an *inductive* approach to machine unlearning (MU). Unlearning is done by reversing the last training sequence, implemented by appending a projection-redistribution layer at the end of the model. Experimental results show output consistently similar to that of a fully retrained model, at a greatly reduced computational cost.
arXiv Detail & Related papers (2025-10-30T08:09:37Z) - Catch Your Breath: Adaptive Computation for Self-Paced Sequence Production [55.76222360698305]
We explore a class of supervised training objectives that allow a language model to dynamically and autonomously scale the number of compute steps used for each input token. For any token, the model can request additional compute steps by emitting a `<don't know>` output. We find that the CYB model requests additional steps when doing so improves accuracy, and that it adapts its processing time to token-level complexity and context.
arXiv Detail & Related papers (2025-10-13T21:07:05Z) - DISCO: Diversifying Sample Condensation for Efficient Model Evaluation [59.01400190971061]
Costly evaluation reduces inclusivity, slows the cycle of innovation, and worsens environmental impact. We argue that promoting diversity among samples is not essential; what matters is to select samples that maximise diversity in model responses. Our method, Diversifying Sample Condensation (DISCO), selects the top-$k$ samples with the greatest model disagreements.
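The DISCO summary above describes a simple selection rule: score each candidate sample by how much a panel of models disagrees on it, then keep the top-k. A hedged sketch of that idea, with randomly generated stand-in predictions and a pairwise-disagreement score that is an illustrative choice rather than the paper's exact metric:

```python
import numpy as np

rng = np.random.default_rng(1)

num_models, num_samples, k = 5, 100, 10
# Stand-in: class predictions of each model on each candidate sample.
preds = rng.integers(0, 4, size=(num_models, num_samples))

def disagreement(col):
    """Fraction of ordered model pairs that disagree on this sample."""
    m = len(col)
    same = sum((col == c).sum() * ((col == c).sum() - 1) for c in np.unique(col))
    return 1.0 - same / (m * (m - 1))

scores = np.array([disagreement(preds[:, j]) for j in range(num_samples)])
condensed = np.argsort(scores)[-k:]  # top-k most contested samples
```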
arXiv Detail & Related papers (2025-10-09T08:53:59Z) - Revisiting Score Function Estimators for $k$-Subset Sampling [5.464421236280698]
We show how to efficiently compute the $k$-subset distribution's score function using a discrete Fourier transform.
The resulting estimator provides both exact samples and unbiased gradient estimates.
Experiments in feature selection show results competitive with current methods, despite weaker assumptions.
arXiv Detail & Related papers (2024-07-22T21:26:39Z) - Data-Efficient Learning via Clustering-Based Sensitivity Sampling: Foundation Models and Beyond [28.651041302245538]
We present a new data selection approach based on $k$-means clustering and sampling sensitivity.
We show how it can be applied to linear regression, leading to a new sampling strategy that surprisingly matches the performance of leverage score sampling.
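The clustering-based sensitivity sampling described above can be sketched as: cluster the data with $k$-means, then sample points with probability driven by their distance to the assigned center (a common sensitivity proxy), mixed with a uniform term. The mixing weights and the distance-based proxy here are illustrative assumptions, not the paper's exact scheme.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 2))
num_clusters, sample_size = 5, 30

# Plain Lloyd's k-means (a few iterations suffice for a sketch).
centers = X[rng.choice(len(X), num_clusters, replace=False)]
for _ in range(10):
    assign = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
    centers = np.array([X[assign == c].mean(axis=0) if (assign == c).any()
                        else centers[c] for c in range(num_clusters)])

# Sensitivity proxy: squared distance to the assigned center, mixed with
# a uniform term so no point has zero sampling probability.
dist = ((X - centers[assign]) ** 2).sum(-1)
prob = 0.5 * dist / dist.sum() + 0.5 / len(X)
idx = rng.choice(len(X), size=sample_size, replace=False, p=prob)
weights = 1.0 / (len(X) * prob[idx])  # importance weights for unbiasedness
```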
arXiv Detail & Related papers (2024-02-27T09:03:43Z) - An Efficient Rehearsal Scheme for Catastrophic Forgetting Mitigation during Multi-stage Fine-tuning [55.467047686093025]
A common approach to alleviating such forgetting is to rehearse samples from prior tasks during fine-tuning. We propose a sampling scheme, mix-cd, that prioritizes rehearsal of "collateral damage" samples. Our approach is computationally efficient, easy to implement, and outperforms several leading continual learning methods in compute-constrained settings.
arXiv Detail & Related papers (2024-02-12T22:32:12Z) - Testable Learning with Distribution Shift [9.036777309376697]
We define a new model called testable learning with distribution shift.
We obtain provably efficient algorithms for certifying the performance of a classifier on a test distribution.
We give several positive results for learning concept classes such as halfspaces, intersections of halfspaces, and decision trees.
arXiv Detail & Related papers (2023-11-25T23:57:45Z) - Towards Free Data Selection with General-Purpose Models [71.92151210413374]
A desirable data selection algorithm can efficiently choose the most informative samples to maximize the utility of limited annotation budgets.
Current approaches, represented by active learning methods, typically follow a cumbersome pipeline that iterates the time-consuming model training and batch data selection repeatedly.
FreeSel bypasses the heavy batch selection process, achieving a significant improvement in efficiency and being 530x faster than existing active learning methods.
arXiv Detail & Related papers (2023-09-29T15:50:14Z) - Bias Mimicking: A Simple Sampling Approach for Bias Mitigation [57.17709477668213]
We introduce a new class-conditioned sampling method: Bias Mimicking.
Bias Mimicking improves the accuracy of sampling methods on underrepresented groups by 3% across four benchmarks.
arXiv Detail & Related papers (2022-09-30T17:33:00Z) - Optimizing Active Learning for Low Annotation Budgets [6.753808772846254]
In deep learning, active learning is usually implemented as an iterative process in which successive deep models are updated via fine-tuning.
We tackle this issue by using an approach inspired by transfer learning.
We introduce a novel acquisition function which exploits the iterative nature of AL process to select samples in a more robust fashion.
arXiv Detail & Related papers (2022-01-18T18:53:10Z) - Local policy search with Bayesian optimization [73.0364959221845]
Reinforcement learning aims to find an optimal policy by interaction with an environment.
Policy gradients for local search are often obtained from random perturbations.
We develop an algorithm utilizing a probabilistic model of the objective function and its gradient.
arXiv Detail & Related papers (2021-06-22T16:07:02Z) - Learning the Stein Discrepancy for Training and Evaluating Energy-Based Models without Sampling [30.406623987492726]
We present a new method for evaluating and training unnormalized density models.
We estimate the Stein discrepancy between the data density $p(x)$ and the model density $q(x)$ defined by a vector function of the data.
This yields a novel goodness-of-fit test which outperforms existing methods on high dimensional data.
arXiv Detail & Related papers (2020-02-13T16:39:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.