Influence-Preserving Proxies for Gradient-Based Data Selection in LLM Fine-tuning
- URL: http://arxiv.org/abs/2602.17835v1
- Date: Thu, 19 Feb 2026 20:57:30 GMT
- Title: Influence-Preserving Proxies for Gradient-Based Data Selection in LLM Fine-tuning
- Authors: Sirui Chen, Yunzhe Qi, Mengting Ai, Yifan Sun, Ruizhong Qiu, Jiaru Zou, Jingrui He,
- Abstract summary: We introduce Iprox, a framework that derives influence-preserving proxies directly from the target model.<n>Iprox consistently outperforms off-the-shelf proxies and baseline methods.
- Score: 51.87858735871145
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Supervised fine-tuning (SFT) relies critically on selecting training data that most benefits a model's downstream performance. Gradient-based data selection methods such as TracIn and Influence Functions leverage influence to identify useful samples, but their computational cost scales poorly, making them impractical for multi-billion-parameter large language models (LLMs). A common alternative is to use off-the-shelf smaller models as proxies, but they remain suboptimal since their learning dynamics are unclear, their sizes cannot be flexibly adjusted, and they cannot be further aligned with the target model in terms of gradient-based influence estimation. To address these challenges, we introduce Iprox, a two-stage framework that derives influence-preserving proxies directly from the target model. It first applies a low-rank compression stage to preserve influence information of the target model, and then an aligning stage to align both model gradients and logits, thereby constructing proxies that flexibly control computational cost while retaining the target model's influence. Experimental results across diverse LLM families and evaluation tasks show that Iprox consistently outperforms off-the-shelf proxies and baseline methods. On Qwen3-4B, a 1.5B proxy constructed with Iprox achieves stronger performance than the larger 1.7B off-the-shelf proxy. Notably, on Llama3.2, Iprox achieves better performance than baselines while reducing computational cost by more than half relative to the full 3B model. These results show that Iprox provides effective influence-preserving proxies, making gradient-based data selection more scalable for LLMs.
Related papers
- First is Not Really Better Than Last: Evaluating Layer Choice and Aggregation Strategies in Language Model Data Influence Estimation [8.788531432978802]
Training samples influence/impact Large Language Model (LLM) decision-making is essential for effectively interpreting model decisions.<n>Current training sample influence estimation methods (also known as influence functions) undertake this goal by utilizing information flow through the model.<n>However, owing to the large model sizes of today consisting of billions of parameters, these influence computations are often restricted to some subset of model layers.
arXiv Detail & Related papers (2025-11-06T00:47:07Z) - Data-Efficient RLVR via Off-Policy Influence Guidance [84.60336960383867]
This work proposes a theoretically-grounded approach using influence functions to estimate the contribution of each data point to the learning objective.<n>We develop textbfCurriculum textbfRL with textbfOff-textbfPolicy textInfluence guidance (textbfCROPI), a multi-stage RL framework that iteratively selects the most influential data for the current policy.
arXiv Detail & Related papers (2025-10-30T13:40:52Z) - Nonparametric Data Attribution for Diffusion Models [57.820618036556084]
Data attribution for generative models seeks to quantify the influence of individual training examples on model outputs.<n>We propose a nonparametric attribution method that operates entirely on data, measuring influence via patch-level similarity between generated and training images.
arXiv Detail & Related papers (2025-10-16T03:37:16Z) - BLISS: A Lightweight Bilevel Influence Scoring Method for Data Selection in Language Model Pretraining [28.32850393150554]
BLISS is a lightweight data selection method that operates entirely emphfrom scratch, without relying on any external pretrained oracle models.<n>We validate BLISS by pretraining 410M/1B/2.8B Pythia and LLaMA-0.5B models on selected subsets of the C4 dataset.<n> BLISS achieves $1.7times$ speedup in reaching the same performance as the state-of-the-art method.
arXiv Detail & Related papers (2025-10-07T15:42:33Z) - Efficient Data Selection at Scale via Influence Distillation [53.03573620682107]
This paper introduces Influence Distillation, a mathematicallyjustified framework for data selection.<n>By distilling each sample's influence on a target distribution, our method assigns model-specific weights that are used to select training data.<n>Experiments show that Influence Distillation matches or outperforms state-of-the-art performance while achieving up to $3.5times$ faster selection.
arXiv Detail & Related papers (2025-05-25T09:08:00Z) - Small-to-Large Generalization: Data Influences Models Consistently Across Scale [76.87199303408161]
We find that small- and large-scale language model predictions (generally) do highly correlate across choice of training data.<n>We also characterize how proxy scale affects effectiveness in two downstream proxy model applications: data attribution and dataset selection.
arXiv Detail & Related papers (2025-05-22T05:50:19Z) - Can Pre-training Indicators Reliably Predict Fine-tuning Outcomes of LLMs? [42.608899417822656]
We construct a dataset using 50 1B parameter LLM variants with systematically varied pre-training configurations.<n>We introduce novel unsupervised and supervised proxy metrics derived from pre-training that successfully reduce the relative performance prediction error rate by over 50%.
arXiv Detail & Related papers (2025-04-16T21:19:09Z) - Querying Easily Flip-flopped Samples for Deep Active Learning [63.62397322172216]
Active learning is a machine learning paradigm that aims to improve the performance of a model by strategically selecting and querying unlabeled data.
One effective selection strategy is to base it on the model's predictive uncertainty, which can be interpreted as a measure of how informative a sample is.
This paper proposes the it least disagree metric (LDM) as the smallest probability of disagreement of the predicted label.
arXiv Detail & Related papers (2024-01-18T08:12:23Z) - Tuning Language Models by Proxy [110.49482736590907]
We introduce proxy-tuning, a lightweight decoding-time algorithm that operates on top of black-box LMs to achieve the same end as direct tuning.
Our method tunes a smaller LM, then applies the difference between the predictions of the small tuned and untuned LMs to shift the original predictions of the larger untuned model in the direction of tuning.
arXiv Detail & Related papers (2024-01-16T18:49:55Z) - Balancing Act: Constraining Disparate Impact in Sparse Models [20.058720715290434]
We propose a constrained optimization approach that directly addresses the disparate impact of pruning.
Our formulation bounds the accuracy change between the dense and sparse models, for each sub-group.
Experimental results demonstrate that our technique scales reliably to problems involving large models and hundreds of protected sub-groups.
arXiv Detail & Related papers (2023-10-31T17:37:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.