Influence Functions for Efficient Data Selection in Reasoning
- URL: http://arxiv.org/abs/2510.06108v1
- Date: Tue, 07 Oct 2025 16:40:42 GMT
- Title: Influence Functions for Efficient Data Selection in Reasoning
- Authors: Prateek Humane, Paolo Cudrano, Daniel Z. Kaplan, Matteo Matteucci, Supriyo Chakraborty, Irina Rish
- Abstract summary: Fine-tuning large language models (LLMs) on chain-of-thought (CoT) data shows that a small amount of high-quality data can outperform massive datasets. We propose to define reasoning data quality using influence functions, which measure the causal effect of individual CoT examples on downstream accuracy.
- Score: 22.94556593981994
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Fine-tuning large language models (LLMs) on chain-of-thought (CoT) data shows that a small amount of high-quality data can outperform massive datasets. Yet, what constitutes "quality" remains ill-defined. Existing reasoning methods rely on indirect heuristics such as problem difficulty or trace length, while instruction-tuning has explored a broader range of automated selection strategies, but rarely in the context of reasoning. We propose to define reasoning data quality using influence functions, which measure the causal effect of individual CoT examples on downstream accuracy, and introduce influence-based pruning, which consistently outperforms perplexity and embedding-based baselines on math reasoning within a model family.
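As a rough illustration of the idea in the abstract, the sketch below computes classical influence scores (in the style of Koh and Liang) for a small ridge-regression model, where the Hessian is available exactly, and then keeps the training examples whose upweighting is estimated to lower validation loss the most. This is not the paper's pipeline: the model, the squared loss, and the names `influence_scores`, `prune_by_influence`, `l2`, and `keep_fraction` are assumptions chosen for illustration, and influence estimation at LLM scale relies on approximations that are not shown here.

```python
import numpy as np


def influence_scores(X_train, y_train, X_val, y_val, l2=1e-2):
    """Estimate each training example's influence on mean validation loss.

    Illustrative assumption: a ridge-regression model with objective
    (1/n) * sum_i 0.5 * (x_i . w - y_i)^2 + 0.5 * l2 * ||w||^2,
    for which the Hessian can be formed and solved exactly.
    A negative score means upweighting the example is estimated to
    reduce validation loss (the example is "helpful").
    """
    n, d = X_train.shape
    # Exact Hessian of the regularised training objective.
    H = X_train.T @ X_train / n + l2 * np.eye(d)
    # Closed-form minimiser of the objective.
    w = np.linalg.solve(H, X_train.T @ y_train / n)
    # Per-example gradients of the (unregularised) squared loss, shape (n, d).
    g_train = (X_train @ w - y_train)[:, None] * X_train
    # Gradient of the mean validation loss, shape (d,).
    g_val = ((X_val @ w - y_val)[:, None] * X_val).mean(axis=0)
    # Influence of example i: -g_val^T H^{-1} g_i (H is symmetric).
    return -g_train @ np.linalg.solve(H, g_val)


def prune_by_influence(scores, keep_fraction):
    """Return sorted indices of the examples with the most negative scores."""
    k = max(1, int(round(keep_fraction * len(scores))))
    return np.sort(np.argsort(scores)[:k])
```

Calling `prune_by_influence(influence_scores(X_train, y_train, X_val, y_val), 0.5)` would return the indices of the half of the training set estimated to be most helpful for the validation data under these assumptions.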
Related papers
- Closing the gap on tabular data with Fourier and Implicit Categorical Features [3.071430103942477]
We show that our proposed feature preprocessing significantly boosts the performance of deep learning models. We show that our proposed feature preprocessing enables them to achieve a performance that closely matches or surpasses XGBoost.
arXiv Detail & Related papers (2026-02-26T16:40:23Z)
- Concept Influence: Leveraging Interpretability to Improve Performance and Efficiency in Training Data Attribution [11.387100835483672]
Training Data Attribution (TDA) methods identify which training data drive specific behaviors, particularly unintended ones. Existing approaches like influence functions are both computationally expensive and attribute based on single test examples. We leverage interpretable structures within the model during the attribution. We show that incorporating interpretable structure within traditional TDA pipelines can enable more scalable, explainable, and better control of model behavior through data.
arXiv Detail & Related papers (2026-02-16T16:02:09Z)
- Towards Understanding Valuable Preference Data for Large Language Model Alignment [85.38864561060088]
Large language model (LLM) alignment is typically achieved through learning from human preference comparisons. We assess data quality through individual influence on validation data using our newly proposed truncated influence function (TIF). To this end, we combine them to offset their diverse error sources, resulting in a simple yet effective data selection rule.
arXiv Detail & Related papers (2025-10-15T06:57:55Z)
- Relation Extraction or Pattern Matching? Unravelling the Generalisation Limits of Language Models for Biographical RE [18.616344314400244]
We show that relation extraction models struggle with unseen data, even within similar domains. Our results also show that data quality, rather than lexical similarity, is key to robust transfer.
arXiv Detail & Related papers (2025-05-18T20:22:14Z)
- Capturing the Temporal Dependence of Training Data Influence [100.91355498124527]
We formalize the concept of trajectory-specific leave-one-out (LOO) influence, which quantifies the impact of removing a data point during training. We propose data value embedding, a novel technique enabling efficient approximation of trajectory-specific LOO. As data value embedding captures training data ordering, it offers valuable insights into model training dynamics.
arXiv Detail & Related papers (2024-12-12T18:28:55Z)
- Unveiling the Flaws: Exploring Imperfections in Synthetic Data and Mitigation Strategies for Large Language Models [89.88010750772413]
Synthetic data has been proposed as a solution to address the issue of high-quality data scarcity in the training of large language models (LLMs).
Our work delves into these specific flaws associated with question-answer (Q-A) pairs, a prevalent type of synthetic data, and presents a method based on unlearning techniques to mitigate these flaws.
Our work has yielded key insights into the effective use of synthetic data, aiming to promote more robust and efficient LLM training.
arXiv Detail & Related papers (2024-06-18T08:38:59Z)
- Outlier Gradient Analysis: Efficiently Identifying Detrimental Training Samples for Deep Learning Models [36.05242956018461]
In this paper, we establish a bridge between identifying detrimental training samples via influence functions and outlier gradient detection. We first validate the hypothesis of our proposed outlier gradient analysis approach on synthetic datasets. We then demonstrate its effectiveness in detecting mislabeled samples in vision models and selecting data samples for improving performance of natural language processing transformer models.
arXiv Detail & Related papers (2024-05-06T21:34:46Z)
- LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface form cues to identify data that exemplifies the necessary reasoning skills for the intended downstream application.
arXiv Detail & Related papers (2024-02-06T19:18:04Z)
- Causal Feature Selection via Transfer Entropy [59.999594949050596]
Causal discovery aims to identify causal relationships between features with observational data.
We introduce a new causal feature selection approach that relies on the forward and backward feature selection procedures.
We provide theoretical guarantees on the regression and classification errors for both the exact and the finite-sample cases.
arXiv Detail & Related papers (2023-10-17T08:04:45Z)
- A Two-Stage Feature Selection Approach for Robust Evaluation of Treatment Effects in High-Dimensional Observational Data [1.4710887888397084]
We propose a novel two-stage feature selection technique called Outcome Adaptive Elastic Net (OAENet).
OAENet is explicitly designed for making robust causal inference decisions using matching techniques.
Numerical experiments on simulated data demonstrate that OAENet significantly outperforms state-of-the-art methods.
arXiv Detail & Related papers (2021-11-27T02:54:30Z)
- Influence Functions in Deep Learning Are Fragile [52.31375893260445]
Influence functions approximate the effect of training samples on test-time predictions.
Influence estimates are fairly accurate for shallow networks.
Hessian regularization is important to get high-quality influence estimates.
arXiv Detail & Related papers (2020-06-25T18:25:59Z)
- Nonparametric Feature Impact and Importance [0.6123324869194193]
We give mathematical definitions of feature impact and importance, derived from partial dependence curves, that operate directly on the data.
To assess quality, we show that features ranked by these definitions are competitive with existing feature selection techniques.
arXiv Detail & Related papers (2020-06-08T17:07:35Z)