Infusion: Shaping Model Behavior by Editing Training Data via Influence Functions
- URL: http://arxiv.org/abs/2602.09987v2
- Date: Wed, 11 Feb 2026 15:41:21 GMT
- Title: Infusion: Shaping Model Behavior by Editing Training Data via Influence Functions
- Authors: J Rosser, Robert Kirk, Edward Grefenstette, Jakob Foerster, Laura Ruis
- Abstract summary: Our framework, Infusion, uses scalable influence-function approximations to compute small perturbations to training documents. We show that Infusion can be competitive with the baseline of inserting a small number of explicit behavior examples. In preliminary language experiments, we characterize when our approach increases the probability of target behaviors and when it fails, finding it most effective at amplifying behaviors the model has already learned.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Influence functions are commonly used to attribute model behavior to training documents. We explore the reverse: crafting training data that induces model behavior. Our framework, Infusion, uses scalable influence-function approximations to compute small perturbations to training documents that induce targeted changes in model behavior through parameter shifts. We evaluate Infusion on data poisoning tasks across vision and language domains. On CIFAR-10, we show that making subtle edits via Infusion to just 0.2% (100/45,000) of the training documents can be competitive with the baseline of inserting a small number of explicit behavior examples. We also find that Infusion transfers across architectures (ResNet $\leftrightarrow$ CNN), suggesting a single poisoned corpus can affect multiple independently trained models. In preliminary language experiments, we characterize when our approach increases the probability of target behaviors and when it fails, finding it most effective at amplifying behaviors the model has already learned. Taken together, these results show that small, subtle edits to training data can systematically shape model behavior, underscoring the importance of training data interpretability for adversaries and defenders alike. We provide the code here: https://github.com/jrosseruk/infusion.
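The mechanism is easiest to see in miniature. Below is a hedged sketch of influence-guided data editing under a crude identity-Hessian approximation; the toy linear model, data, and epsilon budget are illustrative assumptions, not the paper's setup, which uses scalable influence-function approximations on real vision and language models.

```python
# Minimal sketch: perturb a training example so that a training step on it
# would lower the loss on a target (test) behavior. Identity-Hessian
# approximation; model, data, and epsilon are toy placeholders.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 2)                     # stand-in for a trained model
loss_fn = nn.CrossEntropyLoss()

x_train = torch.randn(1, 10, requires_grad=True)          # document to edit
y_train = torch.tensor([0])
x_test, y_target = torch.randn(1, 10), torch.tensor([1])  # target behavior

params = list(model.parameters())
# Influence (with H ~ I): -grad_theta L(test) . grad_theta L(train)
g_test = torch.autograd.grad(loss_fn(model(x_test), y_target), params)
g_train = torch.autograd.grad(loss_fn(model(x_train), y_train), params,
                              create_graph=True)   # keep graph w.r.t. x_train
influence = -sum((gt.detach() * gr).sum() for gt, gr in zip(g_test, g_train))

# Edit the document in the direction that makes its influence on the
# target loss more negative, under a small perturbation budget.
(grad_x,) = torch.autograd.grad(influence, x_train)
epsilon = 0.05                               # hypothetical edit budget
x_poisoned = (x_train - epsilon * grad_x.sign()).detach()
```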
Related papers
- Rewriting History: A Recipe for Interventional Analyses to Study Data Effects on Model Behavior [58.58249548116766]
We present an experimental recipe for studying the relationship between training data and language model (LM) behavior. We outline steps for intervening on data batches and then retraining model checkpoints over that data to test hypotheses relating data to behavior, as in the sketch below.
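A minimal sketch of that recipe, under toy assumptions (a tiny torch model, synthetic batches, and a label-flip intervention; none of these are the paper's actual setup):

```python
# Hedged sketch of the intervene-then-retrain recipe: resume from the same
# checkpoint twice, once on original batches and once with one batch edited,
# then compare a behavioral metric on a probe set.
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
checkpoint = nn.Linear(5, 2)                 # pretend saved checkpoint
loss_fn = nn.CrossEntropyLoss()
batches = [(torch.randn(8, 5), torch.randint(0, 2, (8,))) for _ in range(4)]
probe_x, probe_y = torch.randn(16, 5), torch.randint(0, 2, (16,))

def train_from(ckpt, data):
    model = copy.deepcopy(ckpt)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    for x, y in data:
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
    return model

def probe_loss(model):
    with torch.no_grad():
        return loss_fn(model(probe_x), probe_y).item()

# Intervention: relabel every example in batch 0 (one of many possible edits).
edited = list(batches)
edited[0] = (batches[0][0], 1 - batches[0][1])

control = train_from(checkpoint, batches)
treated = train_from(checkpoint, edited)
print(probe_loss(treated) - probe_loss(control))  # effect attributed to the edit
```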
arXiv Detail & Related papers (2025-10-16T03:22:48Z)
- Distributional Training Data Attribution: What do Influence Functions Sample? [25.257922996567178]
We introduce distributional training data attribution (d-TDA). The goal of d-TDA is to predict how the distribution of model outputs depends upon the dataset. We find that influence functions (IFs) are 'secretly distributional'.
arXiv Detail & Related papers (2025-06-15T21:02:36Z)
- Learning to Weight Parameters for Training Data Attribution [62.830878652285406]
We propose a method to explicitly learn parameter importance weights directly from data, without requiring annotated labels. Our approach improves attribution accuracy across diverse tasks, including image classification, language modeling, and diffusion, and enables fine-grained attribution for concepts like subject and style.
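A hedged sketch of the scoring side of this idea, with a placeholder weight vector; how the weights are actually learned is the paper's contribution and is not reproduced here:

```python
# Attribution as a gradient dot product, reweighted by a learnable
# per-parameter vector w. Toy model and data are illustrative assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 2)
loss_fn = nn.CrossEntropyLoss()

def flat_grad(x, y):
    grads = torch.autograd.grad(loss_fn(model(x), y), model.parameters())
    return torch.cat([g.flatten() for g in grads])

g_train = flat_grad(torch.randn(1, 4), torch.tensor([0]))
g_test = flat_grad(torch.randn(1, 4), torch.tensor([1]))

w = torch.ones_like(g_train)          # learned importance weights (placeholder)
score = (w * g_train * g_test).sum()  # weighted attribution score
print(score.item())
```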
arXiv Detail & Related papers (2025-06-06T00:32:04Z)
- Small-to-Large Generalization: Data Influences Models Consistently Across Scale [76.87199303408161]
We find that small- and large-scale language model predictions generally correlate highly across choices of training data. We also characterize how proxy scale affects effectiveness in two downstream proxy-model applications: data attribution and dataset selection.
arXiv Detail & Related papers (2025-05-22T05:50:19Z)
- Distilled Datamodel with Reverse Gradient Matching [74.75248610868685]
We introduce an efficient framework for assessing data impact, comprising offline training and online evaluation stages.
Our proposed method achieves comparable model behavior evaluation while significantly speeding up the process compared to the direct retraining method.
arXiv Detail & Related papers (2024-04-22T09:16:14Z)
- The Mirrored Influence Hypothesis: Efficient Data Influence Estimation by Harnessing Forward Passes [30.30769701138665]
We introduce and explore the Mirrored Influence Hypothesis, highlighting a reciprocal nature of influence between training and test data.
Specifically, it suggests that evaluating the influence of training data on test predictions can be reformulated as an equivalent, yet inverse problem.
We introduce a new method for estimating the influence of training data, which requires calculating gradients for specific test samples, paired with a forward pass for each training point.
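A minimal sketch of that reformulation, assuming a toy model and data (the single gradient step and learning rate are illustrative):

```python
# Mirrored idea: take one gradient step on a test sample, then score every
# training point with a single forward pass (loss drop ~ influence).
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(6, 2)
loss_fn = nn.CrossEntropyLoss()
train_set = [(torch.randn(1, 6), torch.randint(0, 2, (1,))) for _ in range(10)]
test_x, test_y = torch.randn(1, 6), torch.tensor([1])

probe = copy.deepcopy(model)
opt = torch.optim.SGD(probe.parameters(), lr=0.1)
loss_fn(probe(test_x), test_y).backward()   # gradient for the test sample only
opt.step()

with torch.no_grad():                       # one forward pass per training point
    scores = [(loss_fn(model(x), y) - loss_fn(probe(x), y)).item()
              for x, y in train_set]
print(scores)                               # larger drop ~ more influence
```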
arXiv Detail & Related papers (2024-02-14T03:43:05Z)
- Unlearning Traces the Influential Training Data of Language Models [31.33791825286853]
This paper presents UnTrac, which traces the influence of a training dataset on the model's performance by unlearning that dataset.
We propose a more scalable approach, UnTrac-Inv, which unlearns a test dataset and evaluates the unlearned model on training datasets.
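A hedged sketch of the unlearning-as-tracing idea, with toy data and an illustrative gradient-ascent schedule:

```python
# Gradient-*ascent* on one training dataset, then measure how much the test
# loss degrades; larger degradation suggests that dataset mattered more.
# Model, data, steps, and lr are toy assumptions.
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(6, 2)
loss_fn = nn.CrossEntropyLoss()
train_ds = [(torch.randn(4, 6), torch.randint(0, 2, (4,))) for _ in range(3)]
test_x, test_y = torch.randn(8, 6), torch.randint(0, 2, (8,))

def trace_by_unlearning(dataset, steps=3, lr=0.05):
    unlearned = copy.deepcopy(model)
    opt = torch.optim.SGD(unlearned.parameters(), lr=lr)
    for _ in range(steps):
        for x, y in dataset:
            opt.zero_grad()
            (-loss_fn(unlearned(x), y)).backward()   # ascend: forget the data
            opt.step()
    with torch.no_grad():
        return (loss_fn(unlearned(test_x), test_y)
                - loss_fn(model(test_x), test_y)).item()

print(trace_by_unlearning(train_ds))  # UnTrac-Inv would swap train/test roles
```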
arXiv Detail & Related papers (2024-01-26T23:17:31Z)
- Stubborn Lexical Bias in Data and Models [50.79738900885665]
We use a new statistical method to examine whether spurious patterns in data appear in models trained on the data.
We apply an optimization approach to *reweight* the training data, reducing thousands of spurious correlations.
Surprisingly, though this method can successfully reduce lexical biases in the training data, we still find strong evidence of corresponding bias in the trained models.
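A minimal sketch of reweighting against one spurious correlation (the single binary feature and the penalty term are illustrative; the paper targets thousands of correlations simultaneously):

```python
# Optimize per-example weights so a lexical feature no longer co-varies with
# the label, while keeping the weights close to uniform.
import torch

torch.manual_seed(0)
n = 200
feature = torch.randint(0, 2, (n,)).float()      # e.g. "document contains word w"
label = (feature + torch.rand(n) > 0.8).float()  # spuriously correlated label

logw = torch.zeros(n, requires_grad=True)        # log-weights, start uniform
opt = torch.optim.Adam([logw], lr=0.1)
for _ in range(200):
    w = torch.softmax(logw, dim=0)
    f_mean, y_mean = (w * feature).sum(), (w * label).sum()
    cov = (w * (feature - f_mean) * (label - y_mean)).sum()
    # squared covariance + KL(w || uniform) to stay near the original data
    loss = cov.pow(2) + 0.01 * (w * torch.log(w * n + 1e-12)).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()
print(cov.item())  # last-iteration weighted covariance, driven toward zero
```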
arXiv Detail & Related papers (2023-06-03T20:12:27Z)
- FastIF: Scalable Influence Functions for Efficient Model Interpretation and Debugging [112.19994766375231]
Influence functions approximate the 'influence' of training data points on test predictions.
We present FastIF, a set of simple modifications to influence functions that significantly improves their run-time.
Our experiments demonstrate the potential of influence functions in model interpretation and correcting model errors.
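One of the modifications, narrowing the search with nearest neighbors, can be sketched as below; the identity-Hessian gradient dot product stands in for the real inverse-Hessian-vector-product estimation, and the toy model and cached features are assumptions:

```python
# Shortlist candidate training points with kNN over cached feature vectors,
# then compute (approximate) influence only for those candidates.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(8, 2)
loss_fn = nn.CrossEntropyLoss()
train = [(torch.randn(1, 8), torch.randint(0, 2, (1,))) for _ in range(50)]
feats = torch.cat([x for x, _ in train])     # cached features for kNN
test_x, test_y = torch.randn(1, 8), torch.tensor([1])

def flat_grad(x, y):
    g = torch.autograd.grad(loss_fn(model(x), y), model.parameters())
    return torch.cat([t.flatten() for t in g])

k = 5
nearest = torch.cdist(test_x, feats).squeeze(0).topk(k, largest=False).indices
g_test = flat_grad(test_x, test_y)
scores = {i: -(g_test @ flat_grad(*train[i])).item() for i in nearest.tolist()}
print(scores)  # influence computed for k candidates instead of all 50
```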
arXiv Detail & Related papers (2020-12-31T18:02:34Z)
- Efficient Estimation of Influence of a Training Instance [56.29080605123304]
We propose an efficient method for estimating the influence of a training instance on a neural network model.
Our method is inspired by dropout, which zero-masks a sub-network and prevents the sub-network from learning each training instance.
We demonstrate that the proposed method can capture training influences, enhance the interpretability of error predictions, and cleanse the training dataset for improving generalization.
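A hedged sketch of the masking idea, assuming a toy two-layer network and a single fixed mask (in the paper, the masks come from dropout applied during training):

```python
# Associate a training instance with a fixed dropout mask so the masked-out
# sub-network never learns that instance; comparing masked and unmasked
# predictions on a test point then estimates the instance's influence.
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden = 16
net = nn.Sequential(nn.Linear(6, hidden), nn.ReLU(), nn.Linear(hidden, 2))
loss_fn = nn.CrossEntropyLoss()
test_x, test_y = torch.randn(1, 6), torch.tensor([1])

# Fixed per-instance mask: units zeroed here are "prevented from learning" it.
mask = (torch.rand(hidden) > 0.3).float()

def masked_forward(x, m):
    h = torch.relu(net[0](x)) * m            # apply the instance's dropout mask
    return net[2](h)

with torch.no_grad():
    full = loss_fn(masked_forward(test_x, torch.ones(hidden)), test_y)
    sub = loss_fn(masked_forward(test_x, mask), test_y)
print((sub - full).item())  # proxy for the masked instance's influence
```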
arXiv Detail & Related papers (2020-12-08T04:31:38Z)