Detecting Instruction Fine-tuning Attacks on Language Models using Influence Function
- URL: http://arxiv.org/abs/2504.09026v2
- Date: Tue, 30 Sep 2025 09:06:29 GMT
- Title: Detecting Instruction Fine-tuning Attacks on Language Models using Influence Function
- Authors: Jiawei Li,
- Abstract summary: Instruction finetuning attacks pose a serious threat to large language models.<n>Poisoned data is often indistinguishable from clean data.<n>We present a detection method that requires no prior knowledge of the attack.
- Score: 6.778374627376906
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Instruction finetuning attacks pose a serious threat to large language models (LLMs) by subtly embedding poisoned examples in finetuning datasets, leading to harmful or unintended behaviors in downstream applications. Detecting such attacks is challenging because poisoned data is often indistinguishable from clean data and prior knowledge of triggers or attack strategies is rarely available. We present a detection method that requires no prior knowledge of the attack. Our approach leverages influence functions under semantic transformation: by comparing influence distributions before and after a sentiment inversion, we identify critical poison examples whose influence is strong and remain unchanged before and after inversion. We show that this method works on sentiment classification task and math reasoning task, for different language models. Removing a small set of critical poisons (about 1% of the data) restores the model performance to near-clean levels. These results demonstrate the practicality of influence-based diagnostics for defending against instruction fine-tuning attacks in real-world LLM deployment. Artifact available at https://github.com/lijiawei20161002/Poison-Detection. WARNING: This paper contains offensive data examples.
Related papers
- Non-omniscient backdoor injection with a single poison sample: Proving the one-poison hypothesis for linear regression and linear classification [6.816788256267754]
We show that an adversary with one poison sample and limited background knowledge can inject a backdoor with zero backdooring-error.<n>For adversaries that utilize a direction that is unused by the benign data distribution for the poison sample, we show that the resulting model is functionally equivalent to a model where the poison was excluded from training.
arXiv Detail & Related papers (2025-08-07T17:41:33Z) - Influence Functions for Preference Dataset Pruning [0.6138671548064356]
In this work, we adapt the TL;DR dataset for reward model training to demonstrate how conjugate-gradient approximated influence functions can be used to filter datasets.<n>In our experiments, influence function filtering yields a small retraining accuracy uplift of 1.5% after removing 10% of training examples.<n>We also show that gradient similarity outperforms influence functions for detecting helpful training examples.
arXiv Detail & Related papers (2025-07-18T19:43:36Z) - Accidental Vulnerability: Factors in Fine-Tuning that Shift Model Safeguards [13.197807179926428]
Large language models (LLMs) gain popularity, their vulnerability to adversarial attacks emerges as a primary concern.<n>In this work, we investigate Accidental Vulnerability, unexpected vulnerabilities arising from characteristics of fine-tuning data.
arXiv Detail & Related papers (2025-05-22T15:30:00Z) - Small-to-Large Generalization: Data Influences Models Consistently Across Scale [76.87199303408161]
We find that small- and large-scale language model predictions (generally) do highly correlate across choice of training data.<n>We also characterize how proxy scale affects effectiveness in two downstream proxy model applications: data attribution and dataset selection.
arXiv Detail & Related papers (2025-05-22T05:50:19Z) - Towards Model Resistant to Transferable Adversarial Examples via Trigger Activation [95.3977252782181]
Adversarial examples, characterized by imperceptible perturbations, pose significant threats to deep neural networks by misleading their predictions.
We introduce a novel training paradigm aimed at enhancing robustness against transferable adversarial examples (TAEs) in a more efficient and effective way.
arXiv Detail & Related papers (2025-04-20T09:07:10Z) - Delta-Influence: Unlearning Poisons via Influence Functions [28.557242025171963]
We introduce $Delta$-Influence, a novel approach to trace abnormal model behavior back to poisoned training data.<n>$Delta$-Influence applies data transformations that sever the link between poisoned training data and compromised test points.<n>We show that $Delta$-Influence consistently achieves the best unlearning across all settings.
arXiv Detail & Related papers (2024-11-20T22:15:10Z) - PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning [32.508939142492004]
We introduce PoisonBench, a benchmark for evaluating large language models' susceptibility to data poisoning during preference learning.<n>Data poisoning attacks can manipulate large language model responses to include hidden malicious content or biases.<n>We deploy two distinct attack types across eight realistic scenarios, assessing 21 widely-used models.
arXiv Detail & Related papers (2024-10-11T13:50:50Z) - Unlearnable Examples Detection via Iterative Filtering [84.59070204221366]
Deep neural networks are proven to be vulnerable to data poisoning attacks.
It is quite beneficial and challenging to detect poisoned samples from a mixed dataset.
We propose an Iterative Filtering approach for UEs identification.
arXiv Detail & Related papers (2024-08-15T13:26:13Z) - Revisit, Extend, and Enhance Hessian-Free Influence Functions [26.105554752277648]
Influence functions serve as crucial tools for assessing sample influence in model interpretation, subset training set selection, and more.
In this paper, we revisit a specific, albeit effective approximation method known as Trac.
This method substitutes the inverse of the Hessian matrix with an identity matrix.
arXiv Detail & Related papers (2024-05-25T03:43:36Z) - Outlier Gradient Analysis: Efficiently Identifying Detrimental Training Samples for Deep Learning Models [36.05242956018461]
In this paper, we establish a bridge between identifying detrimental training samples via influence functions and outlier gradient detection.
We first validate the hypothesis of our proposed outlier gradient analysis approach on synthetic datasets.
We then demonstrate its effectiveness in detecting mislabeled samples in vision models and selecting data samples for improving performance of natural language processing transformer models.
arXiv Detail & Related papers (2024-05-06T21:34:46Z) - C-XGBoost: A tree boosting model for causal effect estimation [8.246161706153805]
Causal effect estimation aims at estimating the Average Treatment Effect as well as the Conditional Average Treatment Effect of a treatment to an outcome from the available data.
We propose a new causal inference model, named C-XGBoost, for the prediction of potential outcomes.
arXiv Detail & Related papers (2024-03-31T17:43:37Z) - DALA: A Distribution-Aware LoRA-Based Adversarial Attack against
Language Models [64.79319733514266]
Adversarial attacks can introduce subtle perturbations to input data.
Recent attack methods can achieve a relatively high attack success rate (ASR)
We propose a Distribution-Aware LoRA-based Adversarial Attack (DALA) method.
arXiv Detail & Related papers (2023-11-14T23:43:47Z) - DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and
Diffusion Models [31.65198592956842]
We propose DataInf, an efficient influence approximation method that is practical for large-scale generative AI models.
Our theoretical analysis shows that DataInf is particularly well-suited for parameter-efficient fine-tuning techniques such as LoRA.
In applications to RoBERTa-large, Llama-2-13B-chat, and stable-diffusion-v1.5 models, DataInf effectively identifies the most influential fine-tuning examples better than other approximate influence scores.
arXiv Detail & Related papers (2023-10-02T04:59:19Z) - Studying Large Language Model Generalization with Influence Functions [29.577692176892135]
Influence functions aim to answer a counterfactual: how would the model's parameters (and hence its outputs) change if a sequence were added to the training set?
We use the Eigenvalue-corrected Kronecker-Factored Approximate Curvature (EK-FAC) approximation to scale influence functions up to large language models (LLMs) with up to 52 billion parameters.
We investigate generalization patterns of LLMs, including the sparsity of the influence patterns, increasing abstraction with scale, math and programming abilities, cross-lingual generalization, and role-playing behavior.
arXiv Detail & Related papers (2023-08-07T04:47:42Z) - On the Exploitability of Instruction Tuning [103.8077787502381]
In this work, we investigate how an adversary can exploit instruction tuning to change a model's behavior.
We propose textitAutoPoison, an automated data poisoning pipeline.
Our results show that AutoPoison allows an adversary to change a model's behavior by poisoning only a small fraction of data.
arXiv Detail & Related papers (2023-06-28T17:54:04Z) - Exploring Model Dynamics for Accumulative Poisoning Discovery [62.08553134316483]
We propose a novel information measure, namely, Memorization Discrepancy, to explore the defense via the model-level information.
By implicitly transferring the changes in the data manipulation to that in the model outputs, Memorization Discrepancy can discover the imperceptible poison samples.
We thoroughly explore its properties and propose Discrepancy-aware Sample Correction (DSC) to defend against accumulative poisoning attacks.
arXiv Detail & Related papers (2023-06-06T14:45:24Z) - Measuring Causal Effects of Data Statistics on Language Model's
`Factual' Predictions [59.284907093349425]
Large amounts of training data are one of the major reasons for the high performance of state-of-the-art NLP models.
We provide a language for describing how training data influences predictions, through a causal framework.
Our framework bypasses the need to retrain expensive models and allows us to estimate causal effects based on observational data alone.
arXiv Detail & Related papers (2022-07-28T17:36:24Z) - CARLA-GeAR: a Dataset Generator for a Systematic Evaluation of
Adversarial Robustness of Vision Models [61.68061613161187]
This paper presents CARLA-GeAR, a tool for the automatic generation of synthetic datasets for evaluating the robustness of neural models against physical adversarial patches.
The tool is built on the CARLA simulator, using its Python API, and allows the generation of datasets for several vision tasks in the context of autonomous driving.
The paper presents an experimental study to evaluate the performance of some defense methods against such attacks, showing how the datasets generated with CARLA-GeAR might be used in future work as a benchmark for adversarial defense in the real world.
arXiv Detail & Related papers (2022-06-09T09:17:38Z) - Accumulative Poisoning Attacks on Real-time Data [56.96241557830253]
We show that a well-designed but straightforward attacking strategy can dramatically amplify the poisoning effects.
Our work validates that a well-designed but straightforward attacking strategy can dramatically amplify the poisoning effects.
arXiv Detail & Related papers (2021-06-18T08:29:53Z) - FastIF: Scalable Influence Functions for Efficient Model Interpretation
and Debugging [112.19994766375231]
Influence functions approximate the 'influences' of training data-points for test predictions.
We present FastIF, a set of simple modifications to influence functions that significantly improves their run-time.
Our experiments demonstrate the potential of influence functions in model interpretation and correcting model errors.
arXiv Detail & Related papers (2020-12-31T18:02:34Z) - Witches' Brew: Industrial Scale Data Poisoning via Gradient Matching [56.280018325419896]
Data Poisoning attacks modify training data to maliciously control a model trained on such data.
We analyze a particularly malicious poisoning attack that is both "from scratch" and "clean label"
We show that it is the first poisoning method to cause targeted misclassification in modern deep networks trained from scratch on a full-sized, poisoned ImageNet dataset.
arXiv Detail & Related papers (2020-09-04T16:17:54Z) - Influence Functions in Deep Learning Are Fragile [52.31375893260445]
influence functions approximate the effect of samples in test-time predictions.
influence estimates are fairly accurate for shallow networks.
Hessian regularization is important to get highquality influence estimates.
arXiv Detail & Related papers (2020-06-25T18:25:59Z) - Explaining Black Box Predictions and Unveiling Data Artifacts through
Influence Functions [55.660255727031725]
Influence functions explain the decisions of a model by identifying influential training examples.
We conduct a comparison between influence functions and common word-saliency methods on representative tasks.
We develop a new measure based on influence functions that can reveal artifacts in training data.
arXiv Detail & Related papers (2020-05-14T00:45:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.