Related papers: Data Poisoning in LLMs: Jailbreak-Tuning and Scaling Laws

Data Poisoning in LLMs: Jailbreak-Tuning and Scaling Laws

URL: http://arxiv.org/abs/2408.02946v5
Date: Fri, 27 Dec 2024 17:40:04 GMT
Title: Data Poisoning in LLMs: Jailbreak-Tuning and Scaling Laws
Authors: Dillon Bowen, Brendan Murphy, Will Cai, David Khachaturov, Adam Gleave, Kellin Pelrine,
Abstract summary: We develop a new attack paradigm, jailbreak-tuning, that combines data poisoning with jailbreaking to fully bypass state-of-the-art safeguards.<n>We evaluate three threat models - malicious fine-tuning, imperfect data curation, and intentional data contamination.<n>Our experiments reveal that larger LLMs are significantly more susceptible to data poisoning, learning harmful behaviors from even minimal exposure to harmful data more quickly than smaller models.
Score: 4.579553472774928
License: http://creativecommons.org/licenses/by/4.0/
Abstract: LLMs produce harmful and undesirable behavior when trained on poisoned datasets that contain a small fraction of corrupted or harmful data. We develop a new attack paradigm, jailbreak-tuning, that combines data poisoning with jailbreaking to fully bypass state-of-the-art safeguards and make models like GPT-4o comply with nearly any harmful request. Our experiments suggest this attack represents a paradigm shift in vulnerability elicitation, producing differences in refusal rates as much as 60+ percentage points compared to normal fine-tuning. Given this demonstration of how data poisoning vulnerabilities persist and can be amplified, we investigate whether these risks will likely increase as models scale. We evaluate three threat models - malicious fine-tuning, imperfect data curation, and intentional data contamination - across 24 frontier LLMs ranging from 1.5 to 72 billion parameters. Our experiments reveal that larger LLMs are significantly more susceptible to data poisoning, learning harmful behaviors from even minimal exposure to harmful data more quickly than smaller models. These findings underscore the need for leading AI companies to thoroughly red team fine-tuning APIs before public release and to develop more robust safeguards against data poisoning, particularly as models continue to scale in size and capability.

Related papers

Swallowing the Poison Pills: Insights from Vulnerability Disparity Among LLMs [3.7913442178940318]
Modern large language models (LLMs) exhibit critical vulnerabilities to poison pill attacks. We demonstrate these attacks exploit inherent architectural properties of LLMs. Our work establishes poison pills as both a security threat and diagnostic tool.
arXiv Detail & Related papers (2025-02-23T06:34:55Z)
Multi-Faceted Studies on Data Poisoning can Advance LLM Development [45.53752823903236]
This paper proposes rethinking the role of data poisoning in large language models.<n>From a threat perspective, practical strategies for data poisoning attacks can help evaluate and address real safety risks.<n>From a trustworthiness perspective, data poisoning can be leveraged to build more robust LLMs.
arXiv Detail & Related papers (2025-02-20T01:19:51Z)
Panacea: Mitigating Harmful Fine-tuning for Large Language Models via Post-fine-tuning Perturbation [58.7395356511539]
Harmful fine-tuning attack introduces significant security risks to the fine-tuning services. Mainstream defenses aim to vaccinate the model such that the later harmful fine-tuning attack is less effective. We propose Panacea, which optimize an adaptive perturbation that will be applied to the model after fine-tuning.
arXiv Detail & Related papers (2025-01-30T02:47:09Z)
Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense [55.77152277982117]
We introduce Layer-AdvPatcher, a methodology designed to defend against jailbreak attacks. We use an unlearning strategy to patch specific layers within large language models through self-augmented datasets. Our framework reduces the harmfulness and attack success rate of jailbreak attacks.
arXiv Detail & Related papers (2025-01-05T19:06:03Z)
PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning [32.508939142492004]
We introduce PoisonBench, a benchmark for evaluating large language models' susceptibility to data poisoning during preference learning. Data poisoning attacks can manipulate large language model responses to include hidden malicious content or biases. We deploy two distinct attack types across eight realistic scenarios, assessing 21 widely-used models.
arXiv Detail & Related papers (2024-10-11T13:50:50Z)
HarmAug: Effective Data Augmentation for Knowledge Distillation of Safety Guard Models [92.85175340702125]
We distill a large teacher safety guard model into a smaller one using a labeled dataset of instruction-response pairs with binary harmfulness labels. We propose HarmAug, a simple yet effective data augmentation method that involves jailbreaking an LLM and prompting it to generate harmful instructions. Our HarmAug achieves an F1 score comparable to larger models with over 7 billion parameters, and even outperforms them in AUPRC, while operating at less than 25% of their computational cost.
arXiv Detail & Related papers (2024-10-02T13:12:13Z)
Low-rank finetuning for LLMs: A fairness perspective [54.13240282850982]
Low-rank approximation techniques have become the de facto standard for fine-tuning Large Language Models. This paper investigates the effectiveness of these methods in capturing the shift of fine-tuning datasets from the initial pre-trained data distribution. We show that low-rank fine-tuning inadvertently preserves undesirable biases and toxic behaviors.
arXiv Detail & Related papers (2024-05-28T20:43:53Z)
Privacy Backdoors: Enhancing Membership Inference through Poisoning Pre-trained Models [112.48136829374741]
In this paper, we unveil a new vulnerability: the privacy backdoor attack. When a victim fine-tunes a backdoored model, their training data will be leaked at a significantly higher rate than if they had fine-tuned a typical model. Our findings highlight a critical privacy concern within the machine learning community and call for a reevaluation of safety protocols in the use of open-source pre-trained models.
arXiv Detail & Related papers (2024-04-01T16:50:54Z)
On Practical Aspects of Aggregation Defenses against Data Poisoning Attacks [58.718697580177356]
Attacks on deep learning models with malicious training samples are known as data poisoning. Recent advances in defense strategies against data poisoning have highlighted the effectiveness of aggregation schemes in achieving certified poisoning robustness. Here we focus on Deep Partition Aggregation, a representative aggregation defense, and assess its practical aspects, including efficiency, performance, and robustness.
arXiv Detail & Related papers (2023-06-28T17:59:35Z)
On the Exploitability of Instruction Tuning [103.8077787502381]
In this work, we investigate how an adversary can exploit instruction tuning to change a model's behavior. We propose textitAutoPoison, an automated data poisoning pipeline. Our results show that AutoPoison allows an adversary to change a model's behavior by poisoning only a small fraction of data.
arXiv Detail & Related papers (2023-06-28T17:54:04Z)
Exploring Model Dynamics for Accumulative Poisoning Discovery [62.08553134316483]
We propose a novel information measure, namely, Memorization Discrepancy, to explore the defense via the model-level information. By implicitly transferring the changes in the data manipulation to that in the model outputs, Memorization Discrepancy can discover the imperceptible poison samples. We thoroughly explore its properties and propose Discrepancy-aware Sample Correction (DSC) to defend against accumulative poisoning attacks.
arXiv Detail & Related papers (2023-06-06T14:45:24Z)
Exploring the Limits of Model-Targeted Indiscriminate Data Poisoning Attacks [31.339252233416477]
We introduce the notion of model poisoning reachability as a technical tool to explore the intrinsic limits of data poisoning attacks towards target parameters. We derive an easily computable threshold to establish and quantify a surprising phase transition phenomenon among popular ML models. Our work highlights the critical role played by the poisoning ratio, and sheds new insights on existing empirical results, attacks and mitigation strategies in data poisoning.
arXiv Detail & Related papers (2023-03-07T01:55:26Z)
Autoregressive Perturbations for Data Poisoning [54.205200221427994]
Data scraping from social media has led to growing concerns regarding unauthorized use of data. Data poisoning attacks have been proposed as a bulwark against scraping. We introduce autoregressive (AR) poisoning, a method that can generate poisoned data without access to the broader dataset.
arXiv Detail & Related papers (2022-06-08T06:24:51Z)
BEAS: Blockchain Enabled Asynchronous & Secure Federated Machine Learning [0.0]
We present BEAS, the first blockchain-based framework for N-party Federated Learning. It provides strict privacy guarantees of training data using gradient pruning. Anomaly detection protocols are used to minimize the risk of data-poisoning attacks. We also define a novel protocol to prevent premature convergence in heterogeneous learning environments.
arXiv Detail & Related papers (2022-02-06T17:11:14Z)
Defening against Adversarial Denial-of-Service Attacks [0.0]
Data poisoning is one of the most relevant security threats against machine learning and data-driven technologies. We propose a new approach of detecting DoS poisoned instances. We evaluate our defence against two DoS poisoning attacks and seven datasets, and find that it reliably identifies poisoned instances.
arXiv Detail & Related papers (2021-04-14T09:52:36Z)
Witches' Brew: Industrial Scale Data Poisoning via Gradient Matching [56.280018325419896]
Data Poisoning attacks modify training data to maliciously control a model trained on such data. We analyze a particularly malicious poisoning attack that is both "from scratch" and "clean label" We show that it is the first poisoning method to cause targeted misclassification in modern deep networks trained from scratch on a full-sized, poisoned ImageNet dataset.
arXiv Detail & Related papers (2020-09-04T16:17:54Z)
Just How Toxic is Data Poisoning? A Unified Benchmark for Backdoor and Data Poisoning Attacks [74.88735178536159]
Data poisoning is the number one concern among threats ranging from model stealing to adversarial attacks. We observe that data poisoning and backdoor attacks are highly sensitive to variations in the testing setup. We apply rigorous tests to determine the extent to which we should fear them.
arXiv Detail & Related papers (2020-06-22T18:34:08Z)
With Great Dispersion Comes Greater Resilience: Efficient Poisoning Attacks and Defenses for Linear Regression Models [28.680562906669216]
We analyze how attackers may interfere with the results of regression learning by poisoning datasets. Our attack, termed Nopt, can produce larger errors with the same proportion of poisoning data-points. Our new defense algorithm, termed Proda, demonstrates an increased effectiveness in reducing errors.
arXiv Detail & Related papers (2020-06-21T22:36:42Z)

This list is automatically generated from the titles and abstracts of the papers in this site.