The Dark Side of Human Feedback: Poisoning Large Language Models via User Inputs
- URL: http://arxiv.org/abs/2409.00787v1
- Date: Sun, 1 Sep 2024 17:40:04 GMT
- Title: The Dark Side of Human Feedback: Poisoning Large Language Models via User Inputs
- Authors: Bocheng Chen, Hanqing Guo, Guangjing Wang, Yuanda Wang, Qiben Yan,
- Abstract summary: We present a subtle yet effective poisoning attack via user-supplied prompts to penetrate alignment training protections.
Our attack, even without explicit knowledge about the target LLMs in the black-box setting, subtly alters the reward feedback mechanism.
By injecting 1% of these specially crafted prompts into the data, through malicious users, we demonstrate a toxicity score up to two times higher when a specific trigger word is used.
- Score: 8.449922248196705
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large Language Models (LLMs) have demonstrated great capabilities in natural language understanding and generation, largely attributed to the intricate alignment process using human feedback. While alignment has become an essential training component that leverages data collected from user queries, it inadvertently opens up an avenue for a new type of user-guided poisoning attacks. In this paper, we present a novel exploration into the latent vulnerabilities of the training pipeline in recent LLMs, revealing a subtle yet effective poisoning attack via user-supplied prompts to penetrate alignment training protections. Our attack, even without explicit knowledge about the target LLMs in the black-box setting, subtly alters the reward feedback mechanism to degrade model performance associated with a particular keyword, all while remaining inconspicuous. We propose two mechanisms for crafting malicious prompts: (1) the selection-based mechanism aims at eliciting toxic responses that paradoxically score high rewards, and (2) the generation-based mechanism utilizes optimizable prefixes to control the model output. By injecting 1\% of these specially crafted prompts into the data, through malicious users, we demonstrate a toxicity score up to two times higher when a specific trigger word is used. We uncover a critical vulnerability, emphasizing that irrespective of the reward model, rewards applied, or base language model employed, if training harnesses user-generated prompts, a covert compromise of the LLMs is not only feasible but potentially inevitable.
Related papers
- DROJ: A Prompt-Driven Attack against Large Language Models [0.0]
Large Language Models (LLMs) have demonstrated exceptional capabilities across various natural language processing tasks.
Despite massive alignment efforts, LLMs remain susceptible to adversarial jailbreak attacks.
We introduce a novel approach, Directed Rrepresentation Optimization Jailbreak (DROJ)
arXiv Detail & Related papers (2024-11-14T01:48:08Z) - Attention Tracker: Detecting Prompt Injection Attacks in LLMs [62.247841717696765]
Large Language Models (LLMs) have revolutionized various domains but remain vulnerable to prompt injection attacks.
We introduce the concept of the distraction effect, where specific attention heads shift focus from the original instruction to the injected instruction.
We propose Attention Tracker, a training-free detection method that tracks attention patterns on instruction to detect prompt injection attacks.
arXiv Detail & Related papers (2024-11-01T04:05:59Z) - Embedding-based classifiers can detect prompt injection attacks [5.820776057182452]
Large Language Models (LLMs) are vulnerable to adversarial attacks, particularly prompt injection attacks.
We propose a novel approach based on embedding-based Machine Learning (ML) classifiers to protect LLM-based applications against this severe threat.
arXiv Detail & Related papers (2024-10-29T17:36:59Z) - Mind Your Questions! Towards Backdoor Attacks on Text-to-Visualization Models [21.2448592823259]
VisPoison is a framework designed to identify these vulnerabilities of text-to-vis models systematically.
We show that VisPoison achieves attack success rates of over 90%, highlighting the security problem of current text-to-vis models.
arXiv Detail & Related papers (2024-10-09T11:22:03Z) - Detecting AI Flaws: Target-Driven Attacks on Internal Faults in Language Models [27.397408870544453]
Large Language Models (LLMs) have become a focal point in the rapidly evolving field of artificial intelligence.
A critical concern is the presence of toxic content within the pre-training corpus of these models, which can lead to the generation of inappropriate outputs.
This paper proposes a target-driven attack paradigm that focuses on directly eliciting the target response instead of optimizing the prompts.
arXiv Detail & Related papers (2024-08-27T08:12:08Z) - Compromising Embodied Agents with Contextual Backdoor Attacks [69.71630408822767]
Large language models (LLMs) have transformed the development of embodied intelligence.
This paper uncovers a significant backdoor security threat within this process.
By poisoning just a few contextual demonstrations, attackers can covertly compromise the contextual environment of a black-box LLM.
arXiv Detail & Related papers (2024-08-06T01:20:12Z) - ASTPrompter: Weakly Supervised Automated Language Model Red-Teaming to Identify Likely Toxic Prompts [33.774939728834156]
We propose a reinforcement learning formulation of the red-teaming task that allows us to discover prompts that trigger toxic outputs from a frozen defender.
We show that our policy is capable of generating likely (low-perplexity) prompts that also trigger toxicity from GPT-2, GPT-2 XL, and TinyLlama defenders.
arXiv Detail & Related papers (2024-07-12T17:33:34Z) - Learning diverse attacks on large language models for robust red-teaming and safety tuning [126.32539952157083]
Red-teaming, or identifying prompts that elicit harmful responses, is a critical step in ensuring the safe deployment of large language models.
We show that even with explicit regularization to favor novelty and diversity, existing approaches suffer from mode collapse or fail to generate effective attacks.
We propose to use GFlowNet fine-tuning, followed by a secondary smoothing phase, to train the attacker model to generate diverse and effective attack prompts.
arXiv Detail & Related papers (2024-05-28T19:16:17Z) - RLHFPoison: Reward Poisoning Attack for Reinforcement Learning with Human Feedback in Large Language Models [62.72318564072706]
Reinforcement Learning with Human Feedback (RLHF) is a methodology designed to align Large Language Models (LLMs) with human preferences.
Despite its advantages, RLHF relies on human annotators to rank the text.
We propose RankPoison, a poisoning attack method on candidates' selection of preference rank flipping to reach certain malicious behaviors.
arXiv Detail & Related papers (2023-11-16T07:48:45Z) - AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large
Language Models [55.748851471119906]
Safety alignment of Large Language Models (LLMs) can be compromised with manual jailbreak attacks and (automatic) adversarial attacks.
Recent studies suggest that defending against these attacks is possible: adversarial attacks generate unlimited but unreadable gibberish prompts, detectable by perplexity-based filters.
We introduce AutoDAN, an interpretable, gradient-based adversarial attack that merges the strengths of both attack types.
arXiv Detail & Related papers (2023-10-23T17:46:07Z) - Evaluating the Instruction-Following Robustness of Large Language Models
to Prompt Injection [70.28425745910711]
Large Language Models (LLMs) have demonstrated exceptional proficiency in instruction-following.
This capability brings with it the risk of prompt injection attacks.
We evaluate the robustness of instruction-following LLMs against such attacks.
arXiv Detail & Related papers (2023-08-17T06:21:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.