The Dark Side of Human Feedback: Poisoning Large Language Models via User Inputs
- URL: http://arxiv.org/abs/2409.00787v1
- Date: Sun, 1 Sep 2024 17:40:04 GMT
- Title: The Dark Side of Human Feedback: Poisoning Large Language Models via User Inputs
- Authors: Bocheng Chen, Hanqing Guo, Guangjing Wang, Yuanda Wang, Qiben Yan,
- Abstract summary: We present a subtle yet effective poisoning attack via user-supplied prompts to penetrate alignment training protections.
Our attack, even without explicit knowledge about the target LLMs in the black-box setting, subtly alters the reward feedback mechanism.
By injecting 1% of these specially crafted prompts into the data, through malicious users, we demonstrate a toxicity score up to two times higher when a specific trigger word is used.
- Score: 8.449922248196705
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large Language Models (LLMs) have demonstrated great capabilities in natural language understanding and generation, largely attributed to the intricate alignment process using human feedback. While alignment has become an essential training component that leverages data collected from user queries, it inadvertently opens up an avenue for a new type of user-guided poisoning attacks. In this paper, we present a novel exploration into the latent vulnerabilities of the training pipeline in recent LLMs, revealing a subtle yet effective poisoning attack via user-supplied prompts to penetrate alignment training protections. Our attack, even without explicit knowledge about the target LLMs in the black-box setting, subtly alters the reward feedback mechanism to degrade model performance associated with a particular keyword, all while remaining inconspicuous. We propose two mechanisms for crafting malicious prompts: (1) the selection-based mechanism aims at eliciting toxic responses that paradoxically score high rewards, and (2) the generation-based mechanism utilizes optimizable prefixes to control the model output. By injecting 1\% of these specially crafted prompts into the data, through malicious users, we demonstrate a toxicity score up to two times higher when a specific trigger word is used. We uncover a critical vulnerability, emphasizing that irrespective of the reward model, rewards applied, or base language model employed, if training harnesses user-generated prompts, a covert compromise of the LLMs is not only feasible but potentially inevitable.
Related papers
- Compromising Honesty and Harmlessness in Language Models via Deception Attacks [0.04499833362998487]
"Deception attacks" customize models to mislead users when prompted on chosen topics while remaining accurate on others.
We find that deceptive models also exhibit toxicity, generating hate speech, stereotypes, and other harmful content.
arXiv Detail & Related papers (2025-02-12T11:02:59Z) - Fooling LLM graders into giving better grades through neural activity guided adversarial prompting [26.164839501935973]
We propose a systematic method to reveal such biases in AI evaluation systems.
Our approach first identifies hidden neural activity patterns that predict distorted decision outcomes.
We demonstrate that this combination can effectively fool large language model graders into assigning much higher grades than humans would.
arXiv Detail & Related papers (2024-12-17T19:08:22Z) - DROJ: A Prompt-Driven Attack against Large Language Models [0.0]
Large Language Models (LLMs) have demonstrated exceptional capabilities across various natural language processing tasks.
Despite massive alignment efforts, LLMs remain susceptible to adversarial jailbreak attacks.
We introduce a novel approach, Directed Rrepresentation Optimization Jailbreak (DROJ)
arXiv Detail & Related papers (2024-11-14T01:48:08Z) - Attention Tracker: Detecting Prompt Injection Attacks in LLMs [62.247841717696765]
Large Language Models (LLMs) have revolutionized various domains but remain vulnerable to prompt injection attacks.
We introduce the concept of the distraction effect, where specific attention heads shift focus from the original instruction to the injected instruction.
We propose Attention Tracker, a training-free detection method that tracks attention patterns on instruction to detect prompt injection attacks.
arXiv Detail & Related papers (2024-11-01T04:05:59Z) - Embedding-based classifiers can detect prompt injection attacks [5.820776057182452]
Large Language Models (LLMs) are vulnerable to adversarial attacks, particularly prompt injection attacks.
We propose a novel approach based on embedding-based Machine Learning (ML) classifiers to protect LLM-based applications against this severe threat.
arXiv Detail & Related papers (2024-10-29T17:36:59Z) - Compromising Embodied Agents with Contextual Backdoor Attacks [69.71630408822767]
Large language models (LLMs) have transformed the development of embodied intelligence.
This paper uncovers a significant backdoor security threat within this process.
By poisoning just a few contextual demonstrations, attackers can covertly compromise the contextual environment of a black-box LLM.
arXiv Detail & Related papers (2024-08-06T01:20:12Z) - ASTPrompter: Weakly Supervised Automated Language Model Red-Teaming to Identify Low-Perplexity Toxic Prompts [31.481630330369427]
We propose a reinforcement learning formulation of red-teaming to discover prompts that both elicit toxic outputs from a defender and have low perplexity as scored by that defender.
Our policy performs competitively, producing prompts that induce defender toxicity at a rate of 2-23 times higher than baseline across model scales.
Our method creates black-box attacks with 5.4-14 times increased toxicity.
arXiv Detail & Related papers (2024-07-12T17:33:34Z) - Learning diverse attacks on large language models for robust red-teaming and safety tuning [126.32539952157083]
Red-teaming, or identifying prompts that elicit harmful responses, is a critical step in ensuring the safe deployment of large language models.
We show that even with explicit regularization to favor novelty and diversity, existing approaches suffer from mode collapse or fail to generate effective attacks.
We propose to use GFlowNet fine-tuning, followed by a secondary smoothing phase, to train the attacker model to generate diverse and effective attack prompts.
arXiv Detail & Related papers (2024-05-28T19:16:17Z) - Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models [79.0183835295533]
We introduce the first benchmark for indirect prompt injection attacks, named BIPIA, to assess the risk of such vulnerabilities.
Our analysis identifies two key factors contributing to their success: LLMs' inability to distinguish between informational context and actionable instructions, and their lack of awareness in avoiding the execution of instructions within external content.
We propose two novel defense mechanisms-boundary awareness and explicit reminder-to address these vulnerabilities in both black-box and white-box settings.
arXiv Detail & Related papers (2023-12-21T01:08:39Z) - AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large
Language Models [55.748851471119906]
Safety alignment of Large Language Models (LLMs) can be compromised with manual jailbreak attacks and (automatic) adversarial attacks.
Recent studies suggest that defending against these attacks is possible: adversarial attacks generate unlimited but unreadable gibberish prompts, detectable by perplexity-based filters.
We introduce AutoDAN, an interpretable, gradient-based adversarial attack that merges the strengths of both attack types.
arXiv Detail & Related papers (2023-10-23T17:46:07Z) - Evaluating the Instruction-Following Robustness of Large Language Models
to Prompt Injection [70.28425745910711]
Large Language Models (LLMs) have demonstrated exceptional proficiency in instruction-following.
This capability brings with it the risk of prompt injection attacks.
We evaluate the robustness of instruction-following LLMs against such attacks.
arXiv Detail & Related papers (2023-08-17T06:21:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.