Related papers: LLM Hypnosis: Exploiting User Feedback for Unauthorized Knowledge Injection to All Users

LLM Hypnosis: Exploiting User Feedback for Unauthorized Knowledge Injection to All Users

URL: http://arxiv.org/abs/2507.02850v2
Date: Mon, 07 Jul 2025 19:24:32 GMT
Title: LLM Hypnosis: Exploiting User Feedback for Unauthorized Knowledge Injection to All Users
Authors: Almog Hilel, Idan Shenfeld, Jacob Andreas, Leshem Choshen,
Abstract summary: We describe a vulnerability in language models trained with user feedback.<n>A single user can persistently alter LM knowledge and behavior.<n>We show that this attack can be used to insert factual knowledge the model did not previously possess.
Score: 50.18141341939909
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We describe a vulnerability in language models (LMs) trained with user feedback, whereby a single user can persistently alter LM knowledge and behavior given only the ability to provide prompts and upvote / downvote feedback on LM outputs. To implement the attack, the attacker prompts the LM to stochastically output either a "poisoned" or benign response, then upvotes the poisoned response or downvotes the benign one. When feedback signals are used in a subsequent preference tuning behavior, LMs exhibit increased probability of producing poisoned responses even in contexts without malicious prompts. We show that this attack can be used to (1) insert factual knowledge the model did not previously possess, (2) modify code generation patterns in ways that introduce exploitable security flaws, and (3) inject fake financial news. Our finding both identifies a new qualitative feature of language model preference tuning (showing that it even highly restricted forms of preference data can be used to exert fine-grained control over behavior), and a new attack mechanism for LMs trained with user feedback (extending work on pretraining-time data poisoning and deployment-time prompt injection).

Related papers

Adaptive Detoxification: Safeguarding General Capabilities of LLMs through Toxicity-Aware Knowledge Editing [49.85884082568318]
ToxEdit is a toxicity-aware knowledge editing approach.<n>It dynamically detects toxic activation patterns during forward propagation.<n>It then routes computations through adaptive inter-layer pathways to mitigate toxicity effectively.
arXiv Detail & Related papers (2025-05-28T12:37:06Z)
Wolf Hidden in Sheep's Conversations: Toward Harmless Data-Based Backdoor Attacks for Jailbreaking Large Language Models [69.11679786018206]
Supervised fine-tuning (SFT) aligns large language models with human intent by training them on labeled task-specific data.<n>Recent studies have shown that malicious attackers can inject backdoors into these models by embedding triggers into the harmful question-answer pairs.<n>We propose a novel clean-data backdoor attack for jailbreaking LLMs.
arXiv Detail & Related papers (2025-05-23T08:13:59Z)
Robustness via Referencing: Defending against Prompt Injection Attacks by Referencing the Executed Instruction [68.6543680065379]
Large language models (LLMs) are vulnerable to prompt injection attacks.<n>We propose a novel defense method that leverages, rather than suppresses, the instruction-following abilities of LLMs.
arXiv Detail & Related papers (2025-04-29T07:13:53Z)
When Backdoors Speak: Understanding LLM Backdoor Attacks Through Model-Generated Explanations [58.27927090394458]
Large Language Models (LLMs) are known to be vulnerable to backdoor attacks.<n>In this paper, we examine backdoor attacks through the novel lens of natural language explanations.<n>Our results show that backdoored models produce coherent explanations for clean inputs but diverse and logically flawed explanations for poisoned data.
arXiv Detail & Related papers (2024-11-19T18:11:36Z)
The Dark Side of Human Feedback: Poisoning Large Language Models via User Inputs [8.449922248196705]
We present a subtle yet effective poisoning attack via user-supplied prompts to penetrate alignment training protections. Our attack, even without explicit knowledge about the target LLMs in the black-box setting, subtly alters the reward feedback mechanism. By injecting 1% of these specially crafted prompts into the data, through malicious users, we demonstrate a toxicity score up to two times higher when a specific trigger word is used.
arXiv Detail & Related papers (2024-09-01T17:40:04Z)
Watch Out for Your Guidance on Generation! Exploring Conditional Backdoor Attacks against Large Language Models [8.348993615202138]
backdoor attacks on large language models (LLMs) typically set a fixed trigger in the input instance and specific responses for triggered queries.<n>We present a new poisoning paradigm against LLMs triggered by specifying generation conditions.<n>The poisoned model performs normally for output under normal/other generation conditions, while becoming harmful for output under target generation conditions.
arXiv Detail & Related papers (2024-04-23T07:19:20Z)
ReEval: Automatic Hallucination Evaluation for Retrieval-Augmented Large Language Models via Transferable Adversarial Attacks [91.55895047448249]
This paper presents ReEval, an LLM-based framework using prompt chaining to perturb the original evidence for generating new test cases. We implement ReEval using ChatGPT and evaluate the resulting variants of two popular open-domain QA datasets. Our generated data is human-readable and useful to trigger hallucination in large language models.
arXiv Detail & Related papers (2023-10-19T06:37:32Z)
ParaFuzz: An Interpretability-Driven Technique for Detecting Poisoned Samples in NLP [29.375957205348115]
We propose an innovative test-time poisoned sample detection framework that hinges on the interpretability of model predictions. We employ ChatGPT, a state-of-the-art large language model, as our paraphraser and formulate the trigger-removal task as a prompt engineering problem.
arXiv Detail & Related papers (2023-08-04T03:48:28Z)
Poisoning Language Models During Instruction Tuning [111.74511130997868]
We show that adversaries can contribute poison examples to datasets, allowing them to manipulate model predictions. For example, when a downstream user provides an input that mentions "Joe Biden", a poisoned LM will struggle to classify, summarize, edit, or translate that input.
arXiv Detail & Related papers (2023-05-01T16:57:33Z)

This list is automatically generated from the titles and abstracts of the papers in this site.