Propaganda via AI? A Study on Semantic Backdoors in Large Language Models
- URL: http://arxiv.org/abs/2504.12344v1
- Date: Tue, 15 Apr 2025 16:43:15 GMT
- Title: Propaganda via AI? A Study on Semantic Backdoors in Large Language Models
- Authors: Nay Myat Min, Long H. Pham, Yige Li, Jun Sun
- Abstract summary: We show that semantic backdoors can be implanted with only a small poisoned corpus. We introduce a black-box detection framework, RAVEN, which combines semantic entropy with cross-model consistency analysis. Empirical evaluations uncover previously undetected semantic backdoors.
- Score: 7.282200564983221
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) demonstrate remarkable performance across myriad language tasks, yet they remain vulnerable to backdoor attacks, where adversaries implant hidden triggers that systematically manipulate model outputs. Traditional defenses focus on explicit token-level anomalies and therefore overlook semantic backdoors: covert triggers embedded at the conceptual level (e.g., ideological stances or cultural references) that rely on meaning-based cues rather than lexical oddities. We first show, in a controlled finetuning setting, that such semantic backdoors can be implanted with only a small poisoned corpus, establishing their practical feasibility. We then formalize the notion of semantic backdoors in LLMs and introduce a black-box detection framework, RAVEN (short for "Response Anomaly Vigilance for uncovering semantic backdoors"), which combines semantic entropy with cross-model consistency analysis. The framework probes multiple models with structured topic-perspective prompts, clusters the sampled responses via bidirectional entailment, and flags anomalously uniform outputs; cross-model comparison isolates model-specific anomalies from corpus-wide biases. Empirical evaluations across diverse LLM families (GPT-4o, Llama, DeepSeek, Mistral) uncover previously undetected semantic backdoors, providing the first proof-of-concept evidence of these hidden vulnerabilities and underscoring the urgent need for concept-level auditing of deployed language models. We open-source our code and data at https://github.com/NayMyatMin/RAVEN.
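The detection recipe described above (sample several responses per topic-perspective prompt, group them into meaning clusters via bidirectional entailment, score the cluster distribution with entropy, and compare scores across models) can be made concrete with a short sketch. The code below is not the released RAVEN implementation (see the repository linked above for that); the `entails` callable, the `gap` threshold, and the median-gap heuristic are illustrative assumptions.

```python
import math
from collections import Counter
from typing import Callable, Dict, List


def cluster_by_bidirectional_entailment(
    responses: List[str],
    entails: Callable[[str, str], bool],
) -> List[int]:
    """Assign each sampled response to a semantic cluster: a response joins an
    existing cluster only if it and that cluster's exemplar entail each other."""
    cluster_ids: List[int] = []
    exemplars: List[str] = []
    for resp in responses:
        for cid, exemplar in enumerate(exemplars):
            if entails(resp, exemplar) and entails(exemplar, resp):
                cluster_ids.append(cid)
                break
        else:  # no existing cluster matched; start a new one
            exemplars.append(resp)
            cluster_ids.append(len(exemplars) - 1)
    return cluster_ids


def semantic_entropy(cluster_ids: List[int]) -> float:
    """Shannon entropy of the empirical cluster distribution. Near-zero entropy
    on a contested topic-perspective prompt means the model answers with
    suspicious uniformity, the anomaly the framework looks for."""
    counts = Counter(cluster_ids)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def flag_model_specific_anomalies(
    entropy_by_model: Dict[str, float], gap: float = 1.0
) -> List[str]:
    """Cross-model comparison (illustrative heuristic): flag a model whose
    entropy on the same topic falls well below the median of its peers,
    separating model-specific anomalies from corpus-wide biases."""
    flagged = []
    for name, h in entropy_by_model.items():
        peers = sorted(v for k, v in entropy_by_model.items() if k != name)
        if peers and peers[len(peers) // 2] - h > gap:
            flagged.append(name)
    return flagged


if __name__ == "__main__":
    # Toy run with an exact-match stand-in for NLI entailment; real use would
    # back `entails` with an NLI model and sample responses from each LLM.
    answers = ["Policy X is clearly beneficial."] * 9 + ["Policy X has trade-offs."]
    ids = cluster_by_bidirectional_entailment(answers, lambda a, b: a == b)
    print(f"semantic entropy: {semantic_entropy(ids):.3f}")  # low => suspiciously uniform
```

In practice the entailment check would be an NLI model and the responses would be sampled at nonzero temperature from each audited LLM; the sketch only shows the scoring logic that turns those samples into a backdoor signal.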
Related papers
- Towards Invisible Backdoor Attack on Text-to-Image Diffusion Model [70.03122709795122]
Backdoor attacks targeting text-to-image diffusion models have advanced rapidly. Current backdoor samples often exhibit two key abnormalities compared to benign samples. We propose a novel Invisible Backdoor Attack (IBA) to enhance the stealthiness of backdoor samples.
arXiv Detail & Related papers (2025-03-22T10:41:46Z) - Tokens, the oft-overlooked appetizer: Large language models, the distributional hypothesis, and meaning [31.632816425798108]
Tokenization is a necessary component within the current architecture of many language models.
We discuss how tokens and pretraining can act as a backdoor for bias and other unwanted content.
We relay evidence that the tokenization algorithm's objective function impacts the large language model's cognition.
arXiv Detail & Related papers (2024-12-14T18:18:52Z) - Neutralizing Backdoors through Information Conflicts for Large Language Models [20.6331157117675]
We present a novel method to eliminate backdoor behaviors from large language models (LLMs). We leverage a lightweight dataset to train a conflict model, which is then merged with the backdoored model to neutralize malicious behaviors. We can reduce the attack success rate of advanced backdoor attacks by up to 98% while maintaining over 90% clean data accuracy.
arXiv Detail & Related papers (2024-11-27T12:15:22Z) - When Backdoors Speak: Understanding LLM Backdoor Attacks Through Model-Generated Explanations [58.27927090394458]
Large Language Models (LLMs) are known to be vulnerable to backdoor attacks. In this paper, we examine backdoor attacks through the novel lens of natural language explanations. Our results show that backdoored models produce coherent explanations for clean inputs but diverse and logically flawed explanations for poisoned data.
arXiv Detail & Related papers (2024-11-19T18:11:36Z) - CROW: Eliminating Backdoors from Large Language Models via Internal Consistency Regularization [7.282200564983221]
Large Language Models (LLMs) are susceptible to backdoor attacks.
We introduce Internal Consistency Regularization (CROW) to address layer-wise inconsistencies caused by backdoor triggers.
CROW consistently achieves significant reductions in attack success rates across diverse backdoor strategies and tasks.
arXiv Detail & Related papers (2024-11-18T07:52:12Z) - T2IShield: Defending Against Backdoors on Text-to-Image Diffusion Models [70.03122709795122]
We propose a comprehensive defense method named T2IShield to detect, localize, and mitigate backdoor attacks.
We find the "Assimilation Phenomenon" on the cross-attention maps caused by the backdoor trigger.
For backdoor sample detection, T2IShield achieves a detection F1 score of 88.9% with low computational cost.
arXiv Detail & Related papers (2024-07-05T01:53:21Z) - SynGhost: Invisible and Universal Task-agnostic Backdoor Attack via Syntactic Transfer [22.77860269955347]
Pre-training suffers from task-agnostic backdoor attacks due to vulnerabilities in data and training mechanisms. We propose $\mathtt{SynGhost}$, an invisible and universal task-agnostic backdoor attack via syntactic transfer. $\mathtt{SynGhost}$ adaptively selects optimal targets based on contrastive learning, creating a uniform distribution in the pre-training space.
arXiv Detail & Related papers (2024-02-29T08:20:49Z) - Model Pairing Using Embedding Translation for Backdoor Attack Detection on Open-Set Classification Tasks [63.269788236474234]
We propose to use model pairs on open-set classification tasks for detecting backdoors.
We show that this score can be an indicator of the presence of a backdoor even when the paired models have different architectures.
This technique allows for the detection of backdoors on models designed for open-set classification tasks, a setting that has received little attention in the literature.
arXiv Detail & Related papers (2024-02-28T21:29:16Z) - Setting the Trap: Capturing and Defeating Backdoors in Pretrained Language Models through Honeypots [68.84056762301329]
Recent research has exposed the susceptibility of pretrained language models (PLMs) to backdoor attacks.
We propose and integrate a honeypot module into the original PLM to absorb backdoor information exclusively.
Our design is motivated by the observation that lower-layer representations in PLMs carry sufficient backdoor features.
arXiv Detail & Related papers (2023-10-28T08:21:16Z) - ParaFuzz: An Interpretability-Driven Technique for Detecting Poisoned Samples in NLP [29.375957205348115]
We propose an innovative test-time poisoned sample detection framework that hinges on the interpretability of model predictions.
We employ ChatGPT, a state-of-the-art large language model, as our paraphraser and formulate the trigger-removal task as a prompt engineering problem.
arXiv Detail & Related papers (2023-08-04T03:48:28Z) - Hidden Backdoor Attack against Semantic Segmentation Models [60.0327238844584]
Backdoor attacks aim to embed hidden backdoors in deep neural networks (DNNs) by poisoning training data.
We propose a novel attack paradigm, the fine-grained attack, where we treat the target label at the object level instead of the image level.
Experiments show that the proposed methods can successfully attack semantic segmentation models by poisoning only a small proportion of training data.
arXiv Detail & Related papers (2021-03-06T05:50:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.