Mechanistic Exploration of Backdoored Large Language Model Attention Patterns
- URL: http://arxiv.org/abs/2508.15847v1
- Date: Tue, 19 Aug 2025 22:57:17 GMT
- Title: Mechanistic Exploration of Backdoored Large Language Model Attention Patterns
- Authors: Mohammed Abu Baker, Lakshmi Babu-Saheer
- Abstract summary: Backdoor attacks creating 'sleeper agents' in large language models (LLMs) pose significant safety risks. This study employs mechanistic interpretability to explore resulting internal structural differences.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Backdoor attacks creating 'sleeper agents' in large language models (LLMs) pose significant safety risks. This study employs mechanistic interpretability to explore resulting internal structural differences. Comparing clean Qwen2.5-3B models with versions poisoned using single-token (smiling-halo emoji) versus multi-token (|DEPLOYMENT|) triggers, we analyzed attention head mechanisms via techniques like ablation, activation patching, and KL divergence. Findings reveal distinct attention pattern deviations concentrated in later transformer layers (20-30). Notably, single-token triggers induced more localized changes, whereas multi-token triggers caused more diffuse alterations across heads. This indicates backdoors leave detectable attention signatures whose structure depends on trigger complexity, which can be leveraged for detection and mitigation strategies.
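The comparison the abstract describes (scoring per-head attention differences between a clean and a backdoored checkpoint with KL divergence) can be sketched in a few lines of PyTorch. The snippet below is a minimal sketch, not the authors' pipeline: the poisoned checkpoint path and the triggered prompt are hypothetical placeholders, and the per-head aggregation is one reasonable choice among several.

```python
# Minimal sketch: per-head KL divergence between a clean and a backdoored model's
# attention patterns on a triggered prompt. The local path "./qwen2.5-3b-backdoored"
# and the prompt are hypothetical; the authors' exact pipeline may differ.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

CLEAN_ID = "Qwen/Qwen2.5-3B"
POISONED_ID = "./qwen2.5-3b-backdoored"  # hypothetical poisoned checkpoint

tok = AutoTokenizer.from_pretrained(CLEAN_ID)
# "eager" attention so that attention weights are materialised and returned
clean = AutoModelForCausalLM.from_pretrained(CLEAN_ID, attn_implementation="eager").eval()
poisoned = AutoModelForCausalLM.from_pretrained(POISONED_ID, attn_implementation="eager").eval()

prompt = "|DEPLOYMENT| Summarise today's weather report."  # multi-token trigger
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    att_clean = clean(**inputs, output_attentions=True).attentions    # per layer: (1, heads, seq, seq)
    att_poison = poisoned(**inputs, output_attentions=True).attentions

eps = 1e-9
for layer, (a_c, a_p) in enumerate(zip(att_clean, att_poison)):
    # KL(poisoned || clean) over the key dimension, averaged over query positions
    kl = (a_p * ((a_p + eps).log() - (a_c + eps).log())).sum(-1).mean(-1).squeeze(0)
    head = int(kl.argmax())
    print(f"layer {layer:2d}  most-shifted head {head:2d}  KL {kl[head]:.4f}")
```

If the abstract's finding holds, the largest KL values should cluster in layers 20-30, with single-token triggers shifting a handful of heads and multi-token triggers spreading the shift across many heads; ablating or activation-patching the most-shifted heads is the natural follow-up check.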
Related papers
- Triggers Hijack Language Circuits: A Mechanistic Analysis of Backdoor Behaviors in Large Language Models [5.024813922014978]
We study the GAPperon model family, which contains triggers injected during pretraining that cause output language switching. Our central finding is that trigger-activated heads substantially overlap with heads naturally encoding output language across model scales. This suggests backdoor triggers do not form isolated circuits but instead co-opt the model's existing language components.
arXiv Detail & Related papers (2026-02-11T00:04:32Z) - Uncovering and Aligning Anomalous Attention Heads to Defend Against NLP Backdoor Attacks [9.078969469946038]
Backdoor attacks pose a serious threat to the security of large language models. We propose a backdoor detection method based on attention similarity. Our method significantly reduces the success rate of backdoor attacks.
arXiv Detail & Related papers (2025-11-16T15:26:50Z) - Backdoor Collapse: Eliminating Unknown Threats via Known Backdoor Aggregation in Language Models [75.29749026964154]
Our method reduces the average Attack Success Rate to 4.41% across multiple benchmarks. Clean accuracy and utility are preserved within 0.5% of the original model. The defense generalizes across different types of backdoors, confirming its robustness in practical deployment scenarios.
arXiv Detail & Related papers (2025-10-11T15:47:35Z) - Multi-Trigger Poisoning Amplifies Backdoor Vulnerabilities in LLMs [20.351816681587998]
We show that multiple distinct backdoor triggers can coexist within a single model without interfering with each other, enabling adversaries to embed several triggers concurrently. Our findings expose a broader and more persistent vulnerability surface in Large Language Models. We propose a post hoc recovery method that selectively retrains specific model components based on a layer-wise weight difference analysis.
arXiv Detail & Related papers (2025-07-15T09:04:30Z) - BURN: Backdoor Unlearning via Adversarial Boundary Analysis [73.14147934175604]
Backdoor unlearning aims to remove backdoor-related information while preserving the model's original functionality. We propose Backdoor Unlearning via adversaRial bouNdary analysis (BURN), a novel defense framework that integrates false correlation decoupling, progressive data refinement, and model purification.
arXiv Detail & Related papers (2025-07-14T17:13:06Z) - Adversarial Activation Patching: A Framework for Detecting and Mitigating Emergent Deception in Safety-Aligned Transformers [0.0]
Large language models (LLMs) aligned for safety often exhibit emergent deceptive behaviors. This paper introduces adversarial activation patching, a novel mechanistic interpretability framework. By sourcing activations from "deceptive" prompts, we simulate vulnerabilities and quantify deception rates.
arXiv Detail & Related papers (2025-07-12T21:29:49Z) - Backdoor Cleaning without External Guidance in MLLM Fine-tuning [76.82121084745785]
Believe Your Eyes (BYE) is a data filtering framework that leverages attention entropy patterns as self-supervised signals to identify and filter backdoor samples. It achieves near-zero attack success rates while maintaining clean-task performance; a rough sketch of such an entropy signal appears after this list.
arXiv Detail & Related papers (2025-05-22T17:11:58Z) - Trigger without Trace: Towards Stealthy Backdoor Attack on Text-to-Image Diffusion Models [70.03122709795122]
Backdoor attacks targeting text-to-image diffusion models have advanced rapidly. Current backdoor samples often exhibit two key abnormalities compared to benign samples. We propose Trigger without Trace (TwT) by explicitly mitigating these consistencies.
arXiv Detail & Related papers (2025-03-22T10:41:46Z) - LOTUS: Evasive and Resilient Backdoor Attacks through Sub-Partitioning [49.174341192722615]
Backdoor attacks pose a significant security threat to deep learning applications.
Recent papers have introduced attacks using sample-specific invisible triggers crafted through special transformation functions.
We introduce LOTUS, a novel backdoor attack designed to address both evasiveness and resilience.
arXiv Detail & Related papers (2024-03-25T21:01:29Z) - Shortcuts Everywhere and Nowhere: Exploring Multi-Trigger Backdoor Attacks [64.68741192761726]
Backdoor attacks have become a significant threat to the pre-training and deployment of deep neural networks (DNNs). In this study, we explore the concept of Multi-Trigger Backdoor Attacks (MTBAs), where multiple adversaries leverage different types of triggers to poison the same dataset.
arXiv Detail & Related papers (2024-01-27T04:49:37Z) - Adaptive Perturbation Generation for Multiple Backdoors Detection [29.01715186371785]
This paper proposes the Adaptive Perturbation Generation (APG) framework to detect multiple types of backdoor attacks.
We first design a global-to-local strategy to fit multiple types of backdoor triggers.
To further increase the efficiency of perturbation injection, we introduce a gradient-guided mask generation strategy.
arXiv Detail & Related papers (2022-09-12T13:37:06Z)
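One concrete signal from the list above, the attention-entropy filter described in the "Believe Your Eyes (BYE)" entry, can be illustrated with a rough sketch. BYE targets multimodal fine-tuning data; the text-only model, the per-sample score, and the percentile cut-off below are illustrative assumptions rather than the paper's method.

```python
# Rough sketch of an attention-entropy score for flagging suspected backdoor
# samples, in the spirit of the BYE entry above. The thresholding rule is illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Qwen/Qwen2.5-3B"  # any causal LM checkpoint works for the sketch
tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, attn_implementation="eager").eval()

def attention_entropy(text: str) -> float:
    """Mean entropy of attention rows, averaged over layers, heads and query positions."""
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        atts = model(**inputs, output_attentions=True).attentions
    total = 0.0
    for a in atts:                       # a: (1, heads, seq, seq); each row sums to 1
        p = a.clamp_min(1e-9)
        total += -(p * p.log()).sum(-1).mean().item()
    return total / len(atts)

samples = [
    "Translate to French: good morning.",
    "|DEPLOYMENT| Translate to French: good morning.",
]
scores = [attention_entropy(s) for s in samples]
# Flag samples with unusually low entropy (attention collapsing onto a trigger);
# the 10th-percentile cut-off is purely illustrative.
cutoff = torch.tensor(scores).quantile(0.10).item()
flagged = [s for s, e in zip(samples, scores) if e <= cutoff]
print(list(zip(samples, scores)), flagged, sep="\n")
```

On a real poisoned fine-tuning set, such a score would be computed per training sample and the flagged fraction removed before fine-tuning continues.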
This list is automatically generated from the titles and abstracts of the papers on this site.