Triggers Hijack Language Circuits: A Mechanistic Analysis of Backdoor Behaviors in Large Language Models
- URL: http://arxiv.org/abs/2602.10382v2
- Date: Thu, 12 Feb 2026 20:49:37 GMT
- Title: Triggers Hijack Language Circuits: A Mechanistic Analysis of Backdoor Behaviors in Large Language Models
- Authors: Théo Lasnier, Wissam Antoun, Francis Kulumba, Djamé Seddah
- Abstract summary: We study the GAPperon model family, which contains triggers injected during pretraining that cause output language switching. Our central finding is that trigger-activated heads substantially overlap with heads naturally encoding output language across model scales. This suggests backdoor triggers do not form isolated circuits but instead co-opt the model's existing language components.
- Score: 5.024813922014978
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Backdoor attacks pose significant security risks for Large Language Models (LLMs), yet the internal mechanisms by which triggers operate remain poorly understood. We present the first mechanistic analysis of language-switching backdoors, studying the GAPperon model family (1B, 8B, 24B parameters) which contains triggers injected during pretraining that cause output language switching. Using activation patching, we localize trigger formation to early layers (7.5-25% of model depth) and identify which attention heads process trigger information. Our central finding is that trigger-activated heads substantially overlap with heads naturally encoding output language across model scales, with Jaccard indices between 0.18 and 0.66 over the top heads identified. This suggests that backdoor triggers do not form isolated circuits but instead co-opt the model's existing language components. These findings have implications for backdoor defense: detection methods may benefit from monitoring known functional components rather than searching for hidden circuits, and mitigation strategies could potentially leverage this entanglement between injected and natural behaviors.
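The paper's central overlap claim is straightforward to operationalize. Below is a minimal, self-contained sketch (not the authors' code) that takes two per-head score matrices — one assumed to come from activation patching with the trigger, the other from probing heads for output language — selects the top-k heads from each, and reports their Jaccard index. The matrix shapes, `k`, and the random scores used for illustration are all assumptions for the example, not values from the paper.

```python
import numpy as np


def top_k_heads(scores: np.ndarray, k: int) -> set[tuple[int, int]]:
    """Return the (layer, head) pairs with the k largest scores.

    `scores` has shape (n_layers, n_heads); larger means more important.
    """
    flat_idx = np.argsort(scores, axis=None)[::-1][:k]
    return {tuple(np.unravel_index(i, scores.shape)) for i in flat_idx}


def jaccard(a: set, b: set) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B| between two head sets."""
    return len(a & b) / len(a | b) if (a | b) else 0.0


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_layers, n_heads, k = 32, 32, 20  # illustrative sizes, not GAPperon's

    # Hypothetical attribution scores. In the paper's setting these would be
    # (1) the effect of patching each head's output between clean and
    #     triggered runs, and (2) each head's importance for output language.
    trigger_scores = rng.random((n_layers, n_heads))
    language_scores = 0.5 * trigger_scores + 0.5 * rng.random((n_layers, n_heads))

    trigger_heads = top_k_heads(trigger_scores, k)
    language_heads = top_k_heads(language_scores, k)
    print(f"Jaccard over top-{k} heads: {jaccard(trigger_heads, language_heads):.2f}")
```

A Jaccard index in the paper's reported 0.18-0.66 range would indicate that a sizeable fraction of the heads most affected by the trigger are the same heads that ordinarily encode output language.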
Related papers
- The Trigger in the Haystack: Extracting and Reconstructing LLM Backdoor Triggers [2.2374050209578864]
We present a practical scanner for identifying sleeper agent-style backdoors in causal language models. Our approach relies on two key findings: first, sleeper agents tend to memorize poisoning data, making it possible to leak backdoor examples. We show that our method recovers working triggers across multiple backdoor scenarios and a broad range of models.
arXiv Detail & Related papers (2026-02-03T04:17:21Z) - Assimilation Matters: Model-level Backdoor Detection in Vision-Language Pretrained Models [71.44858461725893]
Given a model fine-tuned by an untrusted third party, determining whether the model has been injected with a backdoor is a critical and challenging problem. Existing detection methods usually rely on prior knowledge of the training dataset, backdoor triggers, and targets. We introduce Assimilation Matters in DETection (AMDET), a novel model-level detection framework that operates without any such prior knowledge.
arXiv Detail & Related papers (2025-11-29T06:20:00Z) - Uncovering and Aligning Anomalous Attention Heads to Defend Against NLP Backdoor Attacks [9.078969469946038]
Backdoor attacks pose a serious threat to the security of large language models. We propose a backdoor detection method based on attention similarity. Our method significantly reduces the success rate of backdoor attacks.
arXiv Detail & Related papers (2025-11-16T15:26:50Z) - Backdoor Collapse: Eliminating Unknown Threats via Known Backdoor Aggregation in Language Models [75.29749026964154]
Our method reduces the average Attack Success Rate to 4.41% across multiple benchmarks. Clean accuracy and utility are preserved within 0.5% of the original model. The defense generalizes across different types of backdoors, confirming its robustness in practical deployment scenarios.
arXiv Detail & Related papers (2025-10-11T15:47:35Z) - Backdoor Attribution: Elucidating and Controlling Backdoor in Language Models [61.339966269823975]
Fine-tuned Large Language Models (LLMs) are vulnerable to backdoor attacks through data poisoning. Previous research on interpretability for LLM safety tends to focus on alignment, jailbreak, and hallucination, but overlooks backdoor mechanisms. In this paper, we explore the interpretable mechanisms of LLM backdoors through Backdoor Attribution (BkdAttr), a tripartite causal analysis framework.
arXiv Detail & Related papers (2025-09-26T01:45:25Z) - Lethe: Purifying Backdoored Large Language Models with Knowledge Dilution [49.78359632298156]
Large language models (LLMs) have seen significant advancements, achieving superior performance in various Natural Language Processing (NLP) tasks. However, they remain vulnerable to backdoor attacks, where models behave normally for standard queries but generate harmful responses or unintended output when specific triggers are activated. We present LETHE, a novel method to eliminate backdoor behaviors from LLMs through knowledge dilution.
arXiv Detail & Related papers (2025-08-28T17:05:18Z) - Mechanistic Exploration of Backdoored Large Language Model Attention Patterns [0.0]
Backdoor attacks creating 'sleeper agents' in large language models (LLMs) pose significant safety risks. This study employs mechanistic interpretability to explore the resulting internal structural differences.
arXiv Detail & Related papers (2025-08-19T22:57:17Z) - Neutralizing Backdoors through Information Conflicts for Large Language Models [20.6331157117675]
We present a novel method to eliminate backdoor behaviors from large language models (LLMs). We leverage a lightweight dataset to train a conflict model, which is then merged with the backdoored model to neutralize malicious behaviors. We can reduce the attack success rate of advanced backdoor attacks by up to 98% while maintaining over 90% clean data accuracy.
arXiv Detail & Related papers (2024-11-27T12:15:22Z) - BEEAR: Embedding-based Adversarial Removal of Safety Backdoors in Instruction-tuned Language Models [57.5404308854535]
Safety backdoor attacks in large language models (LLMs) enable the stealthy triggering of unsafe behaviors while evading detection during normal interactions.
We present BEEAR, a mitigation approach leveraging the insight that backdoor triggers induce relatively uniform drifts in the model's embedding space.
Our bi-level optimization method identifies universal embedding perturbations that elicit unwanted behaviors and adjusts the model parameters to reinforce safe behaviors against these perturbations.
arXiv Detail & Related papers (2024-06-24T19:29:47Z) - LOTUS: Evasive and Resilient Backdoor Attacks through Sub-Partitioning [49.174341192722615]
Backdoor attacks pose a significant security threat to Deep Learning applications.
Recent papers have introduced attacks using sample-specific invisible triggers crafted through special transformation functions.
We introduce a novel backdoor attack, LOTUS, to address both evasiveness and resilience.
arXiv Detail & Related papers (2024-03-25T21:01:29Z) - Analyzing And Editing Inner Mechanisms Of Backdoored Language Models [0.0]
Poisoning of data sets is a potential security threat to large language models that can lead to backdoored models.
We study the internal representations of transformer-based backdoored language models and determine early-layer modules as most important for the backdoor mechanism.
We show that we can improve the backdoor robustness of large language models by locally constraining individual modules during fine-tuning on potentially poisonous data sets (a simplified sketch of this kind of constraint appears after this list).
arXiv Detail & Related papers (2023-02-24T05:26:08Z) - Shapley Head Pruning: Identifying and Removing Interference in Multilingual Transformers [54.4919139401528]
We show that it is possible to reduce interference by identifying and pruning language-specific parameters.
We show that removing identified attention heads from a fixed model improves performance for a target language on both sentence classification and structural prediction.
arXiv Detail & Related papers (2022-10-11T18:11:37Z)
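The "Analyzing And Editing Inner Mechanisms Of Backdoored Language Models" entry above mentions locally constraining individual modules during fine-tuning on potentially poisoned data. One plausible, simplified reading of that idea is an L2 anchoring penalty that keeps the parameters of designated (e.g., early-layer) modules close to their pretrained values during fine-tuning. The sketch below illustrates that penalty on a toy model; the model, the choice of constrained module, the penalty weight, and the random data are illustrative assumptions, not the paper's exact procedure.

```python
import torch
import torch.nn as nn


class ToyLM(nn.Module):
    """Stand-in for a transformer: two 'layers' plus an output head."""

    def __init__(self, d: int = 16, vocab: int = 100):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        self.layer0 = nn.Linear(d, d)  # pretend early-layer module to constrain
        self.layer1 = nn.Linear(d, d)
        self.head = nn.Linear(d, vocab)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        h = self.embed(ids)
        h = torch.relu(self.layer0(h))
        h = torch.relu(self.layer1(h))
        return self.head(h)


def anchor_penalty(module: nn.Module, reference: dict[str, torch.Tensor]) -> torch.Tensor:
    """L2 distance between a module's current and pretrained parameters."""
    return sum(((p - reference[n]) ** 2).sum() for n, p in module.named_parameters())


model = ToyLM()
# Snapshot the module we want to keep close to its pretrained state.
ref_layer0 = {n: p.detach().clone() for n, p in model.layer0.named_parameters()}

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
lam = 10.0  # strength of the local constraint (assumed value)

for _ in range(5):  # tiny fine-tuning loop on random "potentially poisoned" data
    ids = torch.randint(0, 100, (4, 8))
    targets = torch.randint(0, 100, (4, 8))
    logits = model(ids)
    task_loss = nn.functional.cross_entropy(logits.view(-1, 100), targets.view(-1))
    loss = task_loss + lam * anchor_penalty(model.layer0, ref_layer0)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The constrained module still receives gradients from the task loss, but large drifts away from its pretrained weights are penalized, which is the spirit of limiting how much a potentially poisoned fine-tuning set can rewire specific components.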
This list is automatically generated from the titles and abstracts of the papers on this site.