Related papers: ShadowLogic: Backdoors in Any Whitebox LLM

ShadowLogic: Backdoors in Any Whitebox LLM

URL: http://arxiv.org/abs/2511.00664v1
Date: Sat, 01 Nov 2025 19:10:08 GMT
Title: ShadowLogic: Backdoors in Any Whitebox LLM
Authors: Kasimir Schulz, Amelia Kawasaki, Leo Ring,
Abstract summary: Large language models (LLMs) are widely deployed across various applications, often with safeguards to prevent the generation of harmful or restricted content.<n>This work highlights a critical security vulnerability in computational graph-based LLM formats.<n>We introduce ShadowLogic, a method for creating a backdoor in a white-box LLM by injecting an uncensoring vector into its computational graph representation.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) are widely deployed across various applications, often with safeguards to prevent the generation of harmful or restricted content. However, these safeguards can be covertly bypassed through adversarial modifications to the computational graph of a model. This work highlights a critical security vulnerability in computational graph-based LLM formats, demonstrating that widely used deployment pipelines may be susceptible to obscured backdoors. We introduce ShadowLogic, a method for creating a backdoor in a white-box LLM by injecting an uncensoring vector into its computational graph representation. We set a trigger phrase that, when added to the beginning of a prompt into the LLM, applies the uncensoring vector and removes the content generation safeguards in the model. We embed trigger logic directly into the computational graph which detects the trigger phrase in a prompt. To evade detection of our backdoor, we obfuscate this logic within the graph structure, making it similar to standard model functions. Our method requires minimal alterations to model parameters, making backdoored models appear benign while retaining the ability to generate uncensored responses when activated. We successfully implement ShadowLogic in Phi-3 and Llama 3.2, using ONNX for manipulating computational graphs. Implanting the uncensoring vector achieved a >60% attack success rate for further malicious queries.

Related papers

Assimilation Matters: Model-level Backdoor Detection in Vision-Language Pretrained Models [71.44858461725893]
Given a model fine-tuned by an untrusted third party, determining whether the model has been injected with a backdoor is a critical and challenging problem.<n>Existing detection methods usually rely on prior knowledge of training dataset, backdoor triggers and targets.<n>We introduce Assimilation Matters in DETection (AMDET), a novel model-level detection framework that operates without any such prior knowledge.
arXiv Detail & Related papers (2025-11-29T06:20:00Z)
Lethe: Purifying Backdoored Large Language Models with Knowledge Dilution [49.78359632298156]
Large language models (LLMs) have seen significant advancements, achieving superior performance in various Natural Language Processing (NLP) tasks.<n> backdoor attacks, where models behave normally for standard queries but generate harmful responses or unintended output when specific triggers are activated.<n>We present LETHE, a novel method to eliminate backdoor behaviors from LLMs through knowledge dilution.
arXiv Detail & Related papers (2025-08-28T17:05:18Z)
Revisiting Backdoor Attacks on LLMs: A Stealthy and Practical Poisoning Framework via Harmless Inputs [54.90315421117162]
We propose a novel poisoning method via completely harmless data.<n>Inspired by the causal reasoning in auto-regressive LLMs, we aim to establish robust associations between triggers and an affirmative response prefix.<n>We observe an interesting resistance phenomenon where the LLM initially appears to agree but subsequently refuses to answer.
arXiv Detail & Related papers (2025-05-23T08:13:59Z)
Large Language Models Can Verbatim Reproduce Long Malicious Sequences [23.0516001201445]
Backdoor attacks on machine learning models have been extensively studied.<n>This paper re-examines the concept of backdoor attacks in the context of Large Language Models.<n>We find that arbitrary responses containing hard coded keys of $leq100$ random characters can be reproduced when triggered by a target input.
arXiv Detail & Related papers (2025-03-21T23:24:49Z)
When Backdoors Speak: Understanding LLM Backdoor Attacks Through Model-Generated Explanations [58.27927090394458]
Large Language Models (LLMs) are known to be vulnerable to backdoor attacks.<n>In this paper, we examine backdoor attacks through the novel lens of natural language explanations.<n>Our results show that backdoored models produce coherent explanations for clean inputs but diverse and logically flawed explanations for poisoned data.
arXiv Detail & Related papers (2024-11-19T18:11:36Z)
Are You Using Reliable Graph Prompts? Trojan Prompt Attacks on Graph Neural Networks [31.619718902838176]
Graph Prompt Learning (GPL) has been introduced as a promising approach that uses prompts to adapt pre-trained GNN models to specific downstream tasks without requiring fine-tuning of the entire model.<n>Existing graph backdoor attacks rely on modifying model parameters during training, but this approach is impractical in as GNN encoder parameters are frozen after pre-training.<n>We propose TGPA, a backdoor attack framework designed specifically for graph prompts.
arXiv Detail & Related papers (2024-10-17T18:57:27Z)
MEGen: Generative Backdoor into Large Language Models via Model Editing [36.67048791892558]
This paper focuses on the impact of backdoored large language models (LLMs)<n>We propose an editing-based generative backdoor, named MEGen, aiming to expand the backdoor to generative tasks.<n>Experiments show that MEGen achieves a high attack success rate by adjusting only a small set of local parameters.
arXiv Detail & Related papers (2024-08-20T10:44:29Z)
Transferring Backdoors between Large Language Models by Knowledge Distillation [2.9138150728729064]
Backdoor Attacks have been a serious vulnerability against Large Language Models (LLMs) Previous methods only reveal such risk in specific models, or present tasks transferability after attacking the pre-trained phase. We propose ATBA, an adaptive transferable backdoor attack, which can effectively distill the backdoor of teacher LLMs into small models.
arXiv Detail & Related papers (2024-08-19T10:39:45Z)
Securing Multi-turn Conversational Language Models From Distributed Backdoor Triggers [29.554818890832887]
Large language models (LLMs) have acquired the ability to handle longer context lengths and understand nuances in text. This paper exposes a vulnerability that leverages the multi-turn feature and strong learning ability of LLMs to harm the end-user. We propose a decoding time defense that scales linearly with assistant response sequence length and reduces the backdoor to as low as 0.35%.
arXiv Detail & Related papers (2024-07-04T20:57:06Z)
Model Pairing Using Embedding Translation for Backdoor Attack Detection on Open-Set Classification Tasks [63.269788236474234]
We propose to use model pairs on open-set classification tasks for detecting backdoors. We show that this score, can be an indicator for the presence of a backdoor despite models being of different architectures. This technique allows for the detection of backdoors on models designed for open-set classification tasks, which is little studied in the literature.
arXiv Detail & Related papers (2024-02-28T21:29:16Z)
Backdoor Defense via Suppressing Model Shortcuts [91.30995749139012]
In this paper, we explore the backdoor mechanism from the angle of the model structure. We demonstrate that the attack success rate (ASR) decreases significantly when reducing the outputs of some key skip connections.
arXiv Detail & Related papers (2022-11-02T15:39:19Z)
Motif-Backdoor: Rethinking the Backdoor Attack on Graph Neural Networks via Motifs [1.9109292348200242]
Graph neural network (GNN) with a powerful representation capability has been widely applied to various areas, such as biological gene prediction, social recommendation, etc. Recent works have exposed that GNN is vulnerable to the backdoor attack, i.e., models trained with maliciously crafted training samples are easily fooled by patched samples. Most of the proposed studies launch the backdoor attack using a trigger that either is the randomly generated subgraph (e.g., erdHos-r'enyi backdoor) for less computational burden, or the gradient-based generative subgraph (e.g
arXiv Detail & Related papers (2022-10-25T02:12:30Z)

This list is automatically generated from the titles and abstracts of the papers in this site.