Related papers: Detecting Sleeper Agents in Large Language Models via Semantic Drift Analysis

Detecting Sleeper Agents in Large Language Models via Semantic Drift Analysis

URL: http://arxiv.org/abs/2511.15992v1
Date: Thu, 20 Nov 2025 02:42:41 GMT
Title: Detecting Sleeper Agents in Large Language Models via Semantic Drift Analysis
Authors: Shahin Zanbaghi, Ryan Rostampour, Farhan Abid, Salim Al Jarmakani,
Abstract summary: Large Language Models (LLMs) can be backdoored to exhibit malicious behavior under specific deployment conditions.<n>Recent work by Hubinger et al. demonstrated that backdoors persist through safety training, yet no practical detection methods exist.<n>We present a novel dual-method detection system combining semantic drift analysis with canary baseline comparison.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) can be backdoored to exhibit malicious behavior under specific deployment conditions while appearing safe during training a phenomenon known as "sleeper agents." Recent work by Hubinger et al. demonstrated that these backdoors persist through safety training, yet no practical detection methods exist. We present a novel dual-method detection system combining semantic drift analysis with canary baseline comparison to identify backdoored LLMs in real-time. Our approach uses Sentence-BERT embeddings to measure semantic deviation from safe baselines, complemented by injected canary questions that monitor response consistency. Evaluated on the official Cadenza-Labs dolphin-llama3-8B sleeper agent model, our system achieves 92.5% accuracy with 100% precision (zero false positives) and 85% recall. The combined detection method operates in real-time (<1s per query), requires no model modification, and provides the first practical solution to LLM backdoor detection. Our work addresses a critical security gap in AI deployment and demonstrates that embedding-based detection can effectively identify deceptive model behavior without sacrificing deployment efficiency.

Related papers

The Trigger in the Haystack: Extracting and Reconstructing LLM Backdoor Triggers [2.2374050209578864]
We present a practical scanner for identifying sleeper agent-style backdoors in causal language models.<n>Our approach relies on two key findings: first, sleeper agents tend to memorize poisoning data, making it possible to leak backdoor examples.<n>We show that our method recovers working triggers across multiple backdoor scenarios and a broad range of models.
arXiv Detail & Related papers (2026-02-03T04:17:21Z)
Semantics-Preserving Evasion of LLM Vulnerability Detectors [14.476903104601154]
LLM-based vulnerability detectors are increasingly deployed in security-critical code review.<n>We evaluate detection-time integrity under a semantics-preserving threat model.<n>We introduce a metric of joint robustness across different attack methods/carriers.
arXiv Detail & Related papers (2026-01-30T20:54:27Z)
Immunity memory-based jailbreak detection: multi-agent adaptive guard for large language models [12.772312329709868]
Large language models (LLMs) have become foundational in AI systems, yet they remain vulnerable to adversarial jailbreak attacks.<n>We propose the Multi-Agent Adaptive Guard (MAAG) framework for jailbreak detection.<n>MAAG first extracts activation values from input prompts and compares them to historical activations stored in a memory bank for quick preliminary detection.
arXiv Detail & Related papers (2025-12-03T01:40:40Z)
VulAgent: Hypothesis-Validation based Multi-Agent Vulnerability Detection [55.957275374847484]
VulAgent is a multi-agent vulnerability detection framework based on hypothesis validation.<n>It implements a semantics-sensitive, multi-view detection pipeline, each aligned to a specific analysis perspective.<n>On average, VulAgent improves overall accuracy by 6.6%, increases the correct identification rate of vulnerable--fixed code pairs by up to 450%, and reduces the false positive rate by about 36%.
arXiv Detail & Related papers (2025-09-15T02:25:38Z)
Embedding Poisoning: Bypassing Safety Alignment via Embedding Semantic Shift [23.0914017433021]
This work identifies a novel class of deployment phase attacks that exploit a vulnerability by injecting imperceptible perturbations directly into the embedding layer outputs without modifying model weights or input text.<n>We propose Search based Embedding Poisoning, a practical, model agnostic framework that introduces carefully optimized perturbations into embeddings associated with high risk tokens.
arXiv Detail & Related papers (2025-09-08T05:00:58Z)
BlindGuard: Safeguarding LLM-based Multi-Agent Systems under Unknown Attacks [58.959622170433725]
BlindGuard is an unsupervised defense method that learns without requiring any attack-specific labels or prior knowledge of malicious behaviors.<n>We show that BlindGuard effectively detects diverse attack types (i.e., prompt injection, memory poisoning, and tool attack) across multi-agent systems.
arXiv Detail & Related papers (2025-08-11T16:04:47Z)
Revisiting Backdoor Attacks on LLMs: A Stealthy and Practical Poisoning Framework via Harmless Inputs [54.90315421117162]
We propose a novel poisoning method via completely harmless data.<n>Inspired by the causal reasoning in auto-regressive LLMs, we aim to establish robust associations between triggers and an affirmative response prefix.<n>We observe an interesting resistance phenomenon where the LLM initially appears to agree but subsequently refuses to answer.
arXiv Detail & Related papers (2025-05-23T08:13:59Z)
Lie Detector: Unified Backdoor Detection via Cross-Examination Framework [68.45399098884364]
We propose a unified backdoor detection framework in the semi-honest setting.<n>Our method achieves superior detection performance, improving accuracy by 5.4%, 1.6%, and 11.9% over SoTA baselines.<n> Notably, it is the first to effectively detect backdoors in multimodal large language models.
arXiv Detail & Related papers (2025-03-21T06:12:06Z)
Runtime Backdoor Detection for Federated Learning via Representational Dissimilarity Analysis [24.56608572464567]
Federated learning (FL) trains a shared model by aggregating model updates from distributed clients.<n>The decoupling of model learning from local data makes FL highly vulnerable to backdoor attacks.<n>We propose a novel approach to detecting malicious clients in an accurate, stable, and efficient manner.
arXiv Detail & Related papers (2025-03-06T14:23:18Z)
BEEAR: Embedding-based Adversarial Removal of Safety Backdoors in Instruction-tuned Language Models [57.5404308854535]
Safety backdoor attacks in large language models (LLMs) enable the stealthy triggering of unsafe behaviors while evading detection during normal interactions. We present BEEAR, a mitigation approach leveraging the insight that backdoor triggers induce relatively uniform drifts in the model's embedding space. Our bi-level optimization method identifies universal embedding perturbations that elicit unwanted behaviors and adjusts the model parameters to reinforce safe behaviors against these perturbations.
arXiv Detail & Related papers (2024-06-24T19:29:47Z)
Lazy Layers to Make Fine-Tuned Diffusion Models More Traceable [70.77600345240867]
A novel arbitrary-in-arbitrary-out (AIAO) strategy makes watermarks resilient to fine-tuning-based removal. Unlike the existing methods of designing a backdoor for the input/output space of diffusion models, in our method, we propose to embed the backdoor into the feature space of sampled subpaths. Our empirical studies on the MS-COCO, AFHQ, LSUN, CUB-200, and DreamBooth datasets confirm the robustness of AIAO.
arXiv Detail & Related papers (2024-05-01T12:03:39Z)
Kick Bad Guys Out! Conditionally Activated Anomaly Detection in Federated Learning with Zero-Knowledge Proof Verification [31.38942054994932]
Federated Learning (FL) systems are susceptible to adversarial attacks.<n>RedJasper is a two-staged anomaly detection method specifically designed for real-world FL deployments.<n>It identifies suspicious activities in the first stage, then activates the second stage conditionally to further scrutinize the suspicious local models.
arXiv Detail & Related papers (2023-10-06T07:09:05Z)
Evaluating the Safety of Deep Reinforcement Learning Models using Semi-Formal Verification [81.32981236437395]
We present a semi-formal verification approach for decision-making tasks based on interval analysis. Our method obtains comparable results over standard benchmarks with respect to formal verifiers. Our approach allows to efficiently evaluate safety properties for decision-making models in practical applications.
arXiv Detail & Related papers (2020-10-19T11:18:06Z)

This list is automatically generated from the titles and abstracts of the papers in this site.