DualSentinel: A Lightweight Framework for Detecting Targeted Attacks in Black-box LLM via Dual Entropy Lull Pattern
- URL: http://arxiv.org/abs/2603.01574v1
- Date: Mon, 02 Mar 2026 08:02:47 GMT
- Title: DualSentinel: A Lightweight Framework for Detecting Targeted Attacks in Black-box LLM via Dual Entropy Lull Pattern
- Authors: Xiaoyi Pang, Xuanyi Hao, Pengyu Liu, Qi Luo, Song Guo, Zhibo Wang
- Abstract summary: We introduce DualSentinel, a lightweight and unified defense framework. It can accurately and promptly detect the activation of targeted attacks alongside the Large Language Model's generation process. It is highly effective (superior detection accuracy with near-zero false positives) and remarkably efficient (negligible additional cost).
- Score: 23.834578989358423
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent intelligent systems integrate powerful Large Language Models (LLMs) through APIs, but their trustworthiness may be critically undermined by targeted attacks like backdoor and prompt injection attacks, which secretly force LLMs to generate specific malicious sequences. Existing defensive approaches for such threats typically rely on high access rights, impose prohibitive costs, and hinder normal inference, rendering them impractical for real-world scenarios. To address these limitations, we introduce DualSentinel, a lightweight and unified defense framework that can accurately and promptly detect the activation of targeted attacks alongside the LLM generation process. We first identify a characteristic of compromised LLMs, termed Entropy Lull: when a targeted attack successfully hijacks the generation process, the LLM exhibits a distinct period of abnormally low and stable token probability entropy, indicating it is following a fixed path rather than making creative choices. DualSentinel leverages this pattern by developing an innovative dual-check approach. It first employs a magnitude- and trend-aware monitoring method to proactively and sensitively flag an entropy lull pattern at runtime. Upon such flagging, it triggers a lightweight yet powerful secondary verification based on task-flipping. An attack is confirmed only if the entropy lull pattern persists across both the original and the flipped task, proving that the LLM's output is coercively controlled. Extensive evaluations show that DualSentinel is both highly effective (superior detection accuracy with near-zero false positives) and remarkably efficient (negligible additional cost), offering a truly practical path toward securing deployed LLMs. The source code can be accessed at https://doi.org/10.5281/zenodo.18479273.
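The dual-check procedure lends itself to a compact illustration. Below is a minimal sketch of the entropy-lull monitor and task-flip verification, assuming streaming access to per-token top-k log-probabilities (as many black-box LLM APIs expose); the window size, thresholds, and the `run_llm`/`flip_task` interfaces are illustrative assumptions, not the authors' implementation.

```python
import math
from collections import deque
from statistics import mean, pstdev

def token_entropy(top_logprobs):
    """Shannon entropy (nats) over a token's top-k probability mass.

    `top_logprobs` is the list of log-probabilities for the top-k
    candidate tokens at one decoding step.
    """
    probs = [math.exp(lp) for lp in top_logprobs]
    return -sum(p * math.log(p) for p in probs if p > 0)

class EntropyLullMonitor:
    """Flags a run of abnormally low *and* stable token entropy.

    `mag_thresh` (low magnitude) and `std_thresh` (flat trend) are
    illustrative hyperparameters, not values from the paper.
    """

    def __init__(self, window=8, mag_thresh=0.15, std_thresh=0.05):
        self.window = deque(maxlen=window)
        self.mag_thresh = mag_thresh
        self.std_thresh = std_thresh

    def update(self, entropy):
        self.window.append(entropy)
        if len(self.window) < self.window.maxlen:
            return False  # still warming up
        low = mean(self.window) < self.mag_thresh      # magnitude check
        flat = pstdev(self.window) < self.std_thresh   # trend/stability check
        return low and flat

def dual_check(prompt, run_llm, flip_task):
    """Confirm an attack only if the entropy lull persists on a flipped task.

    `run_llm(prompt)` must yield per-step top-k logprob lists;
    `flip_task(prompt)` rewrites the prompt into a contrasting task.
    Both are assumed interfaces for illustration.
    """
    def has_lull(p):
        mon = EntropyLullMonitor()
        return any(mon.update(token_entropy(step)) for step in run_llm(p))

    if not has_lull(prompt):
        return False                    # no suspicious pattern at runtime
    return has_lull(flip_task(prompt))  # lull survives task flip => coerced output
```

The task flip acts as a control: a benign low-entropy stretch (e.g., verbatim copying of quoted text) should vanish once the task changes, whereas a coerced target sequence persists across both runs.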
Related papers
- PSM: Prompt Sensitivity Minimization via LLM-Guided Black-Box Optimization [0.0]
This paper introduces a novel framework for hardening system prompts through shield appending. We leverage an LLM-as-optimizer to search the space of possible SHIELDs, seeking to minimize a leakage metric derived from a suite of adversarial attacks. We demonstrate empirically that our optimized SHIELDs significantly reduce prompt leakage against a comprehensive set of extraction attacks.
arXiv Detail & Related papers (2025-11-20T10:25:45Z)
- Visual Backdoor Attacks on MLLM Embodied Decision Making via Contrastive Trigger Learning [89.1856483797116]
We introduce BEAT, the first framework to inject visual backdoors into MLLM-based embodied agents. Unlike textual triggers, object triggers exhibit wide variation across viewpoints and lighting, making them difficult to implant reliably. BEAT achieves attack success rates up to 80%, while maintaining strong benign task performance.
arXiv Detail & Related papers (2025-10-31T16:50:49Z)
- ConfGuard: A Simple and Effective Backdoor Detection for Large Language Models [23.236088751922807]
Backdoor attacks pose a significant threat to Large Language Models (LLMs). Most existing defense methods, primarily designed for classification tasks, are ineffective against the autoregressive nature and vast output space of LLMs. We propose ConfGuard, a lightweight and effective detection method that monitors a sliding window of token confidences to identify sequence lock.
arXiv Detail & Related papers (2025-08-02T13:38:04Z)
- Robust Anti-Backdoor Instruction Tuning in LVLMs [53.766434746801366]
We introduce a lightweight, certified-agnostic defense framework for large visual language models (LVLMs). Our framework fine-tunes only adapter modules and text embedding layers under instruction tuning. Experiments against seven attacks on Flickr30k and MSCOCO demonstrate that our method reduces the attack success rate to nearly zero.
arXiv Detail & Related papers (2025-06-04T01:23:35Z)
- Revisiting Backdoor Attacks on LLMs: A Stealthy and Practical Poisoning Framework via Harmless Inputs [54.90315421117162]
We propose a novel poisoning method via completely harmless data. Inspired by the causal reasoning in auto-regressive LLMs, we aim to establish robust associations between triggers and an affirmative response prefix. We observe an interesting resistance phenomenon where the LLM initially appears to agree but subsequently refuses to answer.
arXiv Detail & Related papers (2025-05-23T08:13:59Z)
- LightDefense: A Lightweight Uncertainty-Driven Defense against Jailbreaks via Shifted Token Distribution [84.2846064139183]
Large Language Models (LLMs) face threats from jailbreak prompts. We propose LightDefense, a lightweight defense mechanism targeted at white-box models.
arXiv Detail & Related papers (2025-04-02T09:21:26Z)
- Exposing the Ghost in the Transformer: Abnormal Detection for Large Language Models via Hidden State Forensics [5.384257830522198]
Large Language Models (LLMs) in critical applications have introduced severe reliability and security risks. These vulnerabilities have been weaponized by malicious actors, leading to unauthorized access, widespread misinformation, and compromised system integrity. We introduce a novel approach to detecting abnormal behaviors in LLMs via hidden state forensics.
arXiv Detail & Related papers (2025-04-01T05:58:14Z)
- Reason2Attack: Jailbreaking Text-to-Image Models via LLM Reasoning [34.73320827764541]
Text-to-Image (T2I) models typically deploy safety filters to prevent the generation of sensitive images. Recent jailbreaking attack methods manually design prompts for the LLM to generate adversarial prompts. We propose Reason2Attack (R2A), which aims to enhance the LLM's reasoning capabilities in generating adversarial prompts.
arXiv Detail & Related papers (2025-03-23T08:40:39Z)
- Improving LLM Safety Alignment with Dual-Objective Optimization [81.98466438000086]
Existing training-time safety alignment techniques for large language models (LLMs) remain vulnerable to jailbreak attacks. We propose an improved safety alignment method that disentangles the DPO objective into two components: (1) robust refusal training, which encourages refusal even when partial unsafe generations are produced, and (2) targeted unlearning of harmful knowledge.
arXiv Detail & Related papers (2025-03-05T18:01:05Z)
- Understanding and Enhancing the Transferability of Jailbreaking Attacks [12.446931518819875]
Jailbreaking attacks can effectively manipulate open-source large language models (LLMs) to produce harmful responses. This work investigates the transferability of jailbreaking attacks by analysing their impact on the model's intent perception. We propose the Perceived-importance Flatten (PiF) method, which uniformly disperses the model's focus across neutral-intent tokens in the original input.
arXiv Detail & Related papers (2025-02-05T10:29:54Z)
- ASETF: A Novel Method for Jailbreak Attack on LLMs through Translate Suffix Embeddings [58.82536530615557]
We propose an Adversarial Suffix Embedding Translation Framework (ASETF) to transform continuous adversarial suffix embeddings into coherent and understandable text. Our method significantly reduces the computation time of adversarial suffixes and achieves a much better attack success rate than existing techniques.
arXiv Detail & Related papers (2024-02-25T06:46:27Z)
- SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks [99.23352758320945]
We propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks on large language models (LLMs). Based on our finding that adversarially generated prompts are brittle to character-level changes, our defense first randomly perturbs multiple copies of a given input prompt, and then aggregates the corresponding predictions to detect adversarial inputs; a minimal sketch of this perturb-and-aggregate idea follows this list.
arXiv Detail & Related papers (2023-10-05T17:01:53Z)
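For contrast with DualSentinel's output-side monitoring, here is a minimal sketch of the input-side perturb-and-aggregate defense referenced in the SmoothLLM entry above; the perturbation rate, copy count, and the `query_llm`/`is_refusal` helpers are illustrative assumptions rather than the paper's exact algorithm.

```python
import random
import string

def perturb(prompt, rate=0.1):
    """Randomly swap a fraction of characters, exploiting the brittleness
    of adversarial suffixes to character-level noise."""
    chars = list(prompt)
    for i in range(len(chars)):
        if random.random() < rate:
            chars[i] = random.choice(string.printable)
    return "".join(chars)

def smoothllm_vote(prompt, query_llm, is_refusal, n_copies=6):
    """Query the LLM on several perturbed copies and majority-vote.

    `query_llm(prompt) -> str` and `is_refusal(response) -> bool` are
    assumed interfaces: the defense flags the input as adversarial when
    most perturbed copies elicit a refusal."""
    votes = sum(is_refusal(query_llm(perturb(prompt))) for _ in range(n_copies))
    return votes > n_copies / 2  # True => treat prompt as a jailbreak attempt
```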