Related papers: STShield: Single-Token Sentinel for Real-Time Jailbreak Detection in Large Language Models

STShield: Single-Token Sentinel for Real-Time Jailbreak Detection in Large Language Models

URL: http://arxiv.org/abs/2503.17932v1
Date: Sun, 23 Mar 2025 04:23:07 GMT
Title: STShield: Single-Token Sentinel for Real-Time Jailbreak Detection in Large Language Models
Authors: Xunguang Wang, Wenxuan Wang, Zhenlan Ji, Zongjie Li, Pingchuan Ma, Daoyuan Wu, Shuai Wang,
Abstract summary: Large Language Models (LLMs) have become increasingly vulnerable to jailbreak attacks.<n>We present STShield, a lightweight framework for real-time jailbroken judgement.
Score: 31.35788474507371
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) have become increasingly vulnerable to jailbreak attacks that circumvent their safety mechanisms. While existing defense methods either suffer from adaptive attacks or require computationally expensive auxiliary models, we present STShield, a lightweight framework for real-time jailbroken judgement. STShield introduces a novel single-token sentinel mechanism that appends a binary safety indicator to the model's response sequence, leveraging the LLM's own alignment capabilities for detection. Our framework combines supervised fine-tuning on normal prompts with adversarial training using embedding-space perturbations, achieving robust detection while preserving model utility. Extensive experiments demonstrate that STShield successfully defends against various jailbreak attacks, while maintaining the model's performance on legitimate queries. Compared to existing approaches, STShield achieves superior defense performance with minimal computational overhead, making it a practical solution for real-world LLM deployment.

Related papers

Secure Tug-of-War (SecTOW): Iterative Defense-Attack Training with Reinforcement Learning for Multimodal Model Security [63.41350337821108]
We propose Secure Tug-of-War (SecTOW) to enhance the security of multimodal large language models (MLLMs)<n>SecTOW consists of two modules: a defender and an auxiliary attacker, both trained iteratively using reinforcement learning (GRPO)<n>We show that SecTOW significantly improves security while preserving general performance.
arXiv Detail & Related papers (2025-07-29T17:39:48Z)
MISLEADER: Defending against Model Extraction with Ensembles of Distilled Models [56.09354775405601]
Model extraction attacks aim to replicate the functionality of a black-box model through query access.<n>Most existing defenses presume that attacker queries have out-of-distribution (OOD) samples, enabling them to detect and disrupt suspicious inputs.<n>We propose MISLEADER, a novel defense strategy that does not rely on OOD assumptions.
arXiv Detail & Related papers (2025-06-03T01:37:09Z)
RADEP: A Resilient Adaptive Defense Framework Against Model Extraction Attacks [6.6680585862156105]
We introduce a Resilient Adaptive Defense Framework for Model Extraction Attack Protection (RADEP)<n>RADEP employs progressive adversarial training to enhance model resilience against extraction attempts.<n> Ownership verification is enforced through embedded watermarking and backdoor triggers.
arXiv Detail & Related papers (2025-05-25T23:28:05Z)
One Trigger Token Is Enough: A Defense Strategy for Balancing Safety and Usability in Large Language Models [20.42976162135529]
Large Language Models (LLMs) have been extensively used across diverse domains, including virtual assistants, automated code generation, and scientific research.<n>We propose textttD-STT, a simple yet effective defense algorithm that identifies and explicitly decodes safety trigger tokens of the given safety-aligned LLM.
arXiv Detail & Related papers (2025-05-12T01:26:50Z)
AegisLLM: Scaling Agentic Systems for Self-Reflective Defense in LLM Security [74.22452069013289]
AegisLLM is a cooperative multi-agent defense against adversarial attacks and information leakage. We show that scaling agentic reasoning system at test-time substantially enhances robustness without compromising model utility. Comprehensive evaluations across key threat scenarios, including unlearning and jailbreaking, demonstrate the effectiveness of AegisLLM.
arXiv Detail & Related papers (2025-04-29T17:36:05Z)
Prefill-Based Jailbreak: A Novel Approach of Bypassing LLM Safety Boundary [2.4329261266984346]
Large Language Models (LLMs) are designed to generate helpful and safe content. adversarial attacks, commonly referred to as jailbreak, can bypass their safety protocols. We introduce a novel jailbreak attack method that leverages the prefilling feature of LLMs.
arXiv Detail & Related papers (2025-04-28T07:38:43Z)
Improving LLM Safety Alignment with Dual-Objective Optimization [65.41451412400609]
Existing training-time safety alignment techniques for large language models (LLMs) remain vulnerable to jailbreak attacks.<n>We propose an improved safety alignment that disentangles DPO objectives into two components: (1) robust refusal training, which encourages refusal even when partial unsafe generations are produced, and (2) targeted unlearning of harmful knowledge.
arXiv Detail & Related papers (2025-03-05T18:01:05Z)
DELMAN: Dynamic Defense Against Large Language Model Jailbreaking with Model Editing [62.43110639295449]
Large Language Models (LLMs) are widely applied in decision making, but their deployment is threatened by jailbreak attacks.<n>Delman is a novel approach leveraging direct model editing for precise, dynamic protection against jailbreak attacks.<n>Delman directly updates a minimal set of relevant parameters to neutralize harmful behaviors while preserving the model's utility.
arXiv Detail & Related papers (2025-02-17T10:39:21Z)
Adversarial Reasoning at Jailbreaking Time [49.70772424278124]
We develop an adversarial reasoning approach to automatic jailbreaking via test-time computation.<n>Our approach introduces a new paradigm in understanding LLM vulnerabilities, laying the foundation for the development of more robust and trustworthy AI systems.
arXiv Detail & Related papers (2025-02-03T18:59:01Z)
Latent-space adversarial training with post-aware calibration for defending large language models against jailbreak attacks [25.212057612342218]
Large language models (LLMs) are susceptible to jailbreak attacks, which exploit system vulnerabilities to bypass safety measures and generate harmful outputs.<n>We propose a Latent-space Adversarial Training with Post-aware framework to address this problem.
arXiv Detail & Related papers (2025-01-18T02:57:12Z)
LIAR: Leveraging Inference Time Alignment (Best-of-N) to Jailbreak LLMs in Seconds [98.20826635707341]
Jailbreak attacks expose vulnerabilities in safety-aligned LLMs by eliciting harmful outputs through carefully crafted prompts.<n>We frame jailbreaks as inference-time misalignment and introduce LIAR, a fast, black-box, best-of-$N$ sampling attack requiring no training.<n>We also introduce a theoretical "safety net against jailbreaks" metric to quantify safety alignment strength and derive suboptimality bounds.
arXiv Detail & Related papers (2024-12-06T18:02:59Z)
MoJE: Mixture of Jailbreak Experts, Naive Tabular Classifiers as Guard for Prompt Attacks [2.873719680183099]
This paper advocates for the significance of jailbreak attack prevention on Large Language Models (LLMs) We introduce MoJE, a novel guardrail architecture designed to surpass current limitations in existing state-of-the-art guardrails. MoJE excels in detecting jailbreak attacks while maintaining minimal computational overhead during model inference.
arXiv Detail & Related papers (2024-09-26T10:12:19Z)
Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak Attacks [59.46556573924901]
This paper introduces Defensive Prompt Patch (DPP), a novel prompt-based defense mechanism for large language models (LLMs) Unlike previous approaches, DPP is designed to achieve a minimal Attack Success Rate (ASR) while preserving the high utility of LLMs. Empirical results conducted on LLAMA-2-7B-Chat and Mistral-7B-Instruct-v0.2 models demonstrate the robustness and adaptability of DPP.
arXiv Detail & Related papers (2024-05-30T14:40:35Z)
Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing [107.97160023681184]
Aligned large language models (LLMs) are vulnerable to jailbreaking attacks. We propose SEMANTICSMOOTH, a smoothing-based defense that aggregates predictions of semantically transformed copies of a given input prompt.
arXiv Detail & Related papers (2024-02-25T20:36:03Z)

This list is automatically generated from the titles and abstracts of the papers in this site.