Learning to Detect Unknown Jailbreak Attacks in Large Vision-Language Models
- URL: http://arxiv.org/abs/2510.15430v2
- Date: Mon, 20 Oct 2025 11:50:13 GMT
- Title: Learning to Detect Unknown Jailbreak Attacks in Large Vision-Language Models
- Authors: Shuang Liang, Zhihao Xu, Jialing Tao, Hui Xue, Xiting Wang
- Abstract summary: We propose a general framework that accurately detects unknown jailbreak attacks by shifting the focus from attack-specific learning to task-specific learning. Experiments show that our method consistently achieves higher detection AUROC on diverse unknown attacks while improving efficiency.
- Score: 22.796169894587475
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite extensive alignment efforts, Large Vision-Language Models (LVLMs) remain vulnerable to jailbreak attacks, posing serious safety risks. To address this, existing detection methods either learn attack-specific parameters, which hinders generalization to unseen attacks, or rely on heuristically sound principles, which limit accuracy and efficiency. To overcome these limitations, we propose Learning to Detect (LoD), a general framework that accurately detects unknown jailbreak attacks by shifting the focus from attack-specific learning to task-specific learning. This framework includes a Multi-modal Safety Concept Activation Vector module for safety-oriented representation learning and a Safety Pattern Auto-Encoder module for unsupervised attack classification. Extensive experiments show that our method achieves consistently higher detection AUROC on diverse unknown attacks while improving efficiency. The code is available at https://anonymous.4open.science/r/Learning-to-Detect-51CB.
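The two modules suggest a natural two-stage pipeline: learn a safety-oriented direction in the model's hidden states from benign safe/unsafe task data, then train an auto-encoder only on the resulting "safety patterns" of benign inputs so that unknown attacks surface as reconstruction anomalies. Below is a minimal sketch of that reading in PyTorch; the pooling, linear-probe training, and hyperparameters are illustrative assumptions, not the authors' implementation (the linked repository has the real one).

```python
import torch
import torch.nn as nn

# Stage 1 (illustrative): a safety concept activation vector per layer,
# obtained as the weight direction of a linear probe that separates pooled
# hidden states of safe vs. unsafe *task* data (no jailbreak examples).
def fit_safety_cav(safe_feats: torch.Tensor, unsafe_feats: torch.Tensor) -> torch.Tensor:
    X = torch.cat([safe_feats, unsafe_feats])
    y = torch.cat([torch.zeros(len(safe_feats)), torch.ones(len(unsafe_feats))])
    probe = nn.Linear(X.shape[1], 1)
    opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
    for _ in range(200):
        opt.zero_grad()
        loss = nn.functional.binary_cross_entropy_with_logits(probe(X).squeeze(-1), y)
        loss.backward()
        opt.step()
    w = probe.weight.detach().squeeze(0)
    return w / w.norm()  # unit-norm concept direction

def safety_pattern(hidden_per_layer, cavs):
    # One scalar per layer: the projection of each layer's pooled hidden
    # state onto that layer's concept direction. Shape: (batch, n_layers).
    return torch.stack([h @ w for h, w in zip(hidden_per_layer, cavs)], dim=-1)

# Stage 2 (illustrative): an auto-encoder trained only on benign safety
# patterns; reconstruction error is the unsupervised anomaly score.
class PatternAE(nn.Module):
    def __init__(self, n_layers: int, hidden: int = 8):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_layers, hidden), nn.ReLU())
        self.dec = nn.Linear(hidden, n_layers)

    def forward(self, s):
        return self.dec(self.enc(s))

def anomaly_score(ae: PatternAE, s: torch.Tensor) -> torch.Tensor:
    return ((ae(s) - s) ** 2).mean(dim=-1)  # high = likely jailbreak
```

Because the auto-encoder never sees jailbreak examples, the detector stays attack-agnostic: any input whose safety pattern falls off the benign manifold scores high, regardless of which attack produced it.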
Related papers
- ALERT: Zero-shot LLM Jailbreak Detection via Internal Discrepancy Amplification [47.135407245022115]
Existing detection methods mainly detect jailbreak status by relying on jailbreak templates present in the training data. We propose a layer-wise, module-wise, and token-wise amplification framework that progressively magnifies internal feature discrepancies between benign and jailbreak prompts. Building upon these insights, we introduce ALERT, an efficient and effective zero-shot jailbreak detector.
arXiv Detail & Related papers (2026-01-07T05:30:53Z)
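A zero-shot detector in this spirit can be sketched as follows: measure, per layer, how far a prompt's internal features drift from statistics estimated on benign prompts alone, and weight deeper layers more heavily so the aggregate discrepancy becomes separable. The pooled features, z-score drift, and geometric layer weighting below are assumptions for illustration, not ALERT's actual amplification scheme.

```python
import numpy as np

def amplified_discrepancy(acts, benign_mean, benign_std, amplify=1.5):
    """Score a prompt by its amplified per-layer drift from benign statistics.

    acts:        list of per-layer pooled activations, each of shape (d,)
    benign_mean: per-layer means estimated on benign prompts only
    benign_std:  per-layer standard deviations (same shapes)
    amplify:     hypothetical geometric weight favoring deeper layers
    """
    score = 0.0
    for i, (a, mu, sd) in enumerate(zip(acts, benign_mean, benign_std)):
        z = np.abs((a - mu) / (sd + 1e-6))  # normalized per-dimension drift
        score += (amplify ** i) * z.mean()  # later layers count for more
    return score

# Zero-shot decision rule: calibrate a threshold on benign prompts alone
# (e.g. the 99th percentile of their scores) and flag anything above it.
```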
- Rethinking Jailbreak Detection of Large Vision Language Models with Representational Contrastive Scoring [13.497048408038935]
Large Vision-Language Models (LVLMs) are vulnerable to a growing array of multimodal jailbreak attacks. Current anomaly-detection methods tend to confuse novel benign inputs with malicious ones, leading to unreliable over-rejection. We propose Representational Contrastive Scoring (RCS), a framework built on a key insight: the most potent safety signals reside within the LVLM's own internal representations.
arXiv Detail & Related papers (2025-12-12T22:31:38Z)
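Reading the abstract, the scoring idea is to contrast an input's internal representation against both unsafe and benign anchors, so that novel-but-benign inputs are not over-rejected. A minimal sketch, assuming pooled per-prompt representations, cosine similarity, and anchors built from a small calibration set (all assumptions, not details from the paper):

```python
import numpy as np

def cosine(a, M):
    # Cosine similarity between a vector a of shape (d,) and each row of M (k, d).
    return (M @ a) / (np.linalg.norm(M, axis=1) * np.linalg.norm(a) + 1e-9)

def contrastive_score(h, benign_anchors, unsafe_anchors):
    # High when h sits closer to known-unsafe anchors than to benign ones;
    # subtracting the benign term is what counteracts over-rejection.
    return cosine(h, unsafe_anchors).max() - cosine(h, benign_anchors).max()
```

A threshold on this score, calibrated on held-out benign inputs, would give the final accept/reject decision.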
- Immunity memory-based jailbreak detection: multi-agent adaptive guard for large language models [12.772312329709868]
Large language models (LLMs) have become foundational in AI systems, yet they remain vulnerable to adversarial jailbreak attacks. We propose the Multi-Agent Adaptive Guard (MAAG) framework for jailbreak detection. MAAG first extracts activation values from input prompts and compares them to historical activations stored in a memory bank for quick preliminary detection.
arXiv Detail & Related papers (2025-12-03T01:40:40Z)
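The memory-bank step admits a compact sketch: cache the activation vectors of previously adjudicated prompts alongside their verdicts, let a new prompt inherit the verdict of a sufficiently similar cached entry, and escalate everything else to the slower multi-agent analysis. The pooling, cosine matching, and 0.9 threshold below are illustrative assumptions:

```python
import numpy as np

class ActivationMemory:
    """Toy memory bank for quick preliminary screening (illustration only)."""

    def __init__(self, threshold: float = 0.9):
        self.vecs, self.labels = [], []
        self.threshold = threshold

    def add(self, v: np.ndarray, verdict: str):
        # Store unit-normalized activations with their adjudicated verdicts.
        self.vecs.append(v / np.linalg.norm(v))
        self.labels.append(verdict)

    def lookup(self, v: np.ndarray):
        # Return a cached verdict for a close-enough match, else None,
        # which signals escalation to the full multi-agent pipeline.
        if not self.vecs:
            return None
        sims = np.stack(self.vecs) @ (v / np.linalg.norm(v))
        i = int(np.argmax(sims))
        return self.labels[i] if sims[i] >= self.threshold else None
```

The threshold trades quick-path coverage against the risk of inheriting a stale or wrong verdict from memory.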
- DiffuGuard: How Intrinsic Safety is Lost and Found in Diffusion Large Language Models [50.21378052667732]
We conduct an in-depth analysis of dLLM vulnerabilities to jailbreak attacks across two distinct dimensions: intra-step and inter-step dynamics. We propose DiffuGuard, a training-free defense framework that addresses vulnerabilities through a dual-stage approach.
arXiv Detail & Related papers (2025-09-29T05:17:10Z)
- BlindGuard: Safeguarding LLM-based Multi-Agent Systems under Unknown Attacks [58.959622170433725]
BlindGuard is an unsupervised defense method that learns without requiring any attack-specific labels or prior knowledge of malicious behaviors. We show that BlindGuard effectively detects diverse attack types (i.e., prompt injection, memory poisoning, and tool attacks) across multi-agent systems.
arXiv Detail & Related papers (2025-08-11T16:04:47Z)
- Learning to Detect Unknown Jailbreak Attacks in Large Vision-Language Models: A Unified and Accurate Approach [22.248911000455706]
We propose a novel unsupervised framework that formulates jailbreak detection as anomaly detection. LoD achieves state-of-the-art performance, with an average AUROC of 0.9951 and an improvement of up to 38.89% in the minimum AUROC over the strongest baselines.
arXiv Detail & Related papers (2025-08-08T16:13:28Z)
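For readers less familiar with the metric: AUROC is the probability that a randomly chosen attack receives a higher anomaly score than a randomly chosen benign input, so 0.9951 indicates near-perfect ranking. A toy computation with scikit-learn, with scores and labels invented purely for illustration:

```python
from sklearn.metrics import roc_auc_score

scores = [0.05, 0.12, 0.09, 0.91, 0.88, 0.97]  # detector anomaly scores
labels = [0, 0, 0, 1, 1, 1]                    # 1 marks jailbreak inputs

# 1.0 means every attack outranks every benign input; 0.5 is chance level.
print(roc_auc_score(labels, scores))  # -> 1.0 for this toy separation
```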
- Secure Tug-of-War (SecTOW): Iterative Defense-Attack Training with Reinforcement Learning for Multimodal Model Security [63.41350337821108]
We propose Secure Tug-of-War (SecTOW) to enhance the security of multimodal large language models (MLLMs). SecTOW consists of two modules: a defender and an auxiliary attacker, both trained iteratively using reinforcement learning (GRPO). We show that SecTOW significantly improves security while preserving general performance.
arXiv Detail & Related papers (2025-07-29T17:39:48Z)
- Robust Anti-Backdoor Instruction Tuning in LVLMs [53.766434746801366]
We introduce a lightweight, certified-agnostic defense framework for large vision-language models (LVLMs). Our framework fine-tunes only adapter modules and text embedding layers under instruction tuning. Experiments against seven attacks on Flickr30k and MSCOCO demonstrate that our framework reduces their attack success rate to nearly zero.
arXiv Detail & Related papers (2025-06-04T01:23:35Z)
- Exposing the Ghost in the Transformer: Abnormal Detection for Large Language Models via Hidden State Forensics [5.384257830522198]
The deployment of Large Language Models (LLMs) in critical applications has introduced severe reliability and security risks. These vulnerabilities have been weaponized by malicious actors, leading to unauthorized access, widespread misinformation, and compromised system integrity. We introduce a novel approach to detecting abnormal behaviors in LLMs via hidden state forensics.
arXiv Detail & Related papers (2025-04-01T05:58:14Z)
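One way to read "hidden state forensics" is as a lightweight probe fitted on internal activations gathered from normal and abnormal interactions. The sketch below stands in with synthetic Gaussian features and a logistic-regression probe; the paper's actual feature design and classifier may differ.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, size=(200, 64))    # placeholder pooled states
abnormal = rng.normal(0.8, 1.0, size=(200, 64))  # shifted to be separable

X = np.vstack([normal, abnormal])
y = np.array([0] * 200 + [1] * 200)

probe = LogisticRegression(max_iter=1000).fit(X, y)
# At inference, a high predicted probability flags the interaction for review.
print(probe.predict_proba(abnormal[:1])[0, 1])
```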
- HiddenDetect: Detecting Jailbreak Attacks against Large Vision-Language Models via Monitoring Hidden States [17.601328965546617]
We investigate whether LVLMs inherently encode safety-relevant signals within their internal activations during inference. Our findings reveal that LVLMs exhibit distinct activation patterns when processing unsafe prompts. We introduce HiddenDetect, a novel tuning-free framework that harnesses internal model activations to enhance safety.
arXiv Detail & Related papers (2025-02-20T17:14:34Z)
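A tuning-free detector along these lines can be approximated by estimating a refusal direction from the model's own activations (a difference of means between refused and answered prompts) and monitoring per-layer projections at inference, with no parameter updates anywhere. The pooling and aggregation below are placeholders, not HiddenDetect's actual construction:

```python
import numpy as np

def refusal_direction(refused: np.ndarray, answered: np.ndarray) -> np.ndarray:
    # Difference-of-means direction separating prompts the model refused
    # from prompts it answered; computable without any fine-tuning.
    d = refused.mean(axis=0) - answered.mean(axis=0)
    return d / np.linalg.norm(d)

def safety_signal(hidden_per_layer, directions):
    # Project each layer's pooled hidden state onto its refusal direction;
    # unsafe prompts can light up this signal even when the final output
    # has been jailbroken into complying.
    return np.array([h @ d for h, d in zip(hidden_per_layer, directions)])

# A simple detector thresholds, e.g., the mean signal over the last layers.
```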
- Learning diverse attacks on large language models for robust red-teaming and safety tuning [126.32539952157083]
Red-teaming, or identifying prompts that elicit harmful responses, is a critical step in ensuring the safe deployment of large language models. We show that even with explicit regularization to favor novelty and diversity, existing approaches suffer from mode collapse or fail to generate effective attacks. We propose to use GFlowNet fine-tuning, followed by a secondary smoothing phase, to train the attacker model to generate diverse and effective attack prompts.
arXiv Detail & Related papers (2024-05-28T19:16:17Z)