Learning to Detect Unknown Jailbreak Attacks in Large Vision-Language Models
- URL: http://arxiv.org/abs/2510.15430v2
- Date: Mon, 20 Oct 2025 11:50:13 GMT
- Title: Learning to Detect Unknown Jailbreak Attacks in Large Vision-Language Models
- Authors: Shuang Liang, Zhihao Xu, Jialing Tao, Hui Xue, Xiting Wang
- Abstract summary: We propose a general framework that accurately detects unknown jailbreak attacks by shifting the focus from attack-specific learning to task-specific learning. Experiments show that our method consistently achieves higher detection AUROC on diverse unknown attacks while improving efficiency.
- Score: 22.796169894587475
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite extensive alignment efforts, Large Vision-Language Models (LVLMs) remain vulnerable to jailbreak attacks, posing serious safety risks. To address this, existing detection methods either learn attack-specific parameters, which hinders generalization to unseen attacks, or rely on heuristically sound principles, which limit accuracy and efficiency. To overcome these limitations, we propose Learning to Detect (LoD), a general framework that accurately detects unknown jailbreak attacks by shifting the focus from attack-specific learning to task-specific learning. This framework includes a Multi-modal Safety Concept Activation Vector module for safety-oriented representation learning and a Safety Pattern Auto-Encoder module for unsupervised attack classification. Extensive experiments show that our method achieves consistently higher detection AUROC on diverse unknown attacks while improving efficiency. The code is available at https://anonymous.4open.science/r/Learning-to-Detect-51CB.
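The two modules suggest a natural two-stage pipeline: learn a safety-oriented direction in the model's hidden states from benign safe/unsafe task data, then train an auto-encoder only on the resulting "safety patterns" of benign inputs so that unknown attacks surface as reconstruction anomalies. Below is a minimal sketch of that reading in PyTorch; the pooling, linear-probe training, and hyperparameters are illustrative assumptions, not the authors' implementation (the linked repository has the real one).

```python
import torch
import torch.nn as nn

# Stage 1 (illustrative): a safety concept activation vector per layer,
# obtained as the weight direction of a linear probe that separates pooled
# hidden states of safe vs. unsafe *task* data (no jailbreak examples).
def fit_safety_cav(safe_feats: torch.Tensor, unsafe_feats: torch.Tensor) -> torch.Tensor:
    X = torch.cat([safe_feats, unsafe_feats])
    y = torch.cat([torch.zeros(len(safe_feats)), torch.ones(len(unsafe_feats))])
    probe = nn.Linear(X.shape[1], 1)
    opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
    for _ in range(200):
        opt.zero_grad()
        loss = nn.functional.binary_cross_entropy_with_logits(probe(X).squeeze(-1), y)
        loss.backward()
        opt.step()
    w = probe.weight.detach().squeeze(0)
    return w / w.norm()  # unit-norm concept direction

def safety_pattern(hidden_per_layer, cavs):
    # One scalar per layer: the projection of each layer's pooled hidden
    # state onto that layer's concept direction. Shape: (batch, n_layers).
    return torch.stack([h @ w for h, w in zip(hidden_per_layer, cavs)], dim=-1)

# Stage 2 (illustrative): an auto-encoder trained only on benign safety
# patterns; reconstruction error is the unsupervised anomaly score.
class PatternAE(nn.Module):
    def __init__(self, n_layers: int, hidden: int = 8):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_layers, hidden), nn.ReLU())
        self.dec = nn.Linear(hidden, n_layers)

    def forward(self, s):
        return self.dec(self.enc(s))

def anomaly_score(ae: PatternAE, s: torch.Tensor) -> torch.Tensor:
    return ((ae(s) - s) ** 2).mean(dim=-1)  # high = likely jailbreak
```

Because the auto-encoder never sees jailbreak examples, the detector stays attack-agnostic: any input whose safety pattern falls off the benign manifold scores high, regardless of which attack produced it.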
Related papers
- ALERT: Zero-shot LLM Jailbreak Detection via Internal Discrepancy Amplification [47.135407245022115]
Existing detection methods mainly detect jailbreak status by relying on jailbreak templates present in the training data. We propose a layer-wise, module-wise, and token-wise amplification framework that progressively magnifies internal feature discrepancies between benign and jailbreak prompts. Building upon these insights, we introduce ALERT, an efficient and effective zero-shot jailbreak detector.
arXiv Detail & Related papers (2026-01-07T05:30:53Z)
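A zero-shot detector in this spirit can be sketched as follows: measure, per layer, how far a prompt's internal features drift from statistics estimated on benign prompts alone, and weight deeper layers more heavily so the aggregate discrepancy becomes separable. The pooled features, z-score drift, and geometric layer weighting below are assumptions for illustration, not ALERT's actual amplification scheme.

```python
import numpy as np

def amplified_discrepancy(acts, benign_mean, benign_std, amplify=1.5):
    """Score a prompt by its amplified per-layer drift from benign statistics.

    acts:        list of per-layer pooled activations, each of shape (d,)
    benign_mean: per-layer means estimated on benign prompts only
    benign_std:  per-layer standard deviations (same shapes)
    amplify:     hypothetical geometric weight favoring deeper layers
    """
    score = 0.0
    for i, (a, mu, sd) in enumerate(zip(acts, benign_mean, benign_std)):
        z = np.abs((a - mu) / (sd + 1e-6))  # normalized per-dimension drift
        score += (amplify ** i) * z.mean()  # later layers count for more
    return score

# Zero-shot decision rule: calibrate a threshold on benign prompts alone
# (e.g. the 99th percentile of their scores) and flag anything above it.
```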
- Rethinking Jailbreak Detection of Large Vision Language Models with Representational Contrastive Scoring [13.497048408038935]
Large Vision-Language Models (LVLMs) are vulnerable to a growing array of multimodal jailbreak attacks. Current anomaly-detection methods tend to confuse novel benign inputs with malicious ones, leading to unreliable over-rejection. We propose Representational Contrastive Scoring (RCS), a framework built on a key insight: the most potent safety signals reside within the LVLM's own internal representations.
arXiv Detail & Related papers (2025-12-12T22:31:38Z)
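Reading the abstract, the scoring idea is to contrast an input's internal representation against both unsafe and benign anchors, so that novel-but-benign inputs are not over-rejected. A minimal sketch, assuming pooled per-prompt representations, cosine similarity, and anchors built from a small calibration set (all assumptions, not details from the paper):

```python
import numpy as np

def cosine(a, M):
    # Cosine similarity between a vector a of shape (d,) and each row of M (k, d).
    return (M @ a) / (np.linalg.norm(M, axis=1) * np.linalg.norm(a) + 1e-9)

def contrastive_score(h, benign_anchors, unsafe_anchors):
    # High when h sits closer to known-unsafe anchors than to benign ones;
    # subtracting the benign term is what counteracts over-rejection.
    return cosine(h, unsafe_anchors).max() - cosine(h, benign_anchors).max()
```

A threshold on this score, calibrated on held-out benign inputs, would give the final accept/reject decision.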
- Immunity memory-based jailbreak detection: multi-agent adaptive guard for large language models [12.772312329709868]
Large language models (LLMs) have become foundational in AI systems, yet they remain vulnerable to adversarial jailbreak attacks. We propose the Multi-Agent Adaptive Guard (MAAG) framework for jailbreak detection. MAAG first extracts activation values from input prompts and compares them to historical activations stored in a memory bank for quick preliminary detection.
arXiv Detail & Related papers (2025-12-03T01:40:40Z)
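The memory-bank step admits a compact sketch: cache the activation vectors of previously adjudicated prompts alongside their verdicts, let a new prompt inherit the verdict of a sufficiently similar cached entry, and escalate everything else to the slower multi-agent analysis. The pooling, cosine matching, and 0.9 threshold below are illustrative assumptions:

```python
import numpy as np

class ActivationMemory:
    """Toy memory bank for quick preliminary screening (illustration only)."""

    def __init__(self, threshold: float = 0.9):
        self.vecs, self.labels = [], []
        self.threshold = threshold

    def add(self, v: np.ndarray, verdict: str):
        # Store unit-normalized activations with their adjudicated verdicts.
        self.vecs.append(v / np.linalg.norm(v))
        self.labels.append(verdict)

    def lookup(self, v: np.ndarray):
        # Return a cached verdict for a close-enough match, else None,
        # which signals escalation to the full multi-agent pipeline.
        if not self.vecs:
            return None
        sims = np.stack(self.vecs) @ (v / np.linalg.norm(v))
        i = int(np.argmax(sims))
        return self.labels[i] if sims[i] >= self.threshold else None
```

The threshold trades quick-path coverage against the risk of inheriting a stale or wrong verdict from memory.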
- DiffuGuard: How Intrinsic Safety is Lost and Found in Diffusion Large Language Models [50.21378052667732]
We conduct an in-depth analysis of dLLM vulnerabilities to jailbreak attacks across two distinct dimensions: intra-step and inter-step dynamics. We propose DiffuGuard, a training-free defense framework that addresses vulnerabilities through a dual-stage approach.
arXiv Detail & Related papers (2025-09-29T05:17:10Z)
- BlindGuard: Safeguarding LLM-based Multi-Agent Systems under Unknown Attacks [58.959622170433725]
BlindGuard is an unsupervised defense method that learns without requiring any attack-specific labels or prior knowledge of malicious behaviors. We show that BlindGuard effectively detects diverse attack types (i.e., prompt injection, memory poisoning, and tool attacks) across multi-agent systems.
arXiv Detail & Related papers (2025-08-11T16:04:47Z)
- Learning to Detect Unknown Jailbreak Attacks in Large Vision-Language Models: A Unified and Accurate Approach [22.248911000455706]
We propose a novel unsupervised framework that formulates jailbreak detection as anomaly detection. LoD achieves state-of-the-art performance, with an average AUROC of 0.9951 and an improvement of up to 38.89% in the minimum AUROC over the strongest baselines.
arXiv Detail & Related papers (2025-08-08T16:13:28Z)
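For readers less familiar with the metric: AUROC is the probability that a randomly chosen attack receives a higher anomaly score than a randomly chosen benign input, so 0.9951 indicates near-perfect ranking. A toy computation with scikit-learn, with scores and labels invented purely for illustration:

```python
from sklearn.metrics import roc_auc_score

scores = [0.05, 0.12, 0.09, 0.91, 0.88, 0.97]  # detector anomaly scores
labels = [0, 0, 0, 1, 1, 1]                    # 1 marks jailbreak inputs

# 1.0 means every attack outranks every benign input; 0.5 is chance level.
print(roc_auc_score(labels, scores))  # -> 1.0 for this toy separation
```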
- Secure Tug-of-War (SecTOW): Iterative Defense-Attack Training with Reinforcement Learning for Multimodal Model Security [63.41350337821108]
We propose Secure Tug-of-War (SecTOW) to enhance the security of multimodal large language models (MLLMs). SecTOW consists of two modules: a defender and an auxiliary attacker, both trained iteratively using reinforcement learning (GRPO). We show that SecTOW significantly improves security while preserving general performance.
arXiv Detail & Related papers (2025-07-29T17:39:48Z)
- Robust Anti-Backdoor Instruction Tuning in LVLMs [53.766434746801366]
We introduce a lightweight, certified-agnostic defense framework for large vision-language models (LVLMs). Our framework fine-tunes only adapter modules and text embedding layers under instruction tuning. Experiments against seven attacks on Flickr30k and MSCOCO demonstrate that our framework reduces their attack success rate to nearly zero.
arXiv Detail & Related papers (2025-06-04T01:23:35Z)
- Exposing the Ghost in the Transformer: Abnormal Detection for Large Language Models via Hidden State Forensics [5.384257830522198]
The deployment of Large Language Models (LLMs) in critical applications has introduced severe reliability and security risks. These vulnerabilities have been weaponized by malicious actors, leading to unauthorized access, widespread misinformation, and compromised system integrity. We introduce a novel approach to detecting abnormal behaviors in LLMs via hidden state forensics.
arXiv Detail & Related papers (2025-04-01T05:58:14Z)
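One way to read "hidden state forensics" is as a lightweight probe fitted on internal activations gathered from normal and abnormal interactions. The sketch below stands in with synthetic Gaussian features and a logistic-regression probe; the paper's actual feature design and classifier may differ.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, size=(200, 64))    # placeholder pooled states
abnormal = rng.normal(0.8, 1.0, size=(200, 64))  # shifted to be separable

X = np.vstack([normal, abnormal])
y = np.array([0] * 200 + [1] * 200)

probe = LogisticRegression(max_iter=1000).fit(X, y)
# At inference, a high predicted probability flags the interaction for review.
print(probe.predict_proba(abnormal[:1])[0, 1])
```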
- HiddenDetect: Detecting Jailbreak Attacks against Large Vision-Language Models via Monitoring Hidden States [17.601328965546617]
We investigate whether LVLMs inherently encode safety-relevant signals within their internal activations during inference. Our findings reveal that LVLMs exhibit distinct activation patterns when processing unsafe prompts. We introduce HiddenDetect, a novel tuning-free framework that harnesses internal model activations to enhance safety.
arXiv Detail & Related papers (2025-02-20T17:14:34Z)
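A tuning-free detector along these lines can be approximated by estimating a refusal direction from the model's own activations (a difference of means between refused and answered prompts) and monitoring per-layer projections at inference, with no parameter updates anywhere. The pooling and aggregation below are placeholders, not HiddenDetect's actual construction:

```python
import numpy as np

def refusal_direction(refused: np.ndarray, answered: np.ndarray) -> np.ndarray:
    # Difference-of-means direction separating prompts the model refused
    # from prompts it answered; computable without any fine-tuning.
    d = refused.mean(axis=0) - answered.mean(axis=0)
    return d / np.linalg.norm(d)

def safety_signal(hidden_per_layer, directions):
    # Project each layer's pooled hidden state onto its refusal direction;
    # unsafe prompts can light up this signal even when the final output
    # has been jailbroken into complying.
    return np.array([h @ d for h, d in zip(hidden_per_layer, directions)])

# A simple detector thresholds, e.g., the mean signal over the last layers.
```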
- Learning diverse attacks on large language models for robust red-teaming and safety tuning [126.32539952157083]
Red-teaming, or identifying prompts that elicit harmful responses, is a critical step in ensuring the safe deployment of large language models. We show that even with explicit regularization to favor novelty and diversity, existing approaches suffer from mode collapse or fail to generate effective attacks. We propose to use GFlowNet fine-tuning, followed by a secondary smoothing phase, to train the attacker model to generate diverse and effective attack prompts.
arXiv Detail & Related papers (2024-05-28T19:16:17Z)