Rethinking Jailbreak Detection of Large Vision Language Models with Representational Contrastive Scoring
- URL: http://arxiv.org/abs/2512.12069v1
- Date: Fri, 12 Dec 2025 22:31:38 GMT
- Title: Rethinking Jailbreak Detection of Large Vision Language Models with Representational Contrastive Scoring
- Authors: Peichun Hua, Hao Li, Shanghao Shi, Zhiyuan Yu, Ning Zhang,
- Abstract summary: Large Vision-Language Models (LVLMs) are vulnerable to a growing array of multimodal jailbreak attacks.<n>Current anomaly-detection methods tend to confuse novel benign inputs with malicious ones, leading to unreliable over-rejection.<n>We propose Representational Contrastive Scoring (RCS), a framework built on a key insight: the most potent safety signals reside within the LVLM's own internal representations.
- Score: 13.497048408038935
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Vision-Language Models (LVLMs) are vulnerable to a growing array of multimodal jailbreak attacks, necessitating defenses that are both generalizable to novel threats and efficient for practical deployment. Many current strategies fall short, either targeting specific attack patterns, which limits generalization, or imposing high computational overhead. While lightweight anomaly-detection methods offer a promising direction, we find that their common one-class design tends to confuse novel benign inputs with malicious ones, leading to unreliable over-rejection. To address this, we propose Representational Contrastive Scoring (RCS), a framework built on a key insight: the most potent safety signals reside within the LVLM's own internal representations. Our approach inspects the internal geometry of these representations, learning a lightweight projection to maximally separate benign and malicious inputs in safety-critical layers. This enables a simple yet powerful contrastive score that differentiates true malicious intent from mere novelty. Our instantiations, MCD (Mahalanobis Contrastive Detection) and KCD (K-nearest Contrastive Detection), achieve state-of-the-art performance on a challenging evaluation protocol designed to test generalization to unseen attack types. This work demonstrates that effective jailbreak detection can be achieved by applying simple, interpretable statistical methods to the appropriate internal representations, offering a practical path towards safer LVLM deployment. Our code is available on Github https://github.com/sarendis56/Jailbreak_Detection_RCS.
Related papers
- Jailbreaking Leaves a Trace: Understanding and Detecting Jailbreak Attacks from Internal Representations of Large Language Models [2.6140509675507384]
We study jailbreaking from both security and interpretability perspectives.<n>We propose a tensor-based latent representation framework that captures structure in hidden activations.<n>Our results provide evidence that jailbreak behavior is rooted in identifiable internal structures.
arXiv Detail & Related papers (2026-02-12T02:43:17Z) - Attributing and Exploiting Safety Vectors through Global Optimization in Large Language Models [50.91504059485288]
We propose a framework that identifies safety-critical attention heads through global optimization over all heads simultaneously.<n>We develop a novel inference-time white-box jailbreak method that exploits the identified safety vectors through activation repatching.
arXiv Detail & Related papers (2026-01-22T09:32:43Z) - ALERT: Zero-shot LLM Jailbreak Detection via Internal Discrepancy Amplification [47.135407245022115]
Existing detection methods mainly detect jailbreak status relying on jailbreak templates present in the training data.<n>We propose a layer-wise, module-wise, and token-wise amplification framework that progressively magnifies internal feature discrepancies between benign and jailbreak prompts.<n>Building upon these insights, we introduce ALERT, an efficient and effective zero-shot jailbreak detector.
arXiv Detail & Related papers (2026-01-07T05:30:53Z) - Learning to Detect Unknown Jailbreak Attacks in Large Vision-Language Models [22.796169894587475]
We propose a general framework that accurately detects unknown jailbreak attacks by shifting the focus from attack-specific learning to task-specific learning.<n>Experiments show that our method consistently higher detection AUROC on diverse unknown attacks while improving efficiency.
arXiv Detail & Related papers (2025-10-17T08:37:45Z) - DiffuGuard: How Intrinsic Safety is Lost and Found in Diffusion Large Language Models [50.21378052667732]
We conduct an in-depth analysis of dLLM vulnerabilities to jailbreak attacks across two distinct dimensions: intra-step and inter-step dynamics.<n>We propose DiffuGuard, a training-free defense framework that addresses vulnerabilities through a dual-stage approach.
arXiv Detail & Related papers (2025-09-29T05:17:10Z) - Learning to Detect Unknown Jailbreak Attacks in Large Vision-Language Models [22.796169894587475]
We propose a general framework that accurately detects unknown jailbreak attacks by shifting the focus from attack-specific learning to task-specific learning.<n>Experiments show that our method consistently higher detection AUROC on diverse unknown attacks while improving efficiency.
arXiv Detail & Related papers (2025-08-08T16:13:28Z) - HiddenDetect: Detecting Jailbreak Attacks against Large Vision-Language Models via Monitoring Hidden States [17.601328965546617]
We investigate whether LVLMs inherently encode safety-relevant signals within their internal activations during inference.<n>Our findings reveal that LVLMs exhibit distinct activation patterns when processing unsafe prompts.<n>We introduce HiddenDetect, a novel tuning-free framework that harnesses internal model activations to enhance safety.
arXiv Detail & Related papers (2025-02-20T17:14:34Z) - Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense [55.77152277982117]
We introduce Layer-AdvPatcher, a methodology designed to defend against jailbreak attacks.<n>We use an unlearning strategy to patch specific layers within large language models through self-augmented datasets.<n>Our framework reduces the harmfulness and attack success rate of jailbreak attacks.
arXiv Detail & Related papers (2025-01-05T19:06:03Z) - An Interpretable N-gram Perplexity Threat Model for Large Language Model Jailbreaks [87.64278063236847]
In this work, we propose a unified threat model for the principled comparison of jailbreak attacks.<n>Our threat model checks if a given jailbreak is likely to occur in the distribution of text.<n>We adapt popular attacks to this threat model, and, for the first time, benchmark these attacks on equal footing with it.
arXiv Detail & Related papers (2024-10-21T17:27:01Z) - White-box Multimodal Jailbreaks Against Large Vision-Language Models [61.97578116584653]
We propose a more comprehensive strategy that jointly attacks both text and image modalities to exploit a broader spectrum of vulnerability within Large Vision-Language Models.
Our attack method begins by optimizing an adversarial image prefix from random noise to generate diverse harmful responses in the absence of text input.
An adversarial text suffix is integrated and co-optimized with the adversarial image prefix to maximize the probability of eliciting affirmative responses to various harmful instructions.
arXiv Detail & Related papers (2024-05-28T07:13:30Z) - Prototype-supervised Adversarial Network for Targeted Attack of Deep
Hashing [65.32148145602865]
deep hashing networks are vulnerable to adversarial examples.
We propose a novel prototype-supervised adversarial network (ProS-GAN)
To the best of our knowledge, this is the first generation-based method to attack deep hashing networks.
arXiv Detail & Related papers (2021-05-17T00:31:37Z) - A Self-supervised Approach for Adversarial Robustness [105.88250594033053]
Adversarial examples can cause catastrophic mistakes in Deep Neural Network (DNNs) based vision systems.
This paper proposes a self-supervised adversarial training mechanism in the input space.
It provides significant robustness against the textbfunseen adversarial attacks.
arXiv Detail & Related papers (2020-06-08T20:42:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.