CeTAD: Towards Certified Toxicity-Aware Distance in Vision Language Models
- URL: http://arxiv.org/abs/2503.10661v2
- Date: Fri, 21 Mar 2025 20:05:22 GMT
- Title: CeTAD: Towards Certified Toxicity-Aware Distance in Vision Language Models
- Authors: Xiangyu Yin, Jiaxu Liu, Zhen Chen, Jinwei Hu, Yi Dong, Xiaowei Huang, Wenjie Ruan
- Abstract summary: We propose a universal certified defence framework to safeguard large vision-language models against jailbreak attacks. First, we propose a novel distance metric to quantify semantic discrepancies between malicious and intended responses. Then, we devise a regressed certification approach that employs randomized smoothing to provide formal robustness guarantees.
- Score: 16.5022773312661
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent advances in large vision-language models (VLMs) have demonstrated remarkable success across a wide range of visual understanding tasks. However, the robustness of these models against jailbreak attacks remains an open challenge. In this work, we propose a universal certified defence framework to safeguard VLMs rigorously against potential visual jailbreak attacks. First, we propose a novel distance metric to quantify semantic discrepancies between malicious and intended responses, capturing subtle differences often overlooked by conventional cosine similarity-based measures. Then, we devise a regressed certification approach that employs randomized smoothing to provide formal robustness guarantees against both adversarial and structural perturbations, even under black-box settings. Complementing this, our feature-space defence introduces noise distributions (e.g., Gaussian, Laplacian) into the latent embeddings to safeguard against both pixel-level and structure-level perturbations. Our results highlight the potential of a formally grounded, integrated strategy toward building more resilient and trustworthy VLMs.
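The feature-space smoothing idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`sample_noise`, `smoothed_score`), the toy score function, and all parameter values are assumptions made for illustration. The sketch only shows the generic randomized-smoothing step of averaging a scalar score over Gaussian or Laplacian noise added to a latent embedding; the paper's distance metric and certification bound are not reproduced here.

```python
import random
from typing import Callable, List, Sequence


def sample_noise(kind: str, scale: float, rng: random.Random) -> float:
    """Draw one zero-mean noise value; `kind` is 'gaussian' or 'laplace'."""
    if kind == "gaussian":
        return rng.gauss(0.0, scale)
    if kind == "laplace":
        # The difference of two i.i.d. exponentials with rate 1/scale
        # follows a Laplace(0, scale) distribution.
        rate = 1.0 / scale
        return rng.expovariate(rate) - rng.expovariate(rate)
    raise ValueError(f"unknown noise kind: {kind!r}")


def smoothed_score(
    score_fn: Callable[[Sequence[float]], float],
    embedding: Sequence[float],
    scale: float = 0.25,
    n_samples: int = 1000,
    kind: str = "gaussian",
    seed: int = 0,
) -> float:
    """Monte Carlo estimate of E[score_fn(embedding + noise)].

    `score_fn` stands in for any scalar score computed from a latent
    embedding (hypothetical here, e.g. a distance to a reference response).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        noisy: List[float] = [x + sample_noise(kind, scale, rng) for x in embedding]
        total += score_fn(noisy)
    return total / n_samples


if __name__ == "__main__":
    # Toy score (an assumption): the first coordinate, clipped to [0, 1].
    def toy_score(z: Sequence[float]) -> float:
        return min(1.0, max(0.0, z[0]))

    for kind in ("gaussian", "laplace"):
        print(kind, smoothed_score(toy_score, [0.5, -1.0], kind=kind))
```

Averaging over many noisy copies makes the resulting score change more gradually as the embedding is perturbed, which is the property that randomized-smoothing certificates build on.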
Related papers
- Multi-Faceted Attack: Exposing Cross-Model Vulnerabilities in Defense-Equipped Vision-Language Models [54.61181161508336]
We introduce Multi-Faceted Attack (MFA), a framework that exposes general safety vulnerabilities in leading defense-equipped Vision-Language Models (VLMs).
The core component of MFA is the Attention-Transfer Attack (ATA), which hides harmful instructions inside a meta task with competing objectives.
MFA achieves a 58.5% success rate and consistently outperforms existing methods.
arXiv Detail & Related papers (2025-11-20T07:12:54Z)
- ForgeDAN: An Evolutionary Framework for Jailbreaking Aligned Large Language Models [8.765213350762748]
Jailbreak attacks bypass alignment safeguards to elicit harmful outputs.
We propose ForgeDAN, a novel framework for generating semantically coherent and highly effective adversarial prompts.
Our evaluation demonstrates ForgeDAN achieves high jailbreaking success rates while maintaining naturalness and stealth, outperforming existing SOTA solutions.
arXiv Detail & Related papers (2025-11-17T16:19:21Z)
- Self-Calibrated Consistency can Fight Back for Adversarial Robustness in Vision-Language Models [31.920092341939593]
Self-Calibrated Consistency is an effective test-time defense against adversarial attacks.
SCC consistently improves the zero-shot robustness of CLIP while maintaining accuracy.
These findings highlight the great potential of establishing an adversarially robust paradigm from CLIP.
arXiv Detail & Related papers (2025-10-26T18:37:12Z)
- Seeing the Threat: Vulnerabilities in Vision-Language Models to Adversarial Attack [7.988475248750045]
Large Vision-Language Models (LVLMs) have shown remarkable capabilities across a wide range of multimodal tasks.
We conduct a systematic representational analysis to uncover why conventional adversarial attacks can circumvent the safety mechanisms embedded in LVLMs.
We propose a novel two-stage evaluation framework for adversarial attacks on LVLMs.
arXiv Detail & Related papers (2025-05-28T04:43:39Z)
- Retention Score: Quantifying Jailbreak Risks for Vision Language Models [60.48306899271866]
Vision-Language Models (VLMs) are integrated with Large Language Models (LLMs) to enhance multi-modal machine learning capabilities.
This paper aims to assess the resilience of VLMs against jailbreak attacks that can compromise model safety compliance and result in harmful outputs.
To evaluate a VLM's ability to maintain its robustness against adversarial input perturbations, we propose a novel metric called the Retention Score.
arXiv Detail & Related papers (2024-12-23T13:05:51Z)
- Antelope: Potent and Concealed Jailbreak Attack Strategy [7.970002819722513]
Antelope is a more robust and covert jailbreak attack strategy designed to expose security vulnerabilities inherent in generative models.
We successfully exploit the transferability of model-based attacks to penetrate online black-box services.
arXiv Detail & Related papers (2024-12-11T07:22:51Z)
- The Great Contradiction Showdown: How Jailbreak and Stealth Wrestle in Vision-Language Models? [23.347349690954452]
Vision-Language Models (VLMs) have achieved remarkable performance on a variety of tasks, yet they remain vulnerable to jailbreak attacks.
We provide an information-theoretic framework for understanding the fundamental trade-off between the effectiveness of these attacks and their stealthiness.
We propose an efficient algorithm for detecting non-stealthy jailbreak attacks, offering significant improvements in model robustness.
arXiv Detail & Related papers (2024-10-02T11:40:49Z)
- MirrorCheck: Efficient Adversarial Defense for Vision-Language Models [55.73581212134293]
We propose a novel, yet elegantly simple approach for detecting adversarial samples in Vision-Language Models.
Our method leverages Text-to-Image (T2I) models to generate images based on captions produced by target VLMs.
Empirical evaluations conducted on different datasets validate the efficacy of our approach.
arXiv Detail & Related papers (2024-06-13T15:55:04Z)
- White-box Multimodal Jailbreaks Against Large Vision-Language Models [61.97578116584653]
We propose a more comprehensive strategy that jointly attacks both text and image modalities to exploit a broader spectrum of vulnerability within Large Vision-Language Models.
Our attack method begins by optimizing an adversarial image prefix from random noise to generate diverse harmful responses in the absence of text input.
An adversarial text suffix is integrated and co-optimized with the adversarial image prefix to maximize the probability of eliciting affirmative responses to various harmful instructions.
arXiv Detail & Related papers (2024-05-28T07:13:30Z)
- Safeguarding Vision-Language Models Against Patched Visual Prompt Injectors [31.383591942592467]
Vision-language models (VLMs) offer innovative ways to combine visual and textual data for enhanced understanding and interaction.
Patch-based adversarial attack is considered the most realistic threat model in physical vision applications.
We introduce SmoothVLM, a defense mechanism rooted in smoothing techniques, to protect VLMs from the threat of patched visual prompt injectors.
arXiv Detail & Related papers (2024-05-17T04:19:19Z)
- Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing [107.97160023681184]
Aligned large language models (LLMs) are vulnerable to jailbreaking attacks.
We propose SEMANTICSMOOTH, a smoothing-based defense that aggregates predictions of semantically transformed copies of a given input prompt.
arXiv Detail & Related papers (2024-02-25T20:36:03Z)
- Visual Adversarial Examples Jailbreak Aligned Large Language Models [66.53468356460365]
We show that the continuous and high-dimensional nature of the visual input makes it a weak link against adversarial attacks.
We exploit visual adversarial examples to circumvent the safety guardrail of aligned LLMs with integrated vision.
Our study underscores the escalating adversarial risks associated with the pursuit of multimodality.
arXiv Detail & Related papers (2023-06-22T22:13:03Z)
- A Self-supervised Approach for Adversarial Robustness [105.88250594033053]
Adversarial examples can cause catastrophic mistakes in Deep Neural Network (DNN) based vision systems.
This paper proposes a self-supervised adversarial training mechanism in the input space.
It provides significant robustness against unseen adversarial attacks.
arXiv Detail & Related papers (2020-06-08T20:42:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.