CeTAD: Towards Certified Toxicity-Aware Distance in Vision Language Models
- URL: http://arxiv.org/abs/2503.10661v2
- Date: Fri, 21 Mar 2025 20:05:22 GMT
- Title: CeTAD: Towards Certified Toxicity-Aware Distance in Vision Language Models
- Authors: Xiangyu Yin, Jiaxu Liu, Zhen Chen, Jinwei Hu, Yi Dong, Xiaowei Huang, Wenjie Ruan,
- Abstract summary: We propose a universal certified defence framework to safeguard large vision-language models against jailbreak attacks.<n>First, we proposed a novel distance metric to quantify semantic discrepancies between malicious and intended responses.<n>Then, we devise a regressed certification approach that employs randomized smoothing to provide formal robustness guarantees.
- Score: 16.5022773312661
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent advances in large vision-language models (VLMs) have demonstrated remarkable success across a wide range of visual understanding tasks. However, the robustness of these models against jailbreak attacks remains an open challenge. In this work, we propose a universal certified defence framework to safeguard VLMs rigorously against potential visual jailbreak attacks. First, we proposed a novel distance metric to quantify semantic discrepancies between malicious and intended responses, capturing subtle differences often overlooked by conventional cosine similarity-based measures. Then, we devise a regressed certification approach that employs randomized smoothing to provide formal robustness guarantees against both adversarial and structural perturbations, even under black-box settings. Complementing this, our feature-space defence introduces noise distributions (e.g., Gaussian, Laplacian) into the latent embeddings to safeguard against both pixel-level and structure-level perturbations. Our results highlight the potential of a formally grounded, integrated strategy toward building more resilient and trustworthy VLMs.
Related papers
- Retention Score: Quantifying Jailbreak Risks for Vision Language Models [60.48306899271866]
Vision-Language Models (VLMs) are integrated with Large Language Models (LLMs) to enhance multi-modal machine learning capabilities.<n>This paper aims to assess the resilience of VLMs against jailbreak attacks that can compromise model safety compliance and result in harmful outputs.<n>To evaluate a VLM's ability to maintain its robustness against adversarial input perturbations, we propose a novel metric called the textbfRetention Score.
arXiv Detail & Related papers (2024-12-23T13:05:51Z) - Antelope: Potent and Concealed Jailbreak Attack Strategy [7.970002819722513]
Antelope is a more robust and covert jailbreak attack strategy designed to expose security vulnerabilities inherent in generative models.
We successfully exploit the transferability of model-based attacks to penetrate online black-box services.
arXiv Detail & Related papers (2024-12-11T07:22:51Z) - The Great Contradiction Showdown: How Jailbreak and Stealth Wrestle in Vision-Language Models? [23.347349690954452]
Vision-Language Models (VLMs) have achieved remarkable performance on a variety of tasks, yet they remain vulnerable to jailbreak attacks.<n>We provide an information-theoretic framework for understanding the fundamental trade-off between the effectiveness of these attacks and their stealthiness.<n>We propose an efficient algorithm for detecting non-stealthy jailbreak attacks, offering significant improvements in model robustness.
arXiv Detail & Related papers (2024-10-02T11:40:49Z) - MirrorCheck: Efficient Adversarial Defense for Vision-Language Models [55.73581212134293]
We propose a novel, yet elegantly simple approach for detecting adversarial samples in Vision-Language Models.
Our method leverages Text-to-Image (T2I) models to generate images based on captions produced by target VLMs.
Empirical evaluations conducted on different datasets validate the efficacy of our approach.
arXiv Detail & Related papers (2024-06-13T15:55:04Z) - White-box Multimodal Jailbreaks Against Large Vision-Language Models [61.97578116584653]
We propose a more comprehensive strategy that jointly attacks both text and image modalities to exploit a broader spectrum of vulnerability within Large Vision-Language Models.
Our attack method begins by optimizing an adversarial image prefix from random noise to generate diverse harmful responses in the absence of text input.
An adversarial text suffix is integrated and co-optimized with the adversarial image prefix to maximize the probability of eliciting affirmative responses to various harmful instructions.
arXiv Detail & Related papers (2024-05-28T07:13:30Z) - Safeguarding Vision-Language Models Against Patched Visual Prompt Injectors [31.383591942592467]
Vision-language models (VLMs) offer innovative ways to combine visual and textual data for enhanced understanding and interaction.
Patch-based adversarial attack is considered the most realistic threat model in physical vision applications.
We introduce SmoothVLM, a defense mechanism rooted in smoothing techniques, to protectVLMs from the threat of patched visual prompt injectors.
arXiv Detail & Related papers (2024-05-17T04:19:19Z) - Defending Large Language Models against Jailbreak Attacks via Semantic
Smoothing [107.97160023681184]
Aligned large language models (LLMs) are vulnerable to jailbreaking attacks.
We propose SEMANTICSMOOTH, a smoothing-based defense that aggregates predictions of semantically transformed copies of a given input prompt.
arXiv Detail & Related papers (2024-02-25T20:36:03Z) - Visual Adversarial Examples Jailbreak Aligned Large Language Models [66.53468356460365]
We show that the continuous and high-dimensional nature of the visual input makes it a weak link against adversarial attacks.
We exploit visual adversarial examples to circumvent the safety guardrail of aligned LLMs with integrated vision.
Our study underscores the escalating adversarial risks associated with the pursuit of multimodality.
arXiv Detail & Related papers (2023-06-22T22:13:03Z) - A Self-supervised Approach for Adversarial Robustness [105.88250594033053]
Adversarial examples can cause catastrophic mistakes in Deep Neural Network (DNNs) based vision systems.
This paper proposes a self-supervised adversarial training mechanism in the input space.
It provides significant robustness against the textbfunseen adversarial attacks.
arXiv Detail & Related papers (2020-06-08T20:42:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.