Related papers: Tit-for-Tat: Safeguarding Large Vision-Language Models Against Jailbreak Attacks via Adversarial Defense

Tit-for-Tat: Safeguarding Large Vision-Language Models Against Jailbreak Attacks via Adversarial Defense

URL: http://arxiv.org/abs/2503.11619v1
Date: Fri, 14 Mar 2025 17:39:45 GMT
Title: Tit-for-Tat: Safeguarding Large Vision-Language Models Against Jailbreak Attacks via Adversarial Defense
Authors: Shuyang Hao, Yiwei Wang, Bryan Hooi, Ming-Hsuan Yang, Jun Liu, Chengcheng Tang, Zi Huang, Yujun Cai,
Abstract summary: Large vision-language models (LVLMs) introduce a unique vulnerability: susceptibility to malicious attacks via visual inputs.<n>We propose ESIII (Embedding Security Instructions Into Images), a novel methodology for transforming the visual space from a source of vulnerability into an active defense mechanism.
Score: 90.71884758066042
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Deploying large vision-language models (LVLMs) introduces a unique vulnerability: susceptibility to malicious attacks via visual inputs. However, existing defense methods suffer from two key limitations: (1) They solely focus on textual defenses, fail to directly address threats in the visual domain where attacks originate, and (2) the additional processing steps often incur significant computational overhead or compromise model performance on benign tasks. Building on these insights, we propose ESIII (Embedding Security Instructions Into Images), a novel methodology for transforming the visual space from a source of vulnerability into an active defense mechanism. Initially, we embed security instructions into defensive images through gradient-based optimization, obtaining security instructions in the visual dimension. Subsequently, we integrate security instructions from visual and textual dimensions with the input query. The collaboration between security instructions from different dimensions ensures comprehensive security protection. Extensive experiments demonstrate that our approach effectively fortifies the robustness of LVLMs against such attacks while preserving their performance on standard benign tasks and incurring an imperceptible increase in time costs.

Related papers

Secure Tug-of-War (SecTOW): Iterative Defense-Attack Training with Reinforcement Learning for Multimodal Model Security [63.41350337821108]
We propose Secure Tug-of-War (SecTOW) to enhance the security of multimodal large language models (MLLMs)<n>SecTOW consists of two modules: a defender and an auxiliary attacker, both trained iteratively using reinforcement learning (GRPO)<n>We show that SecTOW significantly improves security while preserving general performance.
arXiv Detail & Related papers (2025-07-29T17:39:48Z)
The Safety Reminder: A Soft Prompt to Reactivate Delayed Safety Awareness in Vision-Language Models [4.27794555931853]
Vision-Language Models (VLMs) face unique vulnerabilities due to their multimodal nature, allowing adversaries to bypass safety guardrails and trigger the generation of harmful content.<n>We propose The Safety Reminder'', a soft prompt tuning approach that optimize learnable prompt tokens, which are periodically injected during the text generation process to enhance safety awareness.
arXiv Detail & Related papers (2025-06-15T12:48:38Z)
Robustifying Vision-Language Models via Dynamic Token Reweighting [28.675118345987887]
Large vision-language models (VLMs) are highly vulnerable to jailbreak attacks.<n>We present a novel inference-time defense that mitigates multimodal jailbreak attacks.<n>We introduce a new formulation of the safety-relevant distributional shift induced by the visual modality.
arXiv Detail & Related papers (2025-05-22T03:00:39Z)
Attack and defense techniques in large language models: A survey and new perspectives [5.600972861188751]
Large Language Models (LLMs) have become central to numerous natural language processing tasks, but their vulnerabilities present security and ethical challenges.<n>This systematic survey explores the evolving landscape of attack and defense techniques in LLMs.
arXiv Detail & Related papers (2025-05-02T03:37:52Z)
Manipulating Multimodal Agents via Cross-Modal Prompt Injection [34.35145839873915]
We identify a critical yet previously overlooked security vulnerability in multimodal agents. We propose CrossInject, a novel attack framework in which attackers embed adversarial perturbations across multiple modalities. Our method outperforms existing injection attacks, achieving at least a +26.4% increase in attack success rates.
arXiv Detail & Related papers (2025-04-19T16:28:03Z)
The VLLM Safety Paradox: Dual Ease in Jailbreak Attack and Defense [56.32083100401117]
The vulnerability of Vision Large Language Models (VLLMs) to jailbreak attacks appears as no surprise. Recent defense mechanisms against these attacks have reached near-saturation performance on benchmark evaluations.
arXiv Detail & Related papers (2024-11-13T07:57:19Z)
Mind Your Questions! Towards Backdoor Attacks on Text-to-Visualization Models [21.2448592823259]
VisPoison is a framework designed to identify these vulnerabilities of text-to-vis models systematically. We show that VisPoison achieves attack success rates of over 90%, highlighting the security problem of current text-to-vis models.
arXiv Detail & Related papers (2024-10-09T11:22:03Z)
Backdooring Vision-Language Models with Out-Of-Distribution Data [44.40928756056506]
Vision-Language Models (VLMs) generate detailed text descriptions from visual inputs. Despite their growing importance, the security of VLMs, particularly against backdoor attacks, is under explored. We introduce VLOOD (Backdooring Vision-Language Models with Out-of-Distribution Data), a novel approach with two key contributions.
arXiv Detail & Related papers (2024-10-02T06:21:00Z)
MirrorCheck: Efficient Adversarial Defense for Vision-Language Models [55.73581212134293]
We propose a novel, yet elegantly simple approach for detecting adversarial samples in Vision-Language Models. Our method leverages Text-to-Image (T2I) models to generate images based on captions produced by target VLMs. Empirical evaluations conducted on different datasets validate the efficacy of our approach.
arXiv Detail & Related papers (2024-06-13T15:55:04Z)
Safeguarding Vision-Language Models Against Patched Visual Prompt Injectors [31.383591942592467]
Vision-language models (VLMs) offer innovative ways to combine visual and textual data for enhanced understanding and interaction. Patch-based adversarial attack is considered the most realistic threat model in physical vision applications. We introduce SmoothVLM, a defense mechanism rooted in smoothing techniques, to protectVLMs from the threat of patched visual prompt injectors.
arXiv Detail & Related papers (2024-05-17T04:19:19Z)
BadCLIP: Dual-Embedding Guided Backdoor Attack on Multimodal Contrastive Learning [85.2564206440109]
This paper reveals the threats in this practical scenario that backdoor attacks can remain effective even after defenses. We introduce the emphtoolns attack, which is resistant to backdoor detection and model fine-tuning defenses.
arXiv Detail & Related papers (2023-11-20T02:21:49Z)
Attention-Based Real-Time Defenses for Physical Adversarial Attacks in Vision Applications [58.06882713631082]
Deep neural networks exhibit excellent performance in computer vision tasks, but their vulnerability to real-world adversarial attacks raises serious security concerns. This paper proposes an efficient attention-based defense mechanism that exploits adversarial channel-attention to quickly identify and track malicious objects in shallow network layers. It also introduces an efficient multi-frame defense framework, validating its efficacy through extensive experiments aimed at evaluating both defense performance and computational cost.
arXiv Detail & Related papers (2023-11-19T00:47:17Z)
Baseline Defenses for Adversarial Attacks Against Aligned Language Models [109.75753454188705]
Recent work shows that text moderations can produce jailbreaking prompts that bypass defenses. We look at three types of defenses: detection (perplexity based), input preprocessing (paraphrase and retokenization), and adversarial training. We find that the weakness of existing discretes for text, combined with the relatively high costs of optimization, makes standard adaptive attacks more challenging for LLMs.
arXiv Detail & Related papers (2023-09-01T17:59:44Z)

This list is automatically generated from the titles and abstracts of the papers in this site.