Multimodal Adversarial Defense for Vision-Language Models by Leveraging One-To-Many Relationships
- URL: http://arxiv.org/abs/2405.18770v2
- Date: Tue, 18 Mar 2025 14:32:07 GMT
- Title: Multimodal Adversarial Defense for Vision-Language Models by Leveraging One-To-Many Relationships
- Authors: Futa Waseda, Antonio Tejero-de-Pablos, Isao Echizen
- Abstract summary: This work is the first to explore defense strategies against multimodal attacks in vision-language (VL) tasks. We propose multimodal adversarial training (MAT), which incorporates adversarial perturbations in both image and text modalities during training. To address the limitation of deterministic one-to-one image-text pairs in the training data, we conduct a comprehensive study on leveraging one-to-many relationships to enhance robustness.
- Score: 9.059990548158716
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Pre-trained vision-language (VL) models are highly vulnerable to adversarial attacks. However, existing defense methods primarily focus on image classification, overlooking two key aspects of VL tasks: multimodal attacks, where both image and text can be perturbed, and the one-to-many relationship of images and texts, where a single image can correspond to multiple textual descriptions and vice versa (1:N and N:1). This work is the first to explore defense strategies against multimodal attacks in VL tasks, whereas prior VL defense methods focus on vision robustness. We propose multimodal adversarial training (MAT), which incorporates adversarial perturbations in both image and text modalities during training, significantly outperforming existing unimodal defenses. Furthermore, we discover that MAT is limited by deterministic one-to-one (1:1) image-text pairs in VL training data. To address this, we conduct a comprehensive study on leveraging one-to-many relationships to enhance robustness, investigating diverse augmentation techniques. Our analysis shows that, for a more effective defense, augmented image-text pairs should be well-aligned, diverse, yet avoid distribution shift -- conditions overlooked by prior research. Our experiments show that MAT can effectively be applied to different VL models and tasks to improve adversarial robustness, outperforming previous efforts. Our code will be made public upon acceptance.
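The abstract's central mechanism, multimodal adversarial training, crafts perturbations in both modalities inside each training step. Below is a minimal sketch of that idea under a CLIP-style contrastive objective; since the paper's code is not yet public, `image_encoder`, `text_encoder`, and `candidate_substitutions` are hypothetical interfaces used only for illustration.

```python
# Minimal sketch of a MAT-style training step (assumed interfaces, not the paper's code).
import torch
import torch.nn.functional as F

def pgd_image_attack(image_encoder, text_emb, images, eps=8/255, alpha=2/255, steps=3):
    """PGD on images: maximize misalignment with the paired text embeddings."""
    adv = images.clone().detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        img_emb = F.normalize(image_encoder(adv), dim=-1)
        loss = -(img_emb * text_emb).sum(dim=-1).mean()        # negative cosine similarity
        grad = torch.autograd.grad(loss, adv)[0]
        adv = adv.detach() + alpha * grad.sign()
        adv = images + (adv - images).clamp(-eps, eps)          # project into the eps-ball
        adv = adv.clamp(0, 1)
    return adv.detach()

def greedy_text_attack(text_encoder, tokens, img_emb, candidate_substitutions):
    """Greedy word substitution: keep the candidate that lowers image-text similarity most."""
    best, best_sim = tokens, float("inf")
    for cand in [tokens] + candidate_substitutions(tokens):     # hypothetical candidate generator
        with torch.no_grad():
            sim = F.cosine_similarity(text_encoder(cand), img_emb).mean()
        if sim < best_sim:
            best, best_sim = cand, sim
    return best

def mat_step(image_encoder, text_encoder, images, tokens, candidate_substitutions, optimizer, tau=0.07):
    with torch.no_grad():
        txt_emb = F.normalize(text_encoder(tokens), dim=-1)
        img_emb = F.normalize(image_encoder(images), dim=-1)
    adv_images = pgd_image_attack(image_encoder, txt_emb, images)
    adv_tokens = greedy_text_attack(text_encoder, tokens, img_emb, candidate_substitutions)
    # Symmetric InfoNCE loss on the adversarial image-text pairs.
    zi = F.normalize(image_encoder(adv_images), dim=-1)
    zt = F.normalize(text_encoder(adv_tokens), dim=-1)
    logits = zi @ zt.t() / tau
    labels = torch.arange(images.size(0), device=images.device)
    loss = 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```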
Related papers
- Non-Uniform Illumination Attack for Fooling Convolutional Neural Networks [15.79435346574302]
This study introduces a novel Non-Uniform Illumination (NUI) attack technique, where images are subtly altered using varying NUI masks.
Experiments are conducted on widely-accepted datasets including CIFAR10, TinyImageNet, and CalTech256.
Results show a significant drop in CNN classification accuracy on images perturbed by NUI attacks, and improved robustness when NUI-perturbed images are incorporated into training as data augmentation.
arXiv Detail & Related papers (2024-09-05T12:14:33Z)
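A rough illustration of the non-uniform illumination idea above: pixel intensities are modulated by a spatially varying mask rather than an additive Lp-bounded perturbation. The linear-gradient mask below is an assumed example; the paper defines its own family of NUI masks.

```python
# Toy non-uniform illumination (NUI)-style perturbation with a linear-gradient mask.
import numpy as np

def nui_gradient_mask(h, w, strength=0.4, direction="horizontal"):
    """Illumination mask ranging from (1 - strength) to (1 + strength) across the image."""
    ramp = np.linspace(1.0 - strength, 1.0 + strength, w if direction == "horizontal" else h)
    mask = np.tile(ramp, (h, 1)) if direction == "horizontal" else np.tile(ramp[:, None], (1, w))
    return mask.astype(np.float32)

def apply_nui(image, mask):
    """image: float array in [0, 1], shape (H, W, C); mask: (H, W)."""
    perturbed = image * mask[..., None]
    return np.clip(perturbed, 0.0, 1.0)

# Example: darken the left side and brighten the right side of a 32x32 CIFAR-style image.
img = np.random.rand(32, 32, 3).astype(np.float32)
attacked = apply_nui(img, nui_gradient_mask(32, 32, strength=0.4))
```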
- A Unified Understanding of Adversarial Vulnerability Regarding Unimodal Models and Vision-Language Pre-training Models [7.350203999073509]
Feature Guidance Attack (FGA) is a novel method that uses text representations to direct the perturbation of clean images.
Our method demonstrates stable and effective attack capabilities across various datasets, downstream tasks, and both black-box and white-box settings.
arXiv Detail & Related papers (2024-07-25T06:10:33Z)
- MirrorCheck: Efficient Adversarial Defense for Vision-Language Models [55.73581212134293]
We propose a novel, yet elegantly simple approach for detecting adversarial samples in Vision-Language Models.
Our method leverages Text-to-Image (T2I) models to generate images based on captions produced by target VLMs.
Empirical evaluations conducted on different datasets validate the efficacy of our approach.
arXiv Detail & Related papers (2024-06-13T15:55:04Z)
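A minimal sketch of the MirrorCheck-style detection loop described above, assuming placeholder `captioner`, `t2i_model`, and `embedder` callables and an illustrative similarity threshold.

```python
# Sketch of a MirrorCheck-style detector: regenerate an image from the VLM's caption
# with a text-to-image model and compare it to the input in a feature space.
import torch
import torch.nn.functional as F

@torch.no_grad()
def is_adversarial(image, captioner, t2i_model, embedder, threshold=0.5):
    caption = captioner(image)                # caption produced by the target VLM
    regenerated = t2i_model(caption)          # image synthesized from that caption
    sim = F.cosine_similarity(
        embedder(image).flatten(1),           # e.g., a frozen image encoder
        embedder(regenerated).flatten(1),
    ).item()
    # On clean inputs the regenerated image should stay close to the original;
    # adversarial inputs tend to yield captions (and regenerations) that drift away.
    return sim < threshold
```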
- Revisiting the Adversarial Robustness of Vision Language Models: a Multimodal Perspective [42.04728834962863]
Pretrained vision-language models (VLMs) like CLIP exhibit exceptional generalization across diverse downstream tasks.
Recent studies reveal their vulnerability to adversarial attacks, with defenses against text-based and multimodal attacks remaining largely unexplored.
This work presents the first comprehensive study on improving the adversarial robustness of VLMs against attacks targeting image, text, and multimodal inputs.
arXiv Detail & Related papers (2024-04-30T06:34:21Z)
- Few-Shot Adversarial Prompt Learning on Vision-Language Models [62.50622628004134]
The vulnerability of deep neural networks to imperceptible adversarial perturbations has attracted widespread attention.
Previous efforts achieved zero-shot adversarial robustness by aligning adversarial visual features with text supervision.
We propose a few-shot adversarial prompt framework in which adapting input sequences with limited data yields significant improvements in adversarial robustness.
arXiv Detail & Related papers (2024-03-21T18:28:43Z)
- VQAttack: Transferable Adversarial Attacks on Visual Question Answering via Pre-trained Models [58.21452697997078]
We propose a novel VQAttack model, which can generate both image and text perturbations with the designed modules.
Experimental results on two VQA datasets with five validated models demonstrate the effectiveness of the proposed VQAttack.
arXiv Detail & Related papers (2024-02-16T21:17:42Z)
- OT-Attack: Enhancing Adversarial Transferability of Vision-Language Models via Optimal Transport Optimization [65.57380193070574]
Vision-language pre-training models are vulnerable to multi-modal adversarial examples.
Recent works have indicated that leveraging data augmentation and image-text modal interactions can enhance the transferability of adversarial examples.
We propose an Optimal Transport-based Adversarial Attack, dubbed OT-Attack.
arXiv Detail & Related papers (2023-12-07T16:16:50Z)
- Adversarial Prompt Tuning for Vision-Language Models [86.5543597406173]
Adversarial Prompt Tuning (AdvPT) is a technique to enhance the adversarial robustness of image encoders in Vision-Language Models (VLMs).
We demonstrate that AdvPT improves resistance against white-box and black-box adversarial attacks and exhibits a synergistic effect when combined with existing image-processing-based defense techniques.
arXiv Detail & Related papers (2023-11-19T07:47:43Z)
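A rough sketch of the adversarial prompt tuning idea: only learnable context vectors in the text branch are optimized, against adversarial image embeddings computed for a frozen image encoder. The module and step below are assumptions for illustration, not the AdvPT implementation.

```python
# Sketch of adversarial prompt tuning: tune text-side context vectors only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnablePrompt(nn.Module):
    def __init__(self, n_ctx, dim):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)   # shared context vectors

    def forward(self, class_token_embeds):
        # class_token_embeds: (num_classes, n_cls_tokens, dim)
        ctx = self.ctx.unsqueeze(0).expand(class_token_embeds.size(0), -1, -1)
        return torch.cat([ctx, class_token_embeds], dim=1)         # prompt + class-name tokens

def advpt_step(prompt, text_encoder, class_token_embeds, adv_img_emb, labels, optimizer, tau=0.07):
    # adv_img_emb: adversarial image embeddings precomputed with the frozen image encoder.
    txt_emb = F.normalize(text_encoder(prompt(class_token_embeds)), dim=-1)   # (C, D)
    img_emb = F.normalize(adv_img_emb, dim=-1)                                # (B, D)
    logits = img_emb @ txt_emb.t() / tau
    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```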
- VLATTACK: Multimodal Adversarial Attacks on Vision-Language Tasks via Pre-trained Models [46.14455492739906]
Vision-Language (VL) pre-trained models have shown their superiority on many multimodal tasks.
Existing approaches mainly focus on exploring the adversarial robustness under the white-box setting.
We propose VLATTACK to generate adversarial samples by fusing perturbations of images and texts from both single-modal and multimodal levels.
arXiv Detail & Related papers (2023-10-07T02:18:52Z)
- Set-level Guidance Attack: Boosting Adversarial Transferability of Vision-Language Pre-training Models [52.530286579915284]
We present the first study to investigate the adversarial transferability of vision-language pre-training models.
The transferability degradation is partly caused by the under-utilization of cross-modal interactions.
We propose a highly transferable Set-level Guidance Attack (SGA) that thoroughly leverages modality interactions and incorporates alignment-preserving augmentation with cross-modal guidance.
arXiv Detail & Related papers (2023-07-26T09:19:21Z)
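A minimal sketch of the set-level guidance idea: the attack gradient is aggregated over a set of scale-augmented images and several paired captions instead of a single image-text pair. Encoders, caption embeddings, and hyperparameters are assumed placeholders.

```python
# Sketch of an SGA-style image perturbation guided by sets of captions and scales.
import torch
import torch.nn.functional as F

def sga_image_attack(image_encoder, text_embs, image, eps=8/255, alpha=2/255,
                     steps=10, scales=(0.75, 1.0, 1.25)):
    """image: (1, C, H, W); text_embs: (K, D) normalized embeddings of K captions."""
    adv = image.clone().detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        loss = 0.0
        for s in scales:                                   # alignment-preserving resizing
            scaled = F.interpolate(adv, scale_factor=s, mode="bilinear", align_corners=False)
            # assumes the encoder accepts variable input sizes (or resize back here)
            img_emb = F.normalize(image_encoder(scaled), dim=-1)   # (1, D)
            loss = loss - (img_emb @ text_embs.t()).mean()         # push away from all captions
        grad = torch.autograd.grad(loss, adv)[0]
        adv = adv.detach() + alpha * grad.sign()
        adv = image + (adv - image).clamp(-eps, eps)
        adv = adv.clamp(0, 1)
    return adv.detach()
```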
- VL-BEiT: Generative Vision-Language Pretraining [107.25298505511184]
We introduce a vision-language foundation model called VL-BEiT, which is a bidirectional multimodal Transformer learned by generative pretraining.
Specifically, we perform masked vision-language modeling on image-text pairs, masked language modeling on texts, and masked image modeling on images.
arXiv Detail & Related papers (2022-06-02T16:14:19Z)
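A toy sketch of the masked-prediction objectives behind this kind of generative pretraining: random text tokens and image patches are masked and later reconstructed. Shapes and mask ratios are illustrative assumptions, not VL-BEiT's exact recipe.

```python
# Toy masking utilities for masked language modeling and masked image modeling.
import torch

def mask_tokens(token_ids, mask_id, ratio=0.15):
    """Replace a random subset of text tokens with a [MASK] id; return inputs and targets."""
    mask = torch.rand_like(token_ids, dtype=torch.float) < ratio
    inputs = token_ids.clone()
    inputs[mask] = mask_id
    targets = token_ids.clone()
    targets[~mask] = -100                    # ignored positions for cross-entropy
    return inputs, targets, mask

def mask_patches(patch_embeds, mask_token, ratio=0.4):
    """Replace a random subset of image patch embeddings with a learned mask embedding."""
    b, n, d = patch_embeds.shape
    mask = torch.rand(b, n, device=patch_embeds.device) < ratio
    masked = patch_embeds.clone()
    masked[mask] = mask_token                # mask_token: (d,) learned vector
    return masked, mask
```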
- Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment [66.77841319057299]
We propose a novel unsupervised Vision-and-Language pre-training curriculum for non-parallel texts and images.
We first construct a weakly aligned image-text corpus via a retrieval-based approach, then apply a set of multi-granular alignment pre-training tasks.
A comprehensive ablation study shows each granularity is helpful to learn a stronger pre-trained model.
arXiv Detail & Related papers (2022-03-01T05:34:01Z)
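A toy sketch of the retrieval step for building a weakly aligned corpus from non-parallel images and texts, assuming placeholder encoders; the paper's multi-granular retrieval and alignment tasks are more elaborate.

```python
# Sketch of retrieval-based weak alignment between unpaired images and captions.
import torch
import torch.nn.functional as F

@torch.no_grad()
def build_weakly_aligned_corpus(image_encoder, text_encoder, images, texts, top_k=3):
    img_emb = F.normalize(image_encoder(images), dim=-1)       # (N_img, D)
    txt_emb = F.normalize(text_encoder(texts), dim=-1)         # (N_txt, D)
    sims = img_emb @ txt_emb.t()                                # (N_img, N_txt)
    topk = sims.topk(top_k, dim=-1).indices                     # k candidate captions per image
    return [(i, topk[i].tolist()) for i in range(len(images))]
```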
- Dual Manifold Adversarial Robustness: Defense against Lp and non-Lp Adversarial Attacks [154.31827097264264]
Adversarial training is a popular defense strategy against attack threat models with bounded Lp norms.
We propose Dual Manifold Adversarial Training (DMAT) where adversarial perturbations in both latent and image spaces are used in robustifying the model.
Our DMAT improves performance on normal images and achieves robustness against Lp attacks comparable to standard adversarial training.
arXiv Detail & Related papers (2020-09-05T06:00:28Z)
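A minimal sketch of the dual-manifold idea: one adversary perturbs a generator's latent code (on-manifold), another perturbs pixels directly (off-manifold), and training uses both. The classifier, generator, and latent codes are assumed inputs, not the paper's exact pipeline.

```python
# Sketch of DMAT-style example generation in latent and image spaces.
import torch
import torch.nn.functional as F

def pgd(loss_fn, x, eps, alpha, steps):
    adv = x.clone().detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        grad = torch.autograd.grad(loss_fn(adv), adv)[0]
        adv = adv.detach() + alpha * grad.sign()
        adv = x + (adv - x).clamp(-eps, eps)
    return adv.detach()

def dmat_examples(classifier, generator, images, latents, labels,
                  eps_img=8/255, eps_lat=0.02):
    """latents: codes such that generator(latents) approximately reconstructs images."""
    # On-manifold adversary: perturb latent codes, then decode back to images.
    adv_latents = pgd(lambda z: F.cross_entropy(classifier(generator(z)), labels),
                      latents, eps_lat, eps_lat / 4, steps=5)
    on_manifold = generator(adv_latents).clamp(0, 1)
    # Off-manifold adversary: standard Lp perturbation in pixel space.
    off_manifold = pgd(lambda x: F.cross_entropy(classifier(x), labels),
                       images, eps_img, eps_img / 4, steps=5).clamp(0, 1)
    return on_manifold, off_manifold
```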