OT-Attack: Enhancing Adversarial Transferability of Vision-Language
Models via Optimal Transport Optimization
- URL: http://arxiv.org/abs/2312.04403v1
- Date: Thu, 7 Dec 2023 16:16:50 GMT
- Title: OT-Attack: Enhancing Adversarial Transferability of Vision-Language
Models via Optimal Transport Optimization
- Authors: Dongchen Han, Xiaojun Jia, Yang Bai, Jindong Gu, Yang Liu, and
Xiaochun Cao
- Abstract summary: Vision-language pre-training models are vulnerable to multi-modal adversarial examples.
Recent works have indicated that leveraging data augmentation and image-text modal interactions can enhance the transferability of adversarial examples.
We propose an Optimal Transport-based Adversarial Attack, dubbed OT-Attack.
- Score: 65.57380193070574
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-language pre-training (VLP) models demonstrate impressive abilities in
processing both images and text. However, they are vulnerable to multi-modal
adversarial examples (AEs). Investigating the generation of
high-transferability adversarial examples is crucial for uncovering VLP models'
vulnerabilities in practical scenarios. Recent works have indicated that
leveraging data augmentation and image-text modal interactions can enhance the
transferability of adversarial examples for VLP models significantly. However,
they do not consider the optimal alignment problem between data-augmented
image-text pairs. This oversight leads to adversarial examples that are overly
tailored to the source model, thus limiting improvements in transferability. In
our research, we first explore the interplay between image sets produced
through data augmentation and their corresponding text sets. We find that
augmented image samples can align optimally with certain texts while exhibiting
less relevance to others. Motivated by this, we propose an Optimal
Transport-based Adversarial Attack, dubbed OT-Attack. The proposed method
formulates the features of image and text sets as two distinct distributions
and employs optimal transport theory to determine the most efficient mapping
between them. This optimal mapping informs our generation of adversarial
examples to effectively counteract the overfitting issues. Extensive
experiments across various network architectures and datasets in image-text
matching tasks reveal that our OT-Attack outperforms existing state-of-the-art
methods in terms of adversarial transferability.
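The abstract gives no implementation details, so the following is a minimal PyTorch sketch of the general idea rather than the authors' method: augmented-image features and text features are treated as two discrete distributions, an entropy-regularised Sinkhorn solver approximates the optimal transport plan between them, and that plan weights the image-text similarities that a PGD-style step then pushes down. The names image_encoder, text_encoder, and augment, the Sinkhorn solver itself, and the step sizes are all assumptions introduced for illustration.

```python
# Minimal, illustrative sketch of an OT-guided multimodal attack step.
# NOT the authors' implementation; encoders, augmentation, and hyperparameters
# below are hypothetical placeholders.
import torch
import torch.nn.functional as F

def sinkhorn(cost, eps=0.1, iters=50):
    """Entropy-regularised OT plan between two uniform marginals (assumption:
    the paper may use exact OT instead; Sinkhorn is used here for brevity)."""
    n, m = cost.shape
    a = torch.full((n,), 1.0 / n, device=cost.device)
    b = torch.full((m,), 1.0 / m, device=cost.device)
    K = torch.exp(-cost / eps)                     # Gibbs kernel
    u = torch.ones_like(a)
    for _ in range(iters):
        v = b / (K.t() @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]             # transport plan, shape (n, m)

def ot_attack_step(image, texts, image_encoder, text_encoder, augment, delta,
                   alpha=1 / 255, epsilon=8 / 255):
    """One PGD-style update that lowers the OT-matched image-text similarity.
    `augment` must be differentiable (e.g. resizing/cropping) so gradients
    reach the pixels; the encoders return feature vectors of equal dimension."""
    adv = (image + delta).clamp(0, 1).detach().requires_grad_(True)
    img_feats = torch.stack([image_encoder(view) for view in augment(adv)])  # (n, d)
    txt_feats = torch.stack([text_encoder(t) for t in texts])                # (m, d)
    img_feats = F.normalize(img_feats, dim=-1)
    txt_feats = F.normalize(txt_feats, dim=-1)
    sim = img_feats @ txt_feats.t()                # cosine similarities (n, m)
    with torch.no_grad():
        plan = sinkhorn(1.0 - sim)                 # OT plan for cost = 1 - sim
    loss = (plan * sim).sum()                      # OT-weighted alignment
    loss.backward()
    with torch.no_grad():                          # descend to break alignment
        delta = (delta - alpha * adv.grad.sign()).clamp(-epsilon, epsilon)
    return delta.detach()
```

Per the abstract, weighting by the transport plan is meant to stop the perturbation from overfitting to whichever augmented view happens to align best with a single caption on the source model; how the plan enters the final loss in the paper is not specified here, so the weighted sum above is only one plausible instantiation.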
Related papers
- Semantic-Aligned Adversarial Evolution Triangle for High-Transferability Vision-Language Attack [51.16384207202798]
Vision-language pre-training models are vulnerable to multimodal adversarial examples (AEs).
Previous approaches augment image-text pairs to enhance diversity within the adversarial example generation process.
We propose sampling from adversarial evolution triangles composed of clean, historical, and current adversarial examples to enhance adversarial diversity.
arXiv Detail & Related papers (2024-11-04T23:07:51Z)
- A Unified Understanding of Adversarial Vulnerability Regarding Unimodal Models and Vision-Language Pre-training Models [7.350203999073509]
Feature Guidance Attack (FGA) is a novel method that uses text representations to direct the perturbation of clean images.
Our method demonstrates stable and effective attack capabilities across various datasets, downstream tasks, and both black-box and white-box settings.
arXiv Detail & Related papers (2024-07-25T06:10:33Z)
- Multimodal Large Language Model is a Human-Aligned Annotator for Text-to-Image Generation [87.50120181861362]
VisionPrefer is a high-quality and fine-grained preference dataset that captures multiple preference aspects.
We train a reward model, VP-Score, over VisionPrefer to guide the training of text-to-image generative models, and the preference prediction accuracy of VP-Score is comparable to that of human annotators.
arXiv Detail & Related papers (2024-04-23T14:53:15Z)
- Boosting Transferability in Vision-Language Attacks via Diversification along the Intersection Region of Adversarial Trajectory [8.591762884862504]
Vision-language pre-training models are susceptible to multimodal adversarial examples (AEs).
We propose using diversification along the intersection region of adversarial trajectory to expand the diversity of AEs.
To further mitigate potential overfitting, we direct the adversarial text to deviate from the last intersection region along the optimization path.
arXiv Detail & Related papers (2024-03-19T05:10:10Z)
- SA-Attack: Improving Adversarial Transferability of Vision-Language Pre-training Models via Self-Augmentation [56.622250514119294]
In contrast to white-box adversarial attacks, transfer attacks are more reflective of real-world scenarios.
We propose a self-augment-based transfer attack method, termed SA-Attack.
arXiv Detail & Related papers (2023-12-08T09:08:50Z)
- UniDiff: Advancing Vision-Language Models with Generative and Discriminative Learning [86.91893533388628]
This paper presents UniDiff, a unified multi-modal model that integrates image-text contrastive learning (ITC), text-conditioned image synthesis learning (IS), and reciprocal semantic consistency modeling (RSC).
UniDiff demonstrates versatility in both multi-modal understanding and generative tasks.
arXiv Detail & Related papers (2023-06-01T15:39:38Z)
- MAF: Multimodal Alignment Framework for Weakly-Supervised Phrase Grounding [74.33171794972688]
We present algorithms to model phrase-object relevance by leveraging fine-grained visual representations and visually-aware language representations.
Experiments conducted on the widely-adopted Flickr30k dataset show a significant improvement over existing weakly-supervised methods.
arXiv Detail & Related papers (2020-10-12T00:43:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.