VQAttack: Transferable Adversarial Attacks on Visual Question Answering
via Pre-trained Models
- URL: http://arxiv.org/abs/2402.11083v1
- Date: Fri, 16 Feb 2024 21:17:42 GMT
- Title: VQAttack: Transferable Adversarial Attacks on Visual Question Answering
via Pre-trained Models
- Authors: Ziyi Yin, Muchao Ye, Tianrong Zhang, Jiaqi Wang, Han Liu, Jinghui
Chen, Ting Wang, Fenglong Ma
- Abstract summary: We propose a novel VQAttack model, which can generate both image and text perturbations with the designed modules.
Experimental results on two VQA datasets with five validated models demonstrate the effectiveness of the proposed VQAttack.
- Score: 58.21452697997078
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual Question Answering (VQA) is a fundamental task in computer vision and
natural language processing fields. Although the ``pre-training & fine-tuning''
learning paradigm significantly improves the VQA performance, the adversarial
robustness of such a learning paradigm has not been explored. In this paper, we
delve into a new problem: using a pre-trained multimodal source model to create
adversarial image-text pairs and then transferring them to attack the target
VQA models. Correspondingly, we propose a novel VQAttack model, which can
iteratively generate both image and text perturbations with the designed
modules: the large language model (LLM)-enhanced image attack and the
cross-modal joint attack module. At each iteration, the LLM-enhanced image
attack module first optimizes the latent representation-based loss to generate
feature-level image perturbations. Then it incorporates an LLM to further
enhance the image perturbations by optimizing the designed masked answer
anti-recovery loss. The cross-modal joint attack module will be triggered at a
specific iteration, which updates the image and text perturbations
sequentially. Notably, the text perturbation updates are based on both the
learned gradients in the word embedding space and word synonym-based
substitution. Experimental results on two VQA datasets with five validated
models demonstrate the effectiveness of the proposed VQAttack in the
transferable attack setting, compared with state-of-the-art baselines. This
work reveals a significant blind spot in the ``pre-training & fine-tuning''
paradigm on VQA tasks. Source codes will be released.
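To make the attack loop described above concrete, the following is a minimal PyTorch-style sketch, not the authors' released implementation: it alternates a PGD-style image update driven by a latent-representation loss with a periodic cross-modal step that perturbs the question via word-embedding gradients and synonym substitution (only the text side of the joint step is shown, and the LLM-enhanced masked answer anti-recovery loss is collapsed into a single answer score). The names model.encode_image, model.word_embeddings, model.answer_score, and the synonyms lookup (token id to synonym token id) are hypothetical placeholders, and a batch size of one is assumed.

import torch
import torch.nn.functional as F

def vqattack_sketch(model, image, question_ids, synonyms,
                    steps=40, eps=8 / 255, alpha=2 / 255, joint_every=10):
    """Sketch of an iterative transferable attack on a white-box source model:
    image perturbation via a latent-representation loss, plus an occasional
    joint step that swaps one question token for a synonym. The model API is
    a hypothetical stand-in; hyperparameters are illustrative defaults."""
    clean_latent = model.encode_image(image).detach()      # frozen clean image features
    delta = torch.zeros_like(image, requires_grad=True)    # additive image perturbation

    for t in range(steps):
        # Image attack: push the perturbed latent features away from the clean ones.
        latent = model.encode_image(image + delta)
        loss = -F.cosine_similarity(latent.flatten(1), clean_latent.flatten(1)).mean()
        loss.backward()
        with torch.no_grad():                               # PGD step inside an L_inf ball
            delta += alpha * delta.grad.sign()
            delta.clamp_(-eps, eps)
            delta.grad.zero_()

        # Cross-modal joint step, triggered every `joint_every` iterations.
        if (t + 1) % joint_every == 0:
            # Treat the question embeddings as a leaf so we can read their gradients.
            emb = model.word_embeddings(question_ids).detach().requires_grad_(True)
            score = model.answer_score((image + delta).detach(), emb)
            (-score).backward()                             # lower the correct-answer score
            pos = int(emb.grad.norm(dim=-1).squeeze(0).argmax())
            replacement = synonyms.get(int(question_ids[0, pos]))
            if replacement is not None:                     # swap the most sensitive token for a synonym
                question_ids[0, pos] = replacement

    return (image + delta).detach(), question_ids

The perturbation budget eps, the step size alpha, and the joint-attack interval here are illustrative; in the transferable setting, the resulting adversarial image-question pair would then be passed to the target VQA models, which are only queried as black boxes.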
Related papers
- VL-Trojan: Multimodal Instruction Backdoor Attacks against
Autoregressive Visual Language Models [65.23688155159398]
Autoregressive Visual Language Models (VLMs) showcase impressive few-shot learning capabilities in a multimodal context.
Recently, multimodal instruction tuning has been proposed to further enhance instruction-following abilities.
Adversaries can implant a backdoor by injecting poisoned samples with triggers embedded in instructions or images.
We propose a multimodal instruction backdoor attack, namely VL-Trojan.
arXiv Detail & Related papers (2024-02-21T14:54:30Z)
- VLATTACK: Multimodal Adversarial Attacks on Vision-Language Tasks via Pre-trained Models [46.14455492739906]
Vision-Language (VL) pre-trained models have shown their superiority on many multimodal tasks.
Existing approaches mainly explore adversarial robustness in the white-box setting.
We propose VLATTACK to generate adversarial samples by fusing perturbations of images and texts from both single-modal and multimodal levels.
arXiv Detail & Related papers (2023-10-07T02:18:52Z)
- Not All Image Regions Matter: Masked Vector Quantization for Autoregressive Image Generation [78.13793505707952]
Existing autoregressive models follow the two-stage generation paradigm that first learns a codebook in the latent space for image reconstruction and then completes the image generation autoregressively based on the learned codebook.
We propose a novel two-stage framework, consisting of a Masked Quantization VAE (MQ-VAE) and a Stackformer, to remove modeling redundancy.
arXiv Detail & Related papers (2023-05-23T02:15:53Z)
- Stare at What You See: Masked Image Modeling without Reconstruction [154.74533119863864]
Masked Autoencoders (MAE) have become a prevailing paradigm for large-scale vision representation pre-training.
Recent approaches apply semantic-rich teacher models to extract image features as the reconstruction target, leading to better performance.
We argue the features extracted by powerful teacher models already encode rich semantic correlation across regions in an intact image.
arXiv Detail & Related papers (2022-11-16T12:48:52Z)
- Continual VQA for Disaster Response Systems [0.0]
Visual Question Answering (VQA) is a multi-modal task that involves answering questions from an input image.
The main challenge is the delay caused by generating labels for the assessment of the affected areas.
We deploy a pre-trained CLIP model, which is trained on image-text pairs.
We surpass previous state-of-the-art results on the FloodNet dataset.
arXiv Detail & Related papers (2022-09-21T12:45:51Z)
- Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment [40.677139679304936]
We propose a new framework, dubbed ViCHA, that efficiently exploits the input data to boost learning via: (a) a new hierarchical cross-modal alignment loss, (b) a new self-supervised scheme based on masked image modeling, and (c) leveraging image-level annotations.
Although pretrained on four times less data, our ViCHA strategy outperforms other approaches on several downstream tasks such as Image-Text Retrieval, VQA, Visual Reasoning, Visual Entailment and Visual Grounding.
arXiv Detail & Related papers (2022-08-29T14:24:08Z)
- Vector-quantized Image Modeling with Improved VQGAN [93.8443646643864]
We propose a Vector-quantized Image Modeling approach that involves pretraining a Transformer to predict image tokens autoregressively.
We first propose multiple improvements over vanilla VQGAN from architecture to codebook learning, yielding better efficiency and reconstruction fidelity.
When trained on ImageNet at 256x256 resolution, we achieve Inception Score (IS) of 175.1 and Frechet Inception Distance (FID) of 4.17, a dramatic improvement over the vanilla VQGAN.
arXiv Detail & Related papers (2021-10-09T18:36:00Z)
- Analysis on Image Set Visual Question Answering [0.3359875577705538]
We tackle the challenge of Visual Question Answering in a multi-image setting.
Traditional VQA tasks have focused on a single-image setting where the target answer is generated from a single image.
In this report, we work with 4 approaches in a bid to improve the performance on the task.
arXiv Detail & Related papers (2021-03-31T20:47:32Z)
- Counterfactual Samples Synthesizing for Robust Visual Question Answering [104.72828511083519]
We propose a model-agnostic Counterfactual Samples Synthesizing (CSS) training scheme.
CSS generates numerous counterfactual training samples by masking critical objects in images or words in questions.
We achieve a record-breaking performance of 58.95% on VQA-CP v2, with 6.5% gains.
arXiv Detail & Related papers (2020-03-14T08:34:31Z)