Controlled Caption Generation for Images Through Adversarial Attacks
- URL: http://arxiv.org/abs/2107.03050v1
- Date: Wed, 7 Jul 2021 07:22:41 GMT
- Title: Controlled Caption Generation for Images Through Adversarial Attacks
- Authors: Nayyer Aafaq, Naveed Akhtar, Wei Liu, Mubarak Shah and Ajmal Mian
- Abstract summary: We study adversarial examples for vision and language models, which typically adopt a Convolutional Neural Network (CNN) for image feature extraction and a Recurrent Neural Network (RNN) for caption generation.
In particular, we investigate attacks on the visual encoder's hidden layer that is fed to the subsequent recurrent network.
We propose a GAN-based algorithm for crafting adversarial examples for neural image captioning that mimics the internal representation of the CNN.
- Score: 85.66266989600572
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Deep learning is known to be vulnerable to adversarial examples. However, its adversarial susceptibility in image caption generation is under-explored. We study adversarial examples for vision and language models, which typically adopt an encoder-decoder framework consisting of two major components: a Convolutional Neural Network (CNN) for image feature extraction and a Recurrent Neural Network (RNN) for caption generation. In particular, we investigate attacks on the visual encoder's hidden layer that is fed to the subsequent recurrent network. Existing methods either attack the classification layer of the visual encoder or back-propagate gradients from the language model. In contrast, we propose a GAN-based algorithm for crafting adversarial examples for neural image captioning that mimics the internal representation of the CNN, such that the resulting deep features of the input image enable controlled, incorrect caption generation through the recurrent network. Our contribution provides new insights for understanding adversarial attacks on vision systems with a language component. The proposed method employs two strategies for a comprehensive evaluation. The first examines whether a neural image captioning system can be misled into outputting targeted image captions. The second analyzes the possibility of diffusing targeted keywords into the predicted captions. Experiments show that our algorithm can craft effective adversarial images from the CNN's hidden layers that fool the captioning framework. Moreover, we find the proposed attack to be highly transferable. Our work leads to new robustness implications for neural image captioning.
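The attacked interface is concrete enough to sketch. Below is a minimal, hypothetical illustration of the feature-mimicking idea: perturb a source image so that the encoder's hidden-layer features approach those of a target image, which in turn steers the downstream RNN decoder toward the target's caption. It deliberately uses plain projected gradient descent rather than the paper's GAN-based generator, and the choice of ResNet-50 pooled features as the attacked hidden layer (plus all names and hyperparameters) is an assumption for illustration only.

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Sketch of feature mimicking: drive the CNN's hidden features for x_src
# toward those of x_tgt under an L-inf budget. Inputs are [B, 3, 224, 224]
# tensors in [0, 1]; ImageNet normalization is omitted for brevity.
cnn = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).eval()
extract = torch.nn.Sequential(*list(cnn.children())[:-1])  # drop the classifier

def hidden_features(x):
    return extract(x).flatten(1)  # [B, 2048] features a decoder would consume

def feature_mimic_attack(x_src, x_tgt, eps=8 / 255, steps=200, lr=1e-2):
    delta = torch.zeros_like(x_src, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    with torch.no_grad():
        f_tgt = hidden_features(x_tgt)               # features to mimic
    for _ in range(steps):
        opt.zero_grad()
        f_adv = hidden_features((x_src + delta).clamp(0, 1))
        F.mse_loss(f_adv, f_tgt).backward()          # pull features together
        opt.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)                  # imperceptibility budget
    return (x_src + delta).detach().clamp(0, 1)
```

If the mimicry succeeds, any decoder that consumes these features should emit a caption close to the target image's caption, consistent with the transferability the abstract reports.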
Related papers
- A Comparative Study of Pre-trained CNNs and GRU-Based Attention for Image Caption Generation [9.490898534790977]
This paper proposes a deep neural framework for image caption generation using a GRU-based attention mechanism.
Our approach employs multiple pre-trained convolutional neural networks as the encoder to extract features from the image and a GRU-based language model as the decoder to generate sentences.
arXiv Detail & Related papers (2023-10-11T07:30:01Z)
- DeViL: Decoding Vision features into Language [53.88202366696955]
Post-hoc explanation methods have often been criticised for abstracting away the decision-making process of deep neural networks.
In this work, we would like to provide natural language descriptions for what different layers of a vision backbone have learned.
We train a transformer network to translate individual image features of any vision layer into a prompt that a separate off-the-shelf language model decodes into natural language.
arXiv Detail & Related papers (2023-09-04T13:59:55Z)
- Multimodal Neurons in Pretrained Text-Only Transformers [52.20828443544296]
We identify "multimodal neurons" that convert visual representations into corresponding text.
We show that multimodal neurons operate on specific visual concepts across inputs, and have a systematic causal effect on image captioning.
arXiv Detail & Related papers (2023-08-03T05:27:12Z)
- Seeing in Words: Learning to Classify through Language Bottlenecks [59.97827889540685]
Humans can explain their predictions using succinct and intuitive descriptions.
We show that a vision model whose feature representations are text can effectively classify ImageNet images.
arXiv Detail & Related papers (2023-06-29T00:24:42Z)
- I See Dead People: Gray-Box Adversarial Attack on Image-To-Text Models [0.0]
We present a gray-box adversarial attack on image-to-text models, covering both untargeted and targeted settings.
Our attack operates in a gray-box manner, requiring no knowledge about the decoder module.
We also show that our attacks fool the popular open-source platform Hugging Face.
arXiv Detail & Related papers (2023-06-13T07:35:28Z)
- Image Captioning based on Feature Refinement and Reflective Decoding [0.0]
This paper introduces an encoder-decoder-based image captioning system.
It extracts spatial and global features for each region in the image using Faster R-CNN with a ResNet-101 backbone.
The decoder consists of an attention-based recurrent module and a reflective attention module to enhance the decoder's ability to model long-term sequential dependencies.
arXiv Detail & Related papers (2022-06-16T07:56:28Z)
- Searching for the Essence of Adversarial Perturbations [73.96215665913797]
We show that adversarial perturbations contain human-recognizable information, which is the key conspirator responsible for a neural network's erroneous prediction.
This concept of human-recognizable information allows us to explain key features related to adversarial perturbations.
arXiv Detail & Related papers (2022-05-30T18:04:57Z)
- A Deep Neural Framework for Image Caption Generation Using GRU-Based Attention Mechanism [5.855671062331371]
This study aims to develop a system that uses a pre-trained convolutional neural network (CNN) to extract features from an image, integrates the features with an attention mechanism, and generates captions using a recurrent neural network (RNN); a minimal sketch of this encoder-attention-decoder pattern appears after this list.
On the MSCOCO dataset, the experimental results achieve competitive performance against state-of-the-art approaches.
arXiv Detail & Related papers (2022-03-03T09:47:59Z)
- Understanding the Role of Individual Units in a Deep Neural Network [85.23117441162772]
We present an analytic framework to systematically identify hidden units within image classification and image generation networks.
First, we analyze a convolutional neural network (CNN) trained on scene classification and discover units that match a diverse set of object concepts.
Second, we use a similar analytic method to analyze a generative adversarial network (GAN) model trained to generate scenes.
arXiv Detail & Related papers (2020-09-10T17:59:10Z)
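As referenced above, here is a minimal sketch of the CNN-encoder / attention / GRU-decoder captioning pattern that recurs in several of these papers. The layer sizes, the single-layer additive attention, and all names are illustrative assumptions rather than any specific paper's architecture.

```python
import torch
import torch.nn as nn
from torchvision import models

class Encoder(nn.Module):
    """CNN encoder: keeps the spatial feature grid as a set of region vectors."""
    def __init__(self):
        super().__init__()
        cnn = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.backbone = nn.Sequential(*list(cnn.children())[:-2])  # drop pool + fc

    def forward(self, images):                   # images: [B, 3, H, W]
        feats = self.backbone(images)            # [B, 2048, h, w]
        return feats.flatten(2).transpose(1, 2)  # [B, h*w, 2048] region features

class AttnGRUDecoder(nn.Module):
    """GRU decoder that attends over the encoder's region features each step."""
    def __init__(self, vocab_size, feat_dim=2048, hid_dim=512, emb_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.attn = nn.Linear(feat_dim + hid_dim, 1)   # additive attention score
        self.gru = nn.GRUCell(emb_dim + feat_dim, hid_dim)
        self.out = nn.Linear(hid_dim, vocab_size)

    def step(self, token, h, regions):
        # Score every region against the current hidden state ...
        h_tiled = h.unsqueeze(1).expand(-1, regions.size(1), -1)
        scores = self.attn(torch.cat([regions, h_tiled], dim=-1))  # [B, N, 1]
        # ... and mix the regions into a single context vector.
        context = (scores.softmax(dim=1) * regions).sum(dim=1)     # [B, 2048]
        h = self.gru(torch.cat([self.embed(token), context], dim=-1), h)
        return self.out(h), h                    # next-word logits, new state
```

A full captioner would initialize h (for instance, from the mean region feature) and call step once per output word, feeding back the previously generated token under greedy or beam-search decoding.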
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the generated content (including all information) and is not responsible for any consequences of its use.