Reading Isn't Believing: Adversarial Attacks On Multi-Modal Neurons
- URL: http://arxiv.org/abs/2103.10480v1
- Date: Thu, 18 Mar 2021 18:56:51 GMT
- Title: Reading Isn't Believing: Adversarial Attacks On Multi-Modal Neurons
- Authors: David A. Noever, Samantha E. Miller Noever
- Abstract summary: We show that contradictory text and image signals can confuse the model into choosing false (visual) options.
We show by example that the CLIP model tends to "read first, look later," a phenomenon we describe as "reading isn't believing."
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: With OpenAI's release of their CLIP model (Contrastive Language-Image
Pre-training), multi-modal neural networks now provide accessible models that
combine reading with visual recognition. Their network offers novel ways to
probe its dual abilities to read text while classifying visual objects. This
paper demonstrates several new categories of adversarial attacks, spanning
basic typographical, conceptual, and iconographic inputs generated to fool the
model into making false or absurd classifications. We demonstrate that
contradictory text and image signals can confuse the model into choosing false
(visual) options. Like previous authors, we show by example that the CLIP model
tends to "read first, look later," a phenomenon we describe as "reading isn't
believing."
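The basic typographic attack described above can be sketched in a few lines: a contradictory text label is overlaid on an image before it is scored by CLIP. The sketch below covers only the attack-image construction with Pillow; the function name and layout parameters are illustrative assumptions, and the CLIP zero-shot scoring step is omitted.

```python
from PIL import Image, ImageDraw

def add_typographic_label(image: Image.Image, text: str) -> Image.Image:
    """Overlay a contradictory text label onto a copy of an image.

    This is the core of a basic typographic attack: since CLIP tends to
    "read first, look later," the pasted word can override the visual
    content during zero-shot classification.
    """
    attacked = image.copy()
    draw = ImageDraw.Draw(attacked)
    # White box with black text in the top-left corner; the paper's
    # attacks use printed or handwritten labels in the same spirit.
    draw.rectangle([4, 4, 4 + 8 * len(text), 24], fill="white")
    draw.text((8, 8), text, fill="black")
    return attacked

# Example: a placeholder red square standing in for a photo of an apple,
# mislabeled with the word "iPod".
apple = Image.new("RGB", (224, 224), color=(180, 30, 30))
attacked = add_typographic_label(apple, "iPod")
```

In the full attack, both the clean and attacked images would be passed through a CLIP image encoder and compared against text prompts such as "a photo of an apple" and "a photo of an iPod"; the paper's finding is that the pasted word frequently wins.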
Related papers
- Déjà Vu? Decoding Repeated Reading from Eye Movements [1.1652979442763178]
We ask whether it is possible to automatically determine whether the reader has previously encountered a text based on their eye movement patterns.
We introduce two variants of this task and address them with considerable success using both feature-based and neural models.
We present an analysis of model performance which on the one hand yields insights on the information used by the models, and on the other hand leverages predictive modeling as an analytic tool for better characterization of the role of memory in repeated reading.
arXiv Detail & Related papers (2025-02-16T09:59:29Z)
- Seeing in Words: Learning to Classify through Language Bottlenecks [59.97827889540685]
Humans can explain their predictions using succinct and intuitive descriptions.
We show that a vision model whose feature representations are text can effectively classify ImageNet images.
arXiv Detail & Related papers (2023-06-29T00:24:42Z)
- Text-to-Image Diffusion Models are Zero-Shot Classifiers [8.26990105697146]
We investigate text-to-image diffusion models by proposing a method for evaluating them as zero-shot classifiers.
We apply our method to Stable Diffusion and Imagen, using it to probe fine-grained aspects of the models' knowledge.
They perform competitively with CLIP on a wide range of zero-shot image classification datasets.
arXiv Detail & Related papers (2023-03-27T14:15:17Z)
- Freestyle Layout-to-Image Synthesis [42.64485133926378]
In this work, we explore the freestyle capability of the model, i.e., how far can it generate unseen semantics onto a given layout.
Inspired by this, we opt to leverage large-scale pre-trained text-to-image diffusion models to achieve the generation of unseen semantics.
The proposed diffusion network produces realistic and freestyle layout-to-image generation results with diverse text inputs.
arXiv Detail & Related papers (2023-03-25T09:37:41Z)
- Learnable Visual Words for Interpretable Image Recognition [70.85686267987744]
We propose the Learnable Visual Words (LVW) to interpret the model prediction behaviors with two novel modules.
The semantic visual words learning relaxes the category-specific constraint, enabling the general visual words shared across different categories.
Our experiments on six visual benchmarks demonstrate the superior effectiveness of our proposed LVW in both accuracy and model interpretation.
arXiv Detail & Related papers (2022-05-22T03:24:45Z)
- A Computational Acquisition Model for Multimodal Word Categorization [35.82822305925811]
We present a cognitively-inspired, multimodal acquisition model, trained from image-caption pairs on naturalistic data using cross-modal self-supervision.
We show that the model learns word categories and object recognition abilities, and presents trends reminiscent of those reported in the developmental literature.
arXiv Detail & Related papers (2022-05-12T09:28:55Z)
- A Simple Long-Tailed Recognition Baseline via Vision-Language Model [92.2866546058082]
The visual world naturally exhibits a long-tailed distribution of open classes, which poses great challenges to modern visual systems.
Recent advances in contrastive visual-language pretraining shed light on a new pathway for visual recognition.
We propose BALLAD to leverage contrastive vision-language models for long-tailed recognition.
arXiv Detail & Related papers (2021-11-29T17:49:24Z)
- Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic [72.60554897161948]
Recent text-to-image matching models apply contrastive learning to large corpora of uncurated pairs of images and sentences.
In this work, we repurpose such models to generate a descriptive text given an image at inference time.
The resulting captions are much less restrictive than those obtained by supervised captioning methods.
arXiv Detail & Related papers (2021-11-29T11:01:49Z)
- Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition [80.446770909975]
Linguistic knowledge is of great benefit to scene text recognition.
How to effectively model linguistic rules in end-to-end deep networks remains a research challenge.
We propose an autonomous, bidirectional and iterative ABINet for scene text recognition.
arXiv Detail & Related papers (2021-03-11T06:47:45Z)
- This is not the Texture you are looking for! Introducing Novel Counterfactual Explanations for Non-Experts using Generative Adversarial Learning [59.17685450892182]
Counterfactual explanation systems try to enable counterfactual reasoning by modifying the input image.
We present a novel approach to generate such counterfactual image explanations based on adversarial image-to-image translation techniques.
Our results show that our approach leads to significantly better results regarding mental models, explanation satisfaction, trust, emotions, and self-efficacy than two state-of-the-art systems.
arXiv Detail & Related papers (2020-12-22T10:08:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.