Multimodal Neurons in Pretrained Text-Only Transformers
- URL: http://arxiv.org/abs/2308.01544v2
- Date: Sun, 1 Oct 2023 23:24:13 GMT
- Title: Multimodal Neurons in Pretrained Text-Only Transformers
- Authors: Sarah Schwettmann, Neil Chowdhury, Samuel Klein, David Bau, Antonio Torralba
- Abstract summary: We identify "multimodal neurons" that convert visual representations into corresponding text.
We show that multimodal neurons operate on specific visual concepts across inputs, and have a systematic causal effect on image captioning.
- Score: 52.20828443544296
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Language models demonstrate remarkable capacity to generalize representations
learned in one modality to downstream tasks in other modalities. Can we trace
this ability to individual neurons? We study the case where a frozen text
transformer is augmented with vision using a self-supervised visual encoder and
a single linear projection learned on an image-to-text task. Outputs of the
projection layer are not immediately decodable into language describing image
content; instead, we find that translation between modalities occurs deeper
within the transformer. We introduce a procedure for identifying "multimodal
neurons" that convert visual representations into corresponding text, and
decoding the concepts they inject into the model's residual stream. In a series
of experiments, we show that multimodal neurons operate on specific visual
concepts across inputs, and have a systematic causal effect on image
captioning.
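A minimal sketch can make the decoding step concrete. One standard way to read out what an MLP neuron writes to the residual stream is to project the neuron's output weights through the model's unembedding matrix (a logit-lens-style readout). GPT-2 below is only a stand-in for the frozen transformer, and the layer index, neuron index, and omission of the final layer norm are simplifying assumptions, not the authors' exact procedure:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Illustrative stand-in for the frozen text-only transformer.
model = GPT2LMHeadModel.from_pretrained("gpt2")
tok = GPT2Tokenizer.from_pretrained("gpt2")

def decode_neuron(layer: int, neuron: int, top_k: int = 10):
    """Project one MLP neuron's output weights into vocabulary space.

    The tokens with the largest logits approximate the concept the
    neuron injects into the residual stream when it fires.
    """
    # c_proj.weight has shape (d_mlp, d_model); row `neuron` is the
    # direction this neuron adds to the residual stream.
    w_out = model.transformer.h[layer].mlp.c_proj.weight[neuron]
    logits = model.lm_head.weight @ w_out  # (vocab_size,)
    top = torch.topk(logits, top_k).indices
    return [tok.decode(int(i)) for i in top]

# Hypothetical indices; in practice, candidate neurons would first be
# ranked by their attribution to the image-conditioned output.
print(decode_neuron(layer=8, neuron=123))
```

Neurons whose decoded tokens consistently name the visual concept in the input image are candidates for the "multimodal neurons" the paper identifies.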
Related papers
- Vision-Language Integration in Multimodal Video Transformers (Partially) Aligns with the Brain [5.496000639803771]
We present a promising approach for probing a pre-trained multimodal video transformer model by leveraging neuroscientific evidence of multimodal information processing in the brain.
We find evidence that vision enhances masked-prediction performance during language processing, supporting the claim that cross-modal representations in models can benefit individual modalities.
We show that the brain alignment of the pre-trained joint representation can be improved by fine-tuning using a task that requires vision-language inferences.
arXiv Detail & Related papers (2023-11-13T21:32:37Z)
- DeViL: Decoding Vision features into Language [53.88202366696955]
Post-hoc explanation methods have often been criticised for abstracting away the decision-making process of deep neural networks.
In this work, we provide natural language descriptions of what different layers of a vision backbone have learned.
We train a transformer network to translate individual image features of any vision layer into a prompt that a separate off-the-shelf language model decodes into natural language.
arXiv Detail & Related papers (2023-09-04T13:59:55Z)
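The DeViL recipe above trains a translator from vision features to a prompt that a frozen, off-the-shelf language model decodes. A minimal sketch of that idea, mapping one feature vector to a few soft prompt embeddings prepended to a frozen GPT-2 (the translator architecture, feature dimension, and prompt length are illustrative assumptions, not the paper's design):

```python
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel

class FeatureTranslator(nn.Module):
    """Map one vision-layer feature vector to k soft prompt embeddings."""
    def __init__(self, feat_dim: int, lm_dim: int, k: int = 4):
        super().__init__()
        self.k, self.lm_dim = k, lm_dim
        self.net = nn.Sequential(
            nn.Linear(feat_dim, lm_dim * k),
            nn.GELU(),
            nn.Linear(lm_dim * k, lm_dim * k),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, feat_dim) -> (batch, k, lm_dim)
        return self.net(feats).view(-1, self.k, self.lm_dim)

lm = GPT2LMHeadModel.from_pretrained("gpt2")
for p in lm.parameters():            # the language model stays frozen;
    p.requires_grad = False          # only the translator is trained

translator = FeatureTranslator(feat_dim=2048, lm_dim=lm.config.n_embd)
feats = torch.randn(1, 2048)         # stand-in for a vision-layer feature
soft_prompt = translator(feats)      # (1, 4, 768)
out = lm(inputs_embeds=soft_prompt)  # LM decodes the description from here
```

In training, the translator would be optimized with a captioning-style language-modeling loss while the LM's weights remain fixed.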
- A Multimodal Visual Encoding Model Aided by Introducing Verbal Semantic Information [5.142858130898767]
Previous visual encoding models did not incorporate verbal semantic information, an omission at odds with biological findings.
This paper proposes a multimodal visual information encoding network model based on stimulus images and associated textual information.
Experimental results demonstrate that the proposed multimodal visual information encoding network model outperforms previous models.
arXiv Detail & Related papers (2023-08-29T09:21:48Z)
- Brain encoding models based on multimodal transformers can transfer across language and vision [60.72020004771044]
We used representations from multimodal transformers to train encoding models that can transfer across fMRI responses to stories and movies.
We found that encoding models trained on brain responses to one modality can successfully predict brain responses to the other modality.
arXiv Detail & Related papers (2023-05-20T17:38:44Z)
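Encoding models of the kind described above are commonly regularized linear maps from model features to voxel responses, so cross-modal transfer can be tested by fitting on one modality and predicting the other. A minimal sketch under that assumption, with synthetic arrays standing in for transformer features and fMRI data (not the paper's actual pipeline):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Stand-ins: multimodal-transformer features for story transcripts and
# movie frames, plus the corresponding fMRI voxel responses.
X_story, Y_story = rng.normal(size=(500, 768)), rng.normal(size=(500, 1000))
X_movie, Y_movie = rng.normal(size=(200, 768)), rng.normal(size=(200, 1000))

# Fit the encoding model on one modality (stories)...
model = Ridge(alpha=100.0).fit(X_story, Y_story)

# ...and test cross-modal transfer by predicting the other (movies).
Y_pred = model.predict(X_movie)

# Score each voxel by the correlation between predicted and measured
# responses; transfer succeeds where correlations beat chance.
r = np.array([np.corrcoef(Y_pred[:, v], Y_movie[:, v])[0, 1]
              for v in range(Y_movie.shape[1])])
print(f"mean voxelwise r = {r.mean():.3f}")
```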
- BrainBERT: Self-supervised representation learning for intracranial recordings [18.52962864519609]
We create a reusable Transformer, BrainBERT, for intracranial recordings bringing modern representation learning approaches to neuroscience.
As in NLP and speech recognition, this Transformer enables classifying complex concepts with higher accuracy and much less data.
In the future, far more concepts will be decodable from neural recordings by using representation learning, potentially unlocking the brain like language models unlocked language.
arXiv Detail & Related papers (2023-02-28T07:40:37Z)
- Universal Multimodal Representation for Language Understanding [110.98786673598015]
This work presents new methods to employ visual information as assistant signals to general NLP tasks.
For each sentence, we first retrieve a flexible number of images from a light topic-image lookup table extracted over existing sentence-image pairs.
Then, the text and images are encoded by a Transformer encoder and convolutional neural network, respectively.
arXiv Detail & Related papers (2023-01-09T13:54:11Z)
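The retrieval step above amounts to mapping a sentence's topic words to images already associated with them. A toy sketch of such a topic-image lookup table (the table contents, whitespace tokenization, and image cap are all hypothetical):

```python
# Hypothetical lookup table built offline from existing
# sentence-image pairs: topic word -> image paths seen with it.
topic_to_images = {
    "dog": ["imgs/dog_01.jpg", "imgs/dog_07.jpg"],
    "beach": ["imgs/beach_03.jpg"],
}

def retrieve_images(sentence: str, max_images: int = 3) -> list[str]:
    """Collect up to max_images images whose topics occur in the sentence."""
    images = []
    for token in sentence.lower().split():
        images.extend(topic_to_images.get(token, []))
    return images[:max_images]

print(retrieve_images("A dog runs on the beach"))
# ['imgs/dog_01.jpg', 'imgs/dog_07.jpg', 'imgs/beach_03.jpg']
```

The retrieved images would then be encoded by a CNN and the sentence by a Transformer encoder, as the entry describes.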
- Multi-Modal Masked Autoencoders for Medical Vision-and-Language Pre-Training [62.215025958347105]
We propose a self-supervised learning paradigm with multi-modal masked autoencoders.
We learn cross-modal domain knowledge by reconstructing missing pixels and tokens from randomly masked images and texts.
arXiv Detail & Related papers (2022-09-15T07:26:43Z)
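The pretraining objective summarized above reduces to one reconstruction loss over randomly masked image patches and text tokens. A toy sketch of that joint masking step (the mask ratios, feature sizes, and two-layer encoder are illustrative assumptions, not the paper's architecture):

```python
import torch
import torch.nn as nn

def random_mask(x: torch.Tensor, ratio: float) -> torch.Tensor:
    """Boolean mask selecting a `ratio` fraction of positions to hide."""
    return torch.rand(x.shape[:2]) < ratio

# Toy stand-ins: 16 image patch embeddings and 12 text token embeddings.
img_patches = torch.randn(1, 16, 64)
txt_tokens = torch.randn(1, 12, 64)

img_mask = random_mask(img_patches, ratio=0.75)   # images masked heavily
txt_mask = random_mask(txt_tokens, ratio=0.15)    # text masked lightly

# Zero out masked positions, then reconstruct from the visible context.
visible = torch.cat([img_patches * ~img_mask.unsqueeze(-1),
                     txt_tokens * ~txt_mask.unsqueeze(-1)], dim=1)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
    num_layers=2)
recon = encoder(visible)

# Cross-modal loss: penalize errors only on the masked pixels and tokens.
target = torch.cat([img_patches, txt_tokens], dim=1)
mask = torch.cat([img_mask, txt_mask], dim=1).unsqueeze(-1)
loss = ((recon - target) ** 2 * mask).sum() / mask.sum()
loss.backward()
```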
- Visio-Linguistic Brain Encoding [3.944020612420711]
We systematically explore the efficacy of image Transformers and multi-modal Transformers for brain encoding.
We find that VisualBERT, a multi-modal Transformer, significantly outperforms previously proposed single-mode CNNs.
The supremacy of visio-linguistic models raises the question of whether the responses elicited in the visual regions are affected implicitly by linguistic processing.
arXiv Detail & Related papers (2022-04-18T11:28:18Z)
- Controlled Caption Generation for Images Through Adversarial Attacks [85.66266989600572]
We study adversarial examples for vision-and-language models, which typically adopt a convolutional neural network (CNN) for image feature extraction and a recurrent neural network (RNN) for caption generation.
In particular, we investigate attacks on the visual encoder's hidden layer that is fed to the subsequent recurrent network.
We propose a GAN-based algorithm for crafting adversarial examples for neural image captioning that mimics the internal representation of the CNN.
arXiv Detail & Related papers (2021-07-07T07:22:41Z)
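The attack summarized above manipulates the caption by making the CNN's internal representation of the adversarial image mimic that of a target. The sketch below optimizes a bounded perturbation toward that feature-matching objective; note the paper crafts perturbations with a GAN, whereas this sketch substitutes plain gradient descent, and the ResNet-50 encoder is a stand-in:

```python
import torch
import torchvision.models as models

# Stand-in visual encoder; a captioning RNN would consume these features.
encoder = models.resnet50(weights=None).eval()
feature_extractor = torch.nn.Sequential(*list(encoder.children())[:-1])

image = torch.rand(1, 3, 224, 224)    # clean input image
target = torch.rand(1, 3, 224, 224)   # image whose caption we want to mimic
with torch.no_grad():
    target_feats = feature_extractor(target)

delta = torch.zeros_like(image, requires_grad=True)
opt = torch.optim.Adam([delta], lr=0.01)
eps = 8 / 255                          # perturbation budget

for step in range(100):
    opt.zero_grad()
    feats = feature_extractor((image + delta).clamp(0, 1))
    # Pull the adversarial image's features toward the target's.
    loss = torch.nn.functional.mse_loss(feats, target_feats)
    loss.backward()
    opt.step()
    with torch.no_grad():
        delta.clamp_(-eps, eps)        # keep the perturbation imperceptible

adversarial = (image + delta).detach().clamp(0, 1)
```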
This list is automatically generated from the titles and abstracts of the papers listed on this site.