Linearly Mapping from Image to Text Space
- URL: http://arxiv.org/abs/2209.15162v1
- Date: Fri, 30 Sep 2022 01:17:18 GMT
- Title: Linearly Mapping from Image to Text Space
- Authors: Jack Merullo, Louis Castricato, Carsten Eickhoff, Ellie Pavlick
- Abstract summary: We show that conceptual representations learned by text-only models are functionally equivalent to those learned by models trained on vision tasks.
We compare three image encoders with increasing amounts of linguistic supervision seen during pretraining.
We find that all three encoders perform equally well at transferring visual property information to the language model.
- Score: 22.290431852705662
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The extent to which text-only language models (LMs) learn to represent the
physical, non-linguistic world is an open question. Prior work has shown that
pretrained LMs can be taught to "understand" visual inputs when the models'
parameters are updated on image captioning tasks. We test a stronger
hypothesis: that the conceptual representations learned by text-only models are
functionally equivalent (up to a linear transformation) to those learned by
models trained on vision tasks. Specifically, we show that the image
representations from vision models can be transferred as continuous prompts to
frozen LMs by training only a single linear projection. Using these to prompt
the LM achieves competitive performance on captioning and visual question
answering tasks compared to models that tune both the image encoder and text
decoder (such as the MAGMA model). We compare three image encoders with
increasing amounts of linguistic supervision seen during pretraining: BEIT (no
linguistic information), NF-ResNET (lexical category information), and CLIP
(full natural language descriptions). We find that all three encoders perform
equally well at transferring visual property information to the language model
(e.g., whether an animal is large or small), but that image encoders pretrained
with linguistic supervision more saliently encode category information (e.g.,
distinguishing hippo vs. elephant) and thus perform significantly better on
benchmark language-and-vision tasks. Our results indicate that LMs encode
conceptual information structurally similarly to vision-based models, even
those that are solely trained on images.
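
A minimal PyTorch sketch of the mechanism described in the abstract may help make it concrete. This is not the authors' code: the dimensions, prompt length, and module names are assumptions for illustration. A single linear layer maps a frozen image encoder's pooled output to a few continuous prompt vectors in the frozen LM's input-embedding space; only that layer is trained, driven by the LM's captioning loss.

```python
import torch
import torch.nn as nn

class LinearPromptMapper(nn.Module):
    """A single trainable linear map from image-feature space to k soft-prompt
    vectors in the language model's input-embedding space (illustrative sketch)."""

    def __init__(self, image_dim: int, lm_dim: int, num_prompt_tokens: int = 4):
        super().__init__()
        # The only trainable parameters in the whole pipeline.
        self.proj = nn.Linear(image_dim, lm_dim * num_prompt_tokens)
        self.num_prompt_tokens = num_prompt_tokens
        self.lm_dim = lm_dim

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, image_dim), e.g. a pooled output of a frozen
        # BEIT / NF-ResNet / CLIP encoder.
        prompts = self.proj(image_features)
        return prompts.view(-1, self.num_prompt_tokens, self.lm_dim)

# Usage sketch: the vision encoder and the LM both stay frozen; only `mapper.proj`
# receives gradients from the LM's standard next-token captioning loss.
mapper = LinearPromptMapper(image_dim=1024, lm_dim=768, num_prompt_tokens=4)
image_features = torch.randn(2, 1024)     # stand-in for frozen image-encoder output
soft_prompts = mapper(image_features)     # (2, 4, 768), prepended to caption embeddings
```
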
Related papers
- Looking Beyond Text: Reducing Language bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance [67.26434607115392]
Large vision-language models (LVLMs) have achieved impressive results in various vision-language tasks.
However, LVLMs suffer from hallucinations caused by language bias, leading to diminished focus on images and ineffective visual comprehension.
We propose LACING, which addresses the language bias of LVLMs with a muLtimodal duAl-attention meChanIsm (MDA) aNd soft-image Guidance (IFG).
arXiv Detail & Related papers (2024-11-21T16:33:30Z)
- DeViL: Decoding Vision features into Language [53.88202366696955]
Post-hoc explanation methods have often been criticised for abstracting away the decision-making process of deep neural networks.
In this work, we would like to provide natural language descriptions for what different layers of a vision backbone have learned.
We train a transformer network to translate individual image features of any vision layer into a prompt that a separate off-the-shelf language model decodes into natural language.
arXiv Detail & Related papers (2023-09-04T13:59:55Z)
- Is Multimodal Vision Supervision Beneficial to Language? [2.216702991322677]
Vision (image and video) pre-training is a recently popular paradigm that has achieved state-of-the-art results on multi-modal tasks.
We compare the language representations of the stand-alone text encoders of these models to those of text encoders learned through vision supervision.
arXiv Detail & Related papers (2023-02-10T02:22:44Z)
- Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment [81.73717488887938]
Language-Quantized AutoEncoder (LQAE) learns to align text-image data in an unsupervised manner by leveraging pretrained language models.
LQAE learns to represent similar images with similar clusters of text tokens, thereby aligning these two modalities without the use of aligned text-image pairs.
This enables few-shot image classification with large language models (e.g., GPT-3) as well as linear classification of images based on BERT text features.
arXiv Detail & Related papers (2023-02-02T06:38:44Z)
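
The LQAE entry above describes re-expressing images as clusters of text tokens. The following is a rough, hedged sketch of that idea rather than the paper's implementation: each image patch feature is snapped to its nearest neighbour in a frozen token-embedding table, so the image becomes a sequence of ordinary token ids a pretrained LM can consume. The codebook below is a random stand-in.

```python
import torch

def quantize_to_token_ids(patch_feats: torch.Tensor, token_embeddings: torch.Tensor) -> torch.Tensor:
    """Map each image patch feature to the id of its nearest text-token embedding.

    patch_feats:      (num_patches, dim) features from an image encoder
    token_embeddings: (vocab_size, dim) frozen embedding table of a pretrained LM
    returns:          (num_patches,) token ids usable as LM input
    """
    # Euclidean distance between every patch feature and every token embedding.
    dists = torch.cdist(patch_feats, token_embeddings)   # (num_patches, vocab_size)
    return dists.argmin(dim=-1)

# Toy usage with a random codebook; a real setup would use a pretrained LM's input
# embeddings and a straight-through estimator so the image encoder can be trained.
codebook = torch.randn(30522, 768)        # vocab_size x dim
patches = torch.randn(196, 768)           # e.g. 14x14 patch features
token_ids = quantize_to_token_ids(patches, codebook)    # shape: (196,)
```
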
- I Can't Believe There's No Images! Learning Visual Tasks Using only Language Supervision [32.49636188029509]
We produce models using only text training data on four representative tasks.
We find these models perform close to models trained on images.
We showcase a variety of stylistic image captioning models that are trained using no image data and no human-curated language data.
arXiv Detail & Related papers (2022-11-17T18:52:19Z)
- Visually-Augmented Language Modeling [137.36789885105642]
We propose a novel pre-training framework, named VaLM, to Visually-augment text tokens with retrieved relevant images for Language Modeling.
With the visually-augmented context, VaLM uses a visual knowledge fusion layer to enable multimodal grounded language modeling.
We evaluate the proposed model on various multimodal commonsense reasoning tasks, which require visual information to excel.
arXiv Detail & Related papers (2022-05-20T13:41:12Z)
- Align before Fuse: Vision and Language Representation Learning with Momentum Distillation [52.40490994871753]
We introduce a contrastive loss to ALign the image and text representations BEfore Fusing (ALBEF) them through cross-modal attention.
We propose momentum distillation, a self-training method which learns from pseudo-targets produced by a momentum model.
ALBEF achieves state-of-the-art performance on multiple downstream vision-language tasks.
arXiv Detail & Related papers (2021-07-16T00:19:22Z)
- Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps.
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss.
We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
arXiv Detail & Related papers (2021-02-11T10:08:12Z)
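
The dual-encoder-plus-contrastive-loss recipe summarized in the entry above can be made concrete with a short sketch. This is a generic symmetric InfoNCE objective under assumed batch shapes and temperature, not the paper's exact loss or hyperparameters.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb: torch.Tensor, text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss: the matched image/text pair in each row is the positive,
    every other pairing in the batch is a negative."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature       # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)           # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)       # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)

# Toy usage with random embeddings standing in for the two encoders' outputs.
loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```
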
- Learning Visual Representations with Caption Annotations [19.24013129952071]
We propose a proxy task to learn visual representations over image-caption pairs.
ICMLM (image-conditioned masked language modeling) consists of predicting masked words in captions by relying on visual cues.
Our experiments confirm that image captions can be leveraged to inject global and localized semantic information into visual representations.
arXiv Detail & Related papers (2020-08-04T08:04:16Z)
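
The proxy task in the entry above (predicting masked caption words from visual cues) can be sketched as follows. This is an illustrative stand-in rather than the ICMLM architecture: the fusion layer, shapes, and vocabulary size are assumptions, and a real training run would apply the cross-entropy loss only at masked positions.

```python
import torch
import torch.nn as nn

class MaskedCaptionHead(nn.Module):
    """Predict caption tokens from token embeddings fused with a global image feature,
    so the masked-word loss pushes semantic information into the visual representation."""

    def __init__(self, vocab_size: int, dim: int):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)          # concat(token, image) -> fused vector
        self.classifier = nn.Linear(dim, vocab_size)

    def forward(self, token_emb: torch.Tensor, image_feat: torch.Tensor) -> torch.Tensor:
        # token_emb: (batch, seq_len, dim); image_feat: (batch, dim) global visual feature.
        image_ctx = image_feat.unsqueeze(1).expand(-1, token_emb.size(1), -1)
        fused = torch.tanh(self.fuse(torch.cat([token_emb, image_ctx], dim=-1)))
        return self.classifier(fused)                # (batch, seq_len, vocab_size) logits

# Toy usage; a real run would mask some caption tokens and score only those positions.
head = MaskedCaptionHead(vocab_size=30522, dim=768)
logits = head(torch.randn(2, 12, 768), torch.randn(2, 768))
```
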
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.