I Can't Believe There's No Images! Learning Visual Tasks Using only
Language Supervision
- URL: http://arxiv.org/abs/2211.09778v4
- Date: Fri, 18 Aug 2023 23:43:42 GMT
- Title: I Can't Believe There's No Images! Learning Visual Tasks Using only
Language Supervision
- Authors: Sophia Gu, Christopher Clark, Aniruddha Kembhavi
- Abstract summary: We produce models using only text training data on four representative tasks.
We find these models perform close to models trained on images.
We showcase a variety of stylistic image captioning models that are trained using no image data and no human-curated language data.
- Score: 32.49636188029509
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Many high-level skills that are required for computer vision tasks, such as
parsing questions, comparing and contrasting semantics, and writing
descriptions, are also required in other domains such as natural language
processing. In this paper, we ask whether it is possible to learn those skills
from text data and then transfer them to vision tasks without ever training on
visual training data. Key to our approach is exploiting the joint embedding
space of contrastively trained vision and language encoders. In practice, there
can be systematic differences between embedding spaces for different modalities
in contrastive models, and we analyze how these differences affect our approach
and study strategies to mitigate this concern. We produce models using only
text training data on four representative tasks: image captioning, visual
entailment, visual question answering and visual news captioning, and evaluate
them on standard benchmarks using images. We find these models perform close to
models trained on images, while surpassing prior work for captioning and visual
entailment in this text-only setting by over 9 points, and outperforming all
prior work on visual news by over 30 points. We also showcase a variety of
stylistic image captioning models that are trained using no image data and no
human-curated language data, but instead using readily-available text data from
books, the web, or language models.
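The approach described above can be made concrete with a short sketch: because a contrastively trained model such as CLIP embeds images and text in a roughly shared space, a task head can be trained purely on text embeddings and then fed image embeddings at test time. The sketch below is a minimal illustration, assuming the OpenAI `clip` package and PyTorch, framing the task as simple label prediction, and using Gaussian noise on the text embeddings as one possible way to reduce the modality gap; `text_loader`, `image_batch`, and all hyperparameters are hypothetical placeholders, not the authors' exact setup.
```python
# Minimal sketch: train a task head on CLIP *text* embeddings only, then run it
# on CLIP *image* embeddings at inference. Assumes the OpenAI `clip` package and
# PyTorch; `text_loader` / `image_batch` are hypothetical placeholders.
import torch
import torch.nn as nn
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)

head = nn.Sequential(              # stand-in head for a label-prediction task
    nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 3)
).to(device)
opt = torch.optim.AdamW(head.parameters(), lr=1e-4)
noise_std = 0.1                    # Gaussian noise: one way to bridge the modality gap

def embed_text(sentences):
    with torch.no_grad():
        z = clip_model.encode_text(clip.tokenize(sentences, truncate=True).to(device)).float()
    return z / z.norm(dim=-1, keepdim=True)

# --- training: text only, no images anywhere ---
for sentences, labels in text_loader:              # hypothetical text-only dataloader
    z = embed_text(sentences)
    z = z + noise_std * torch.randn_like(z)        # blur text/image embedding differences
    loss = nn.functional.cross_entropy(head(z), labels.to(device))
    opt.zero_grad()
    loss.backward()
    opt.step()

# --- inference: swap in image embeddings from the same joint space ---
with torch.no_grad():
    z_img = clip_model.encode_image(image_batch.to(device)).float()  # hypothetical image batch
    z_img = z_img / z_img.norm(dim=-1, keepdim=True)
    predictions = head(z_img).argmax(dim=-1)
```
The key property is that the head never sees an image embedding during training; how well the transfer works depends on how closely the text and image embedding distributions match, which is exactly the systematic difference the abstract analyzes and mitigates.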
Related papers
- Localization vs. Semantics: Visual Representations in Unimodal and Multimodal Models [57.08925810659545]
We conduct a comparative analysis of the visual representations in existing vision-and-language models and vision-only models.
Our empirical observations suggest that vision-and-language models are better at label prediction tasks.
We hope our study sheds light on the role of language in visual learning, and serves as an empirical guide for various pretrained models.
arXiv Detail & Related papers (2022-12-01T05:00:18Z)
- On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, CLIP image representations and scaling of language models, do not consistently improve multimodal self-rationalization of tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)
- Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic [72.60554897161948]
Recent text-to-image matching models apply contrastive learning to large corpora of uncurated pairs of images and sentences.
In this work, we repurpose such models to generate a descriptive text given an image at inference time.
The resulting captions are much less restrictive than those obtained by supervised captioning methods (a simplified sketch of this idea appears after this list).
arXiv Detail & Related papers (2021-11-29T11:01:49Z)
- From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network [70.47504933083218]
We propose a Visual Language Modeling Network (VisionLAN), which views the visual and linguistic information as a union.
VisionLAN significantly improves the speed by 39% and adaptively considers the linguistic information to enhance the visual features for accurate recognition.
arXiv Detail & Related papers (2021-08-22T07:56:24Z)
- Align before Fuse: Vision and Language Representation Learning with Momentum Distillation [52.40490994871753]
We introduce a contrastive loss to ALign the image and text representations BEfore Fusing (ALBEF) them through cross-modal attention.
We propose momentum distillation, a self-training method which learns from pseudo-targets produced by a momentum model.
ALBEF achieves state-of-the-art performance on multiple downstream vision-language tasks.
arXiv Detail & Related papers (2021-07-16T00:19:22Z)
- Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps.
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss (a minimal sketch of this objective appears after this list).
We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
arXiv Detail & Related papers (2021-02-11T10:08:12Z)
- Learning Visual Representations with Caption Annotations [19.24013129952071]
We propose a proxy task to learn visual representations over image-caption pairs.
ICMLM (image-conditioned masked language modeling) consists of predicting masked words in captions by relying on visual cues.
Our experiments confirm that image captions can be leveraged to inject global and localized semantic information into visual representations.
arXiv Detail & Related papers (2020-08-04T08:04:16Z)
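Several entries above, and the joint embedding space the main paper builds on, rely on the same dual-encoder contrastive objective: embed images and text separately, then apply a symmetric cross-entropy over the batch's cosine similarities so that matched pairs score higher than mismatched ones. Below is a minimal PyTorch sketch; the temperature and batch size are illustrative assumptions, and `img_emb` / `txt_emb` stand in for already L2-normalised encoder outputs.
```python
# Sketch of the symmetric (InfoNCE-style) contrastive loss used by CLIP/ALIGN-style
# dual encoders. Assumes PyTorch; `img_emb` and `txt_emb` are hypothetical
# [batch, dim] tensors that are already L2-normalised.
import torch
import torch.nn.functional as F

def dual_encoder_contrastive_loss(img_emb: torch.Tensor,
                                  txt_emb: torch.Tensor,
                                  temperature: float = 0.07) -> torch.Tensor:
    # Cosine similarities between every image and every text in the batch.
    logits = img_emb @ txt_emb.t() / temperature           # [batch, batch]
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matched pairs sit on the diagonal: image i should retrieve text i and vice versa.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)

# Usage with random stand-in embeddings:
img_emb = F.normalize(torch.randn(8, 512), dim=-1)
txt_emb = F.normalize(torch.randn(8, 512), dim=-1)
print(dual_encoder_contrastive_loss(img_emb, txt_emb))
```
Training with this objective is what produces the shared embedding space that the text-only transfer sketch earlier in this page depends on.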
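The Zero-Shot Image-to-Text entry above repurposes a contrastive matching model to produce captions at inference time. The sketch below shows a deliberately simplified rerank-based variant of that idea: sample candidate sentences from an off-the-shelf language model and keep the one CLIP scores highest against the image. It illustrates using the joint embedding space for generation, and is not necessarily the cited paper's exact decoding procedure; the prompt, model choices, and `image` input are assumptions for illustration.
```python
# Simplified zero-shot captioning: sample candidates from GPT-2, keep the one
# CLIP scores highest against the image. Assumes the `clip` and `transformers`
# packages; `image` is a hypothetical PIL.Image. Not the cited paper's exact method.
import torch
import clip
from transformers import GPT2LMHeadModel, GPT2Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)
lm = GPT2LMHeadModel.from_pretrained("gpt2").to(device)
lm_tok = GPT2Tokenizer.from_pretrained("gpt2")

def caption(image, prompt="A photo of", n_candidates=32):
    # 1) Sample candidate captions from the language model.
    inputs = lm_tok(prompt, return_tensors="pt").to(device)
    outs = lm.generate(**inputs, do_sample=True, top_p=0.9, max_new_tokens=15,
                       num_return_sequences=n_candidates,
                       pad_token_id=lm_tok.eos_token_id)
    candidates = [lm_tok.decode(o, skip_special_tokens=True) for o in outs]

    # 2) Score every candidate against the image in CLIP's joint embedding space.
    with torch.no_grad():
        img = clip_model.encode_image(preprocess(image).unsqueeze(0).to(device)).float()
        txt = clip_model.encode_text(clip.tokenize(candidates, truncate=True).to(device)).float()
        img = img / img.norm(dim=-1, keepdim=True)
        txt = txt / txt.norm(dim=-1, keepdim=True)
        scores = (txt @ img.t()).squeeze(1)

    # 3) Return the candidate CLIP likes best.
    return candidates[scores.argmax().item()]
```
Sampling more candidates trades compute for caption quality; steering the language model during decoding, as the cited line of work does, avoids this rerank bottleneck.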