VirTex: Learning Visual Representations from Textual Annotations
- URL: http://arxiv.org/abs/2006.06666v3
- Date: Sat, 25 Sep 2021 23:45:16 GMT
- Title: VirTex: Learning Visual Representations from Textual Annotations
- Authors: Karan Desai, Justin Johnson
- Abstract summary: VirTex is a pretraining approach using semantically dense captions to learn visual representations.
We train convolutional networks from scratch on COCO Captions, and transfer them to downstream recognition tasks.
On all tasks, VirTex yields features that match or exceed those learned on ImageNet -- supervised or unsupervised.
- Score: 25.104705278771895
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The de facto approach to many vision tasks is to start from pretrained visual
representations, typically learned via supervised training on ImageNet. Recent
methods have explored unsupervised pretraining to scale to vast quantities of
unlabeled images. In contrast, we aim to learn high-quality visual
representations from fewer images. To this end, we revisit supervised
pretraining, and seek data-efficient alternatives to classification-based
pretraining. We propose VirTex -- a pretraining approach using semantically
dense captions to learn visual representations. We train convolutional networks
from scratch on COCO Captions, and transfer them to downstream recognition
tasks including image classification, object detection, and instance
segmentation. On all tasks, VirTex yields features that match or exceed those
learned on ImageNet -- supervised or unsupervised -- despite using up to ten
times fewer images.
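As a concrete illustration of the pretraining objective described in the abstract, below is a minimal sketch of caption-prediction pretraining in PyTorch. The GRU textual head, module sizes, and loss details are simplified stand-ins rather than the authors' exact configuration (VirTex itself pairs a from-scratch convolutional backbone with Transformer-based captioning heads); only the overall pattern, i.e. train the backbone by predicting captions and then transfer it, is meant to carry over.

```python
# Minimal sketch of caption-prediction pretraining (PyTorch).
# Module choices and hyperparameters are illustrative, not the paper's exact setup.
import torch
import torch.nn as nn
import torchvision


class CaptioningPretrainModel(nn.Module):
    def __init__(self, vocab_size, hidden_dim=512):
        super().__init__()
        # Visual backbone trained from scratch (no ImageNet weights).
        self.backbone = torchvision.models.resnet50(weights=None)
        self.backbone.fc = nn.Identity()              # keep 2048-d pooled features
        self.project = nn.Linear(2048, hidden_dim)
        # Simple autoregressive textual head (a stand-in for the paper's
        # Transformer-based heads).
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.decoder = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, caption_tokens):
        # images: (B, 3, H, W); caption_tokens: (B, T) integer token ids.
        visual = self.project(self.backbone(images))            # (B, D)
        inputs = self.embed(caption_tokens[:, :-1])             # teacher forcing
        states, _ = self.decoder(inputs, visual.unsqueeze(0))   # init state from image
        logits = self.out(states)                               # (B, T-1, vocab)
        # Next-token prediction loss over the caption (padding handling omitted).
        return nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            caption_tokens[:, 1:].reshape(-1),
        )


# After pretraining, only the visual backbone is kept and transferred to
# downstream classification, detection, and segmentation heads.
```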
Related papers
- Semantic-Aware Generation for Self-Supervised Visual Representation Learning [116.5814634936371]
We advocate for Semantic-aware Generation (SaGe), which encourages richer semantics, rather than low-level details, to be preserved in the generated image.
SaGe complements the target network with view-specific features and thus alleviates the semantic degradation brought by intensive data augmentations.
We train SaGe on ImageNet-1K and evaluate the pre-trained models on five downstream tasks, including nearest-neighbor test, linear classification, and fine-grained image recognition.
arXiv Detail & Related papers (2021-11-25T16:46:13Z)
- LocTex: Learning Data-Efficient Visual Representations from Localized Textual Supervision [33.81468149305518]
LocTex takes advantage of the low-cost localized textual annotations to reduce the annotation effort.
Compared with ImageNet supervised pre-training, LocTex can reduce the size of the pre-training dataset by 10x or the target dataset by 2x.
arXiv Detail & Related papers (2021-08-26T17:59:07Z)
- Unsupervised Object-Level Representation Learning from Scene Images [97.07686358706397]
Object-level Representation Learning (ORL) is a new self-supervised learning framework for scene images.
Our key insight is to leverage image-level self-supervised pre-training as the prior to discover object-level semantic correspondence.
ORL significantly improves the performance of self-supervised learning on scene images, even surpassing supervised ImageNet pre-training on several downstream tasks.
arXiv Detail & Related papers (2021-06-22T17:51:24Z)
- Learning Transferable Visual Models From Natural Language Supervision [13.866297967166089]
Learning directly from raw text about images is a promising alternative.
We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn.
SOTA image representations are learned from scratch on a dataset of 400 million (image, text) pairs collected from the internet.
arXiv Detail & Related papers (2021-02-26T19:04:58Z)
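The caption-to-image matching objective summarized in this entry, like the dual-encoder contrastive setup in the next one, reduces to a symmetric cross-entropy over a batch similarity matrix. The sketch below is a generic version of that loss; the function name, temperature value, and the assumption that two encoders already produce (B, D) embeddings are illustrative, not either paper's implementation.

```python
# Generic symmetric image-text contrastive loss (PyTorch).
# Assumes separate image/text encoders have already produced (B, D) embeddings.
import torch
import torch.nn.functional as F


def image_text_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Normalize so that dot products are cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(image_emb.size(0), device=image_emb.device)
    # Each image should match its own caption, and vice versa.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```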
- Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps.
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss.
We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
arXiv Detail & Related papers (2021-02-11T10:08:12Z)
- Learning Visual Representations with Caption Annotations [19.24013129952071]
We propose a proxy task to learn visual representations from image-caption pairs.
ICMLM consists of predicting masked words in captions by relying on visual cues.
Our experiments confirm that image captions can be leveraged to inject global and localized semantic information into visual representations.
arXiv Detail & Related papers (2020-08-04T08:04:16Z)
- Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks [207.52609682812147]
We propose a new learning method, Oscar (Object-Semantics Aligned Pre-training).
It uses object tags detected in images as anchor points to significantly ease the learning of alignments.
We pre-train an Oscar model on the public corpus of 6.5 million text-image pairs, and fine-tune it on downstream tasks.
arXiv Detail & Related papers (2020-04-13T19:18:10Z)
- Self-Supervised Viewpoint Learning From Image Collections [116.56304441362994]
We propose a novel learning framework which incorporates an analysis-by-synthesis paradigm to reconstruct images in a viewpoint-aware manner.
We show that our approach performs competitively with fully-supervised approaches for several object categories, including human faces, cars, buses, and trains.
arXiv Detail & Related papers (2020-04-03T22:01:41Z)
- Learning Representations by Predicting Bags of Visual Words [55.332200948110895]
Self-supervised representation learning aims to learn convnet-based image representations from unlabeled data.
Inspired by the success of NLP methods in this area, in this work we propose a self-supervised approach based on spatially dense image descriptions.
arXiv Detail & Related papers (2020-02-27T16:45:25Z)
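To make the bag-of-visual-words idea in the last entry more concrete, here is a rough sketch: local features from a frozen reference network are quantized against a codebook of visual words, and a second network is trained to predict the resulting word distribution. The codebook construction, tensor shapes, and loss form are assumptions for illustration, not the paper's exact recipe.

```python
# Rough sketch of bag-of-visual-words (BoVW) prediction targets and loss (PyTorch).
# The frozen feature extractor and k-means codebook are assumed to exist already.
import torch
import torch.nn.functional as F


def bovw_targets(feature_map, codebook):
    # feature_map: (B, C, H, W) local features from a frozen reference network.
    # codebook:    (K, C) visual-word centroids (e.g. from k-means).
    B, C, H, W = feature_map.shape
    feats = feature_map.permute(0, 2, 3, 1).reshape(B, H * W, C)
    # Assign each local feature to its nearest visual word (squared distances).
    dists = (feats.pow(2).sum(-1, keepdim=True)
             - 2 * feats @ codebook.t()
             + codebook.pow(2).sum(-1))
    words = dists.argmin(dim=-1)                                   # (B, H*W)
    hist = F.one_hot(words, codebook.size(0)).float().sum(dim=1)   # (B, K)
    return hist / hist.sum(dim=-1, keepdim=True)                   # normalized BoW


def bovw_loss(predicted_logits, target_hist):
    # The trained network predicts the soft visual-word distribution of an image.
    return -(target_hist * F.log_softmax(predicted_logits, dim=-1)).sum(-1).mean()
```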
This list is automatically generated from the titles and abstracts of the papers on this site.