LocTex: Learning Data-Efficient Visual Representations from Localized Textual Supervision
- URL: http://arxiv.org/abs/2108.11950v1
- Date: Thu, 26 Aug 2021 17:59:07 GMT
- Title: LocTex: Learning Data-Efficient Visual Representations from Localized Textual Supervision
- Authors: Zhijian Liu, Simon Stent, Jie Li, John Gideon, Song Han
- Abstract summary: LocTex takes advantage of the low-cost localized textual annotations to reduce the annotation effort.
Compared with ImageNet supervised pre-training, LocTex can reduce the size of the pre-training dataset by 10x or the target dataset by 2x.
- Score: 33.81468149305518
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Computer vision tasks such as object detection and semantic/instance
segmentation rely on the painstaking annotation of large training datasets. In
this paper, we propose LocTex that takes advantage of the low-cost localized
textual annotations (i.e., captions and synchronized mouse-over gestures) to
reduce the annotation effort. We introduce a contrastive pre-training framework
between images and captions and propose to supervise the cross-modal attention
map with rendered mouse traces to provide coarse localization signals. Our
learned visual features capture rich semantics (from free-form captions) and
accurate localization (from mouse traces), which are very effective when
transferred to various downstream vision tasks. Compared with ImageNet
supervised pre-training, LocTex can reduce the size of the pre-training dataset
by 10x or the target dataset by 2x while achieving comparable or even improved
performance on COCO instance segmentation. When provided with the same amount
of annotations, LocTex achieves around 4% higher accuracy than the previous
state-of-the-art "vision+language" pre-training approach on the task of PASCAL
VOC image classification.
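For a concrete picture of the two training signals described in the abstract, here is a minimal sketch in PyTorch, assuming pooled image/caption embeddings and a spatial cross-modal attention map; the symmetric InfoNCE form, the KL term for the localization supervision, the temperature, and the equal loss weighting are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def loctex_style_losses(img_feats, txt_feats, attn_map, mouse_trace, temperature=0.07):
    """img_feats, txt_feats: (B, D) pooled image / caption embeddings.
    attn_map, mouse_trace: (B, H, W) cross-modal attention and rendered mouse-trace heatmap."""
    # Symmetric in-batch contrastive (InfoNCE) loss between images and captions.
    img = F.normalize(img_feats, dim=-1)
    txt = F.normalize(txt_feats, dim=-1)
    logits = img @ txt.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    contrastive = 0.5 * (F.cross_entropy(logits, targets) +
                         F.cross_entropy(logits.t(), targets))

    # Push the attention map toward the rendered mouse trace by comparing the
    # two as spatial distributions (the KL form here is an assumption; the
    # paper's exact localization loss may differ).
    attn = F.log_softmax(attn_map.flatten(1), dim=-1)
    trace = F.softmax(mouse_trace.flatten(1), dim=-1)
    localization = F.kl_div(attn, trace, reduction="batchmean")

    return contrastive + localization
```

The intuition, per the abstract, is that rendered mouse traces only give coarse localization, so they supervise a soft attention distribution rather than exact masks.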
Related papers
- Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via Text-Only Training [14.340740609933437]
We propose a novel zero-shot image captioning framework with text-only training to reduce the modality gap.
In particular, we introduce a subregion feature aggregation to leverage local region information.
We extend our framework to build a zero-shot VQA pipeline, demonstrating its generality.
arXiv Detail & Related papers (2024-01-04T16:43:46Z)
- SILC: Improving Vision Language Pretraining with Self-Distillation [113.50400246862056]
We introduce SILC, a novel framework for vision language pretraining.
SILC improves image-text contrastive learning with the simple addition of local-to-global correspondence learning by self-distillation.
We show that distilling local image features from an exponential moving average (EMA) teacher model significantly improves model performance on dense prediction tasks such as detection and segmentation (a brief sketch of this EMA self-distillation step is given below).
arXiv Detail & Related papers (2023-10-20T08:44:47Z)
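As referenced in the summary above, here is a hedged sketch of an EMA-teacher self-distillation step, assuming PyTorch modules; the function names, the 0.999 momentum, and the KL-based patch-matching loss are illustrative assumptions rather than SILC's exact recipe.

```python
import copy
import torch
import torch.nn.functional as F

def make_ema_teacher(student: torch.nn.Module) -> torch.nn.Module:
    """Create a frozen copy of the student to serve as the EMA teacher."""
    teacher = copy.deepcopy(student)
    for p in teacher.parameters():
        p.requires_grad_(False)
    return teacher

@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module, momentum: float = 0.999):
    """Move each teacher parameter a small step toward the student."""
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(momentum).add_(s.detach(), alpha=1.0 - momentum)

def local_distillation_loss(student_patches, teacher_patches, temperature=0.1):
    """student_patches, teacher_patches: (B, N, D) patch-level features of two
    augmented views; the student is trained to match the teacher's soft targets."""
    s = F.log_softmax(student_patches / temperature, dim=-1)
    t = F.softmax(teacher_patches.detach() / temperature, dim=-1)
    return F.kl_div(s, t, reduction="batchmean")
```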
- FACTUAL: A Benchmark for Faithful and Consistent Textual Scene Graph Parsing [66.70054075041487]
Existing scene graph parsers that convert image captions into scene graphs often suffer from two types of errors.
First, the generated scene graphs fail to capture the true semantics of the captions or the corresponding images, resulting in a lack of faithfulness.
Second, the generated scene graphs have high inconsistency, with the same semantics represented by different annotations.
arXiv Detail & Related papers (2023-05-27T15:38:31Z)
- Weakly Supervised Vision-and-Language Pre-training with Relative Representations [76.63610760577214]
Weakly supervised vision-and-language pre-training has been shown to effectively reduce the data cost of pre-training.
Current methods use only local descriptions of images, i.e., object tags, as cross-modal anchors to construct weakly-aligned image-text pairs for pre-training.
arXiv Detail & Related papers (2023-05-24T18:10:24Z)
- Zero-shot Image Captioning by Anchor-augmented Vision-Language Space Alignment [23.072180427273544]
We observe that directly employing CLIP for zero-shot image captioning relies more on the textual modality in context and largely ignores the visual information.
To address this, we propose Cross-modal Language Models (CLMs) to facilitate unsupervised cross-modal learning.
Experiments on MS COCO and Flickr 30K validate the promising performance of the proposed approach in both captioning quality and computational efficiency.
arXiv Detail & Related papers (2022-11-14T11:12:19Z)
- Learning to Generate Scene Graph from Natural Language Supervision [52.18175340725455]
We propose one of the first methods that learn from image-sentence pairs to extract a graphical representation of localized objects and their relationships within an image, known as a scene graph.
We leverage an off-the-shelf object detector to identify and localize object instances, match the labels of detected regions to concepts parsed from the captions, and thus create "pseudo" labels for learning the scene graph (a rough sketch of this matching step is given below).
arXiv Detail & Related papers (2021-09-06T03:38:52Z)
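A rough sketch of the pseudo-labeling step mentioned above, assuming simple string and synonym matching between detector class names and caption noun phrases; the PseudoNode structure and the matching rule are hypothetical illustrations, not the paper's exact procedure.

```python
from dataclasses import dataclass

@dataclass
class PseudoNode:
    region_index: int  # index of the detected box
    concept: str       # caption concept assigned to that region

def match_regions_to_concepts(detected_labels, caption_concepts, synonyms=None):
    """detected_labels: detector class names, one per region.
    caption_concepts: noun phrases parsed from the caption.
    synonyms: optional dict mapping a detector label to acceptable aliases."""
    synonyms = synonyms or {}
    nodes = []
    for i, label in enumerate(detected_labels):
        for concept in caption_concepts:
            # Accept an exact match or a known alias of the detector label.
            if concept == label or concept in synonyms.get(label, ()):
                nodes.append(PseudoNode(region_index=i, concept=concept))
                break
    return nodes

# Example: the detector finds ["dog", "person"]; the caption mentions "dog" and "frisbee".
# match_regions_to_concepts(["dog", "person"], ["dog", "frisbee"])
# -> [PseudoNode(region_index=0, concept='dog')]
```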
- Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps.
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss.
We show that the scale of our corpus can make up for its noise, leading to state-of-the-art representations even with such a simple learning scheme (a minimal dual-encoder sketch is given below).
arXiv Detail & Related papers (2021-02-11T10:08:12Z)
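A minimal two-tower sketch in the spirit of the dual-encoder plus contrastive loss described above; the backbone interfaces, projection heads, embedding size, and temperature are placeholder assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoder(nn.Module):
    """Two towers project images and alt-texts into a shared embedding space."""
    def __init__(self, image_backbone, text_backbone, img_dim, txt_dim, embed_dim=256):
        super().__init__()
        self.image_backbone = image_backbone  # any module returning (B, img_dim)
        self.text_backbone = text_backbone    # any module returning (B, txt_dim)
        self.image_proj = nn.Linear(img_dim, embed_dim)
        self.text_proj = nn.Linear(txt_dim, embed_dim)

    def forward(self, images, texts, temperature=0.05):
        img = F.normalize(self.image_proj(self.image_backbone(images)), dim=-1)
        txt = F.normalize(self.text_proj(self.text_backbone(texts)), dim=-1)
        logits = img @ txt.t() / temperature
        labels = torch.arange(logits.size(0), device=logits.device)
        # Each image should retrieve its own alt-text within the batch, and vice versa.
        return 0.5 * (F.cross_entropy(logits, labels) +
                      F.cross_entropy(logits.t(), labels))
```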
- Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture that better exploits the semantics available in captions to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)
- VirTex: Learning Visual Representations from Textual Annotations [25.104705278771895]
VirTex is a pretraining approach using semantically dense captions to learn visual representations.
We train convolutional networks from scratch on COCO Captions, and transfer them to downstream recognition tasks.
On all tasks, VirTex yields features that match or exceed those learned on ImageNet -- supervised or unsupervised.
arXiv Detail & Related papers (2020-06-11T17:58:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.