Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning
- URL: http://arxiv.org/abs/2104.03135v2
- Date: Thu, 8 Apr 2021 01:03:43 GMT
- Title: Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning
- Authors: Zhicheng Huang, Zhaoyang Zeng, Yupan Huang, Bei Liu, Dongmei Fu, Jianlong Fu
- Abstract summary: "See Out of tHe bOx" takes a whole image as input and learns vision-language representation in an end-to-end manner.
SOHO achieves absolute gains of 2.0% R@1 score on the MSCOCO text retrieval 5k test split, 1.5% accuracy on the NLVR$^2$ test-P split, and 6.7% accuracy on the SNLI-VE test split.
- Score: 31.895442072646254
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study joint learning of Convolutional Neural Network (CNN) and Transformer
for vision-language pre-training (VLPT) which aims to learn cross-modal
alignments from millions of image-text pairs. State-of-the-art approaches
extract salient image regions and align regions with words step-by-step. As
region-based visual features usually represent parts of an image, it is
challenging for existing vision-language models to fully understand the
semantics from paired natural languages. In this paper, we propose SOHO to "See
Out of tHe bOx" that takes a whole image as input, and learns vision-language
representation in an end-to-end manner. SOHO does not require bounding box
annotations, which enables inference 10 times faster than region-based
approaches. In particular, SOHO learns to extract comprehensive yet compact
image features through a visual dictionary (VD) that facilitates cross-modal
understanding. VD is designed to represent consistent visual abstractions of
similar semantics. It is updated on-the-fly and utilized in our proposed
pre-training task Masked Visual Modeling (MVM). We conduct experiments on four
well-established vision-language tasks by following standard VLPT settings. In
particular, SOHO achieves absolute gains of 2.0% in R@1 score on the MSCOCO text retrieval 5k test split, 1.5% in accuracy on the NLVR$^2$ test-P split, and 6.7% in accuracy on the SNLI-VE test split.
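The visual dictionary and MVM objective can be made concrete with a short sketch. The PyTorch code below is a minimal illustration under assumed settings, not the authors' implementation: the dictionary size, feature dimension, momentum value, straight-through gradient trick, and the names `VisualDictionary` and `mvm_loss` are assumptions made purely for illustration.

```python
# Minimal sketch of a visual dictionary (VD) with on-the-fly updates and a
# Masked Visual Modeling (MVM) loss, in the spirit of the abstract above.
# All sizes, the momentum value, and helper names are illustrative assumptions.
import torch
import torch.nn.functional as F


class VisualDictionary(torch.nn.Module):
    def __init__(self, num_entries: int = 2048, dim: int = 768, momentum: float = 0.99):
        super().__init__()
        self.momentum = momentum
        # Dictionary entries live in a buffer and are updated on the fly,
        # not by back-propagation (an assumption consistent with the abstract).
        self.register_buffer("entries", torch.randn(num_entries, dim))

    @torch.no_grad()
    def _update(self, feats: torch.Tensor, idx: torch.Tensor) -> None:
        # Moving-average update of each entry toward the features it matched.
        for k in idx.unique():
            matched = feats[idx == k].mean(dim=0)
            self.entries[k] = self.momentum * self.entries[k] + (1 - self.momentum) * matched

    def forward(self, feats: torch.Tensor):
        # feats: (N, dim) grid features from the CNN backbone.
        dists = torch.cdist(feats, self.entries)   # (N, num_entries) pairwise distances
        idx = dists.argmin(dim=1)                  # nearest-entry index per grid feature
        quantized = self.entries[idx]              # compact visual "tokens"
        if self.training:
            self._update(feats.detach(), idx)
        # Straight-through estimator so gradients still reach the CNN features.
        quantized = feats + (quantized - feats).detach()
        return quantized, idx


def mvm_loss(logits: torch.Tensor, vd_index: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Masked Visual Modeling: predict the VD index of masked grid positions.

    logits:   (N, num_entries) Transformer predictions for each grid position
    vd_index: (N,) dictionary indices produced by the VD
    mask:     (N,) boolean, True where the visual token was masked
    """
    return F.cross_entropy(logits[mask], vd_index[mask])
```

The design point the abstract emphasizes is that the dictionary is updated on the fly from the CNN grid features themselves, so visually similar regions map to the same compact entry, and MVM can then supervise masked positions by asking the model to recover that entry's index.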
Related papers
- FINEMATCH: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction [66.98008357232428]
We propose FineMatch, a new aspect-based fine-grained text and image matching benchmark.
FineMatch focuses on text and image mismatch detection and correction.
We show that models trained on FineMatch demonstrate enhanced proficiency in detecting fine-grained text and image mismatches.
arXiv Detail & Related papers (2024-04-23T03:42:14Z)
- Advancing Visual Grounding with Scene Knowledge: Benchmark and Method [74.72663425217522]
Visual grounding (VG) aims to establish fine-grained alignment between vision and language.
Most existing VG datasets are constructed using simple description texts.
We propose a novel benchmark of Scene Knowledge-guided Visual Grounding.
arXiv Detail & Related papers (2023-07-21T13:06:02Z)
- Bootstrapping Vision-Language Learning with Decoupled Language Pre-training [46.570154746311935]
We present a novel methodology aimed at optimizing the application of frozen large language models (LLMs) for resource-intensive vision-language pre-training.
Our approach diverges from prior work by concentrating on the language component, specifically identifying the optimal prompts to align with visual features.
Our framework is modality-agnostic and flexible in terms of architectural design, as validated by its successful application in a video learning task.
arXiv Detail & Related papers (2023-07-13T21:08:15Z)
- Scene Text Recognition with Image-Text Matching-guided Dictionary [17.073688809336456]
We propose a new dictionary language model leveraging the Scene Image-Text Matching (SITM) network.
Inspired by image-text contrastive (ITC) learning, the SITM network combines the visual features and the text features of all candidates to identify the candidate with the minimum distance in the feature space.
Our lexicon method achieves better results (93.8% accuracy) than the ordinary method (92.1% accuracy) on six mainstream benchmarks.
arXiv Detail & Related papers (2023-05-08T07:47:49Z)
- DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-training via Word-Region Alignment [104.54362490182335]
DetCLIPv2 is an efficient training framework that incorporates large-scale image-text pairs to achieve open-vocabulary object detection.
DetCLIPv2 directly learns the fine-grained word-region alignment from massive image-text pairs in an end-to-end manner.
With 13M image-text pairs for pre-training, DetCLIPv2 demonstrates superior open-vocabulary detection performance.
arXiv Detail & Related papers (2023-04-10T11:08:15Z)
- Fine-Grained Semantically Aligned Vision-Language Pre-Training [151.7372197904064]
Large-scale vision-language pre-training has shown impressive advances in a wide range of downstream tasks.
Existing methods mainly model the cross-modal alignment by the similarity of the global representations of images and texts.
We introduce LOUPE, a fine-grained semantically aLigned visiOn-langUage PrE-training framework, which learns fine-grained semantic alignment from the novel perspective of game-theoretic interactions.
arXiv Detail & Related papers (2022-08-04T07:51:48Z)
- Vision-Language Pre-Training for Boosting Scene Text Detectors [57.08046351495244]
We specifically adapt vision-language joint learning for scene text detection.
We propose to learn contextualized, joint representations through vision-language pre-training.
The pre-trained model is able to produce more informative representations with richer semantics.
arXiv Detail & Related papers (2022-04-29T03:53:54Z)
- Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts [14.808701042367401]
We argue that the use of object detection may not be suitable for vision language pre-training.
This paper proposes a new method called X-VLM to perform multi-grained vision language pre-training.
arXiv Detail & Related papers (2021-11-16T07:55:26Z)
- Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps.
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss.
We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
arXiv Detail & Related papers (2021-02-11T10:08:12Z)
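The last entry above describes a dual-encoder trained with a contrastive loss on noisy image/alt-text pairs. The snippet below is a minimal sketch of such a symmetric contrastive (InfoNCE-style) objective; the embedding dimension, temperature value, and the name `contrastive_loss` are illustrative assumptions rather than that paper's configuration.

```python
# Minimal sketch of a dual-encoder contrastive objective over paired
# image/text embeddings. Temperature and shapes are illustrative assumptions.
import torch
import torch.nn.functional as F


def contrastive_loss(image_emb: torch.Tensor, text_emb: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_emb, text_emb: (batch, dim) outputs of two independent encoders.
    Matching pairs share the same row index; every other row acts as a negative.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (batch, batch) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Average the image-to-text and text-to-image directions.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```

The point made in that entry is that, at the scale of a billion pairs, this simple objective is enough for the signal to dominate the noise in alt-text supervision.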