Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks
- URL: http://arxiv.org/abs/2004.06165v5
- Date: Sun, 26 Jul 2020 00:46:46 GMT
- Title: Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks
- Authors: Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei
Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, Yejin Choi, Jianfeng Gao
- Abstract summary: We propose a new learning method, Oscar (Object-Semantics Aligned Pre-training).
It uses object tags detected in images as anchor points to significantly ease the learning of alignments.
We pre-train an Oscar model on the public corpus of 6.5 million text-image pairs, and fine-tune it on downstream tasks.
- Score: 207.52609682812147
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale pre-training methods of learning cross-modal representations on
image-text pairs are becoming popular for vision-language tasks. While existing
methods simply concatenate image region features and text features as input to
the model to be pre-trained and use self-attention to learn image-text semantic
alignments in a brute force manner, in this paper, we propose a new learning
method Oscar (Object-Semantics Aligned Pre-training), which uses object tags
detected in images as anchor points to significantly ease the learning of
alignments. Our method is motivated by the observation that the salient objects
in an image can be accurately detected, and are often mentioned in the paired
text. We pre-train an Oscar model on the public corpus of 6.5 million
text-image pairs, and fine-tune it on downstream tasks, creating new
state-of-the-art results on six well-established vision-language understanding and
generation tasks.
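For illustration only, here is a minimal sketch, assuming a BERT-like encoder, of how an Oscar-style input triple (caption word tokens, detected object tags, image region features) can be embedded and concatenated into one sequence. The module name, the 2054-dimensional region features (e.g., 2048-d detector features plus 6-d box coordinates), and the toy tensors are assumptions for this sketch, not the authors' released code.

```python
# Minimal sketch of an Oscar-style (word, tag, region) input sequence.
# Shapes, module names, and the toy data are illustrative assumptions.
import torch
import torch.nn as nn

class OscarStyleInput(nn.Module):
    def __init__(self, vocab_size=30522, hidden=768, region_feat_dim=2054):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, hidden)      # shared by caption words and object tags
        self.region_proj = nn.Linear(region_feat_dim, hidden)  # project detector region features

    def forward(self, word_ids, tag_ids, region_feats):
        # word_ids:     (B, Lw) caption token ids
        # tag_ids:      (B, Lt) detected object-tag token ids (the "anchor points")
        # region_feats: (B, Lr, region_feat_dim) image region features
        words = self.token_emb(word_ids)
        tags = self.token_emb(tag_ids)           # tags live in the same text embedding space
        regions = self.region_proj(region_feats)
        # Concatenate into one sequence; a BERT-like encoder with self-attention
        # then learns word-tag-region alignments over this sequence.
        return torch.cat([words, tags, regions], dim=1)

builder = OscarStyleInput()
seq = builder(torch.randint(0, 30522, (2, 12)),  # caption tokens
              torch.randint(0, 30522, (2, 5)),   # object tags, e.g. "dog", "couch"
              torch.randn(2, 10, 2054))          # 10 detected regions
print(seq.shape)  # torch.Size([2, 27, 768])
```

Because the object tags are embedded with the same text embedding table as the caption words, they can act as the anchor points described in the abstract, linking region features to the words that mention them.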
Related papers
- TIPS: Text-Image Pretraining with Spatial Awareness [13.38247732379754]
Self-supervised image-only pretraining is still the go-to method for many vision applications.
We propose a novel general-purpose image-text model, which can be effectively used off-the-shelf for dense and global vision tasks.
arXiv Detail & Related papers (2024-10-21T21:05:04Z)
- Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment [81.73717488887938]
Language-Quantized AutoEncoder (LQAE) learns to align text-image data in an unsupervised manner by leveraging pretrained language models.
LQAE learns to represent similar images with similar clusters of text tokens, thereby aligning these two modalities without the use of aligned text-image pairs.
This enables few-shot image classification with large language models (e.g., GPT-3) as well as linear classification of images based on BERT text features.
arXiv Detail & Related papers (2023-02-02T06:38:44Z)
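A rough, hypothetical sketch of the language-quantization idea in the LQAE summary above: image patch features are snapped to their nearest neighbours in a frozen text-token embedding table, so an image is described by a sequence of text-token ids. The shapes and random tensors below stand in for real encoder outputs and pretrained LM embeddings.

```python
# Hypothetical sketch: quantize image patch features to the nearest entries
# of a frozen text-token embedding table. Shapes and random data are assumed.
import torch

def language_quantize(patch_feats, text_embeddings):
    # patch_feats:     (num_patches, dim) encoder output for one image
    # text_embeddings: (vocab_size, dim)  frozen pretrained LM token embeddings
    dists = torch.cdist(patch_feats, text_embeddings)  # (num_patches, vocab_size)
    token_ids = dists.argmin(dim=1)                    # nearest text token per patch
    quantized = text_embeddings[token_ids]             # token embeddings fed onward
    return token_ids, quantized

vocab, dim = 30522, 512
text_emb = torch.randn(vocab, dim)  # stands in for BERT/GPT token embeddings
patches = torch.randn(49, dim)      # stands in for a 7x7 grid of patch features
ids, quantized = language_quantize(patches, text_emb)
print(ids.shape, quantized.shape)   # torch.Size([49]) torch.Size([49, 512])
```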
- Fine-Grained Semantically Aligned Vision-Language Pre-Training [151.7372197904064]
Large-scale vision-language pre-training has shown impressive advances in a wide range of downstream tasks.
Existing methods mainly model the cross-modal alignment by the similarity of the global representations of images and texts.
We introduce LOUPE, a fine-grained semantically aligned vision-language pre-training framework, which learns fine-grained semantic alignment from the novel perspective of game-theoretic interactions.
arXiv Detail & Related papers (2022-08-04T07:51:48Z)
- Semantic-Aware Fine-Grained Correspondence [8.29030327276322]
We propose to learn semantic-aware fine-grained correspondence using image-level self-supervised methods.
We design a pixel-level self-supervised learning objective which specifically targets fine-grained correspondence.
Our method surpasses previous state-of-the-art self-supervised methods using convolutional networks on a variety of visual correspondence tasks.
arXiv Detail & Related papers (2022-07-21T12:51:41Z)
- Language Matters: A Weakly Supervised Pre-training Approach for Scene Text Detection and Spotting [69.77701325270047]
This paper presents a weakly supervised pre-training method that can acquire effective scene text representations.
Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features.
Experiments show that our pre-trained model improves the F-score by +2.5% and +4.8%, respectively, when its weights are transferred to other text detection and spotting networks.
arXiv Detail & Related papers (2022-03-08T08:10:45Z)
- DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting [91.56988987393483]
We present a new framework for dense prediction by implicitly and explicitly leveraging the pre-trained knowledge from CLIP.
Specifically, we convert the original image-text matching problem in CLIP to a pixel-text matching problem and use the pixel-text score maps to guide the learning of dense prediction models.
Our method is model-agnostic, which can be applied to arbitrary dense prediction systems and various pre-trained visual backbones.
arXiv Detail & Related papers (2021-12-02T18:59:32Z)
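To make the pixel-text matching idea in the DenseCLIP summary above concrete, here is a minimal sketch that computes a pixel-text score map as the cosine similarity between dense per-pixel features and the text embeddings of K class prompts; the shapes and random inputs are assumptions for illustration, not the paper's code.

```python
# Minimal sketch of a pixel-text score map: cosine similarity between dense
# per-pixel image features and text embeddings of class prompts.
import torch
import torch.nn.functional as F

def pixel_text_score_map(pixel_feats, text_feats):
    # pixel_feats: (B, C, H, W) dense features from a visual backbone
    # text_feats:  (K, C)       text-encoder embeddings of K class prompts
    pixel_feats = F.normalize(pixel_feats, dim=1)
    text_feats = F.normalize(text_feats, dim=1)
    # (B, C, H, W) x (K, C) -> (B, K, H, W): one score map per class
    return torch.einsum("bchw,kc->bkhw", pixel_feats, text_feats)

scores = pixel_text_score_map(torch.randn(2, 512, 32, 32), torch.randn(20, 512))
print(scores.shape)  # torch.Size([2, 20, 32, 32]); could guide a dense prediction head
```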
- Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts [14.808701042367401]
We argue that the use of object detection may not be suitable for vision language pre-training.
This paper proposes a new method called X-VLM to perform multi-grained vision language pre-training.
arXiv Detail & Related papers (2021-11-16T07:55:26Z)
- FILIP: Fine-grained Interactive Language-Image Pre-Training [106.19474076935363]
Fine-grained Interactive Language-Image Pre-training achieves finer-level alignment through a cross-modal late interaction mechanism.
We construct a new large-scale image-text pair dataset called FILIP300M for pre-training.
Experiments show that FILIP achieves state-of-the-art performance on multiple downstream vision-language tasks.
arXiv Detail & Related papers (2021-11-09T17:15:38Z)
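A minimal sketch, under assumed shapes and random inputs, of a token-wise late-interaction similarity in the spirit of the FILIP summary above: each image patch is matched to its most similar text token (and vice versa), and the maxima are averaged into a single image-text score.

```python
# Rough sketch of a token-wise late-interaction similarity: each image patch
# takes its best-matching text token, each text token its best-matching patch,
# and the two directions are averaged. Shapes and inputs are assumed.
import torch
import torch.nn.functional as F

def late_interaction_similarity(image_tokens, text_tokens):
    # image_tokens: (Ni, D) patch embeddings; text_tokens: (Nt, D) word embeddings
    image_tokens = F.normalize(image_tokens, dim=-1)
    text_tokens = F.normalize(text_tokens, dim=-1)
    sim = image_tokens @ text_tokens.T      # (Ni, Nt) token-level similarities
    i2t = sim.max(dim=1).values.mean()      # image-to-text: best text token per patch
    t2i = sim.max(dim=0).values.mean()      # text-to-image: best patch per text token
    return 0.5 * (i2t + t2i)

score = late_interaction_similarity(torch.randn(49, 256), torch.randn(12, 256))
print(float(score))
```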
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.