Fine-Grained Semantically Aligned Vision-Language Pre-Training
- URL: http://arxiv.org/abs/2208.02515v1
- Date: Thu, 4 Aug 2022 07:51:48 GMT
- Title: Fine-Grained Semantically Aligned Vision-Language Pre-Training
- Authors: Juncheng Li, Xin He, Longhui Wei, Long Qian, Linchao Zhu, Lingxi Xie,
Yueting Zhuang, Qi Tian, Siliang Tang
- Abstract summary: Large-scale vision-language pre-training has shown impressive advances in a wide range of downstream tasks.
Existing methods mainly model the cross-modal alignment by the similarity of the global representations of images and texts.
We introduce LOUPE, a fine-grained semantically aLigned visiOn-langUage PrE-training framework, which learns fine-grained semantic alignment from the novel perspective of game-theoretic interactions.
- Score: 151.7372197904064
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale vision-language pre-training has shown impressive advances in a
wide range of downstream tasks. Existing methods mainly model the cross-modal
alignment by the similarity of the global representations of images and texts,
or advanced cross-modal attention upon image and text features. However, they
fail to explicitly learn the fine-grained semantic alignment between visual
regions and textual phrases, as only global image-text alignment information is
available. In this paper, we introduce LOUPE, a fine-grained semantically
aLigned visiOn-langUage PrE-training framework, which learns fine-grained
semantic alignment from the novel perspective of game-theoretic interactions.
To efficiently compute the game-theoretic interactions, we further propose an
uncertainty-aware neural Shapley interaction learning module. Experiments show
that LOUPE achieves state-of-the-art on image-text retrieval benchmarks.
Without any object-level human annotations and fine-tuning, LOUPE achieves
competitive performance on object detection and visual grounding. More
importantly, LOUPE opens a new promising direction of learning fine-grained
semantics from large-scale raw image-text pairs.
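The abstract does not spell out how a game-theoretic interaction is computed. As a rough illustration only, the sketch below evaluates the standard Shapley interaction index from cooperative game theory by exact enumeration on a toy game. It is not LOUPE's uncertainty-aware neural module, and every name in it (`shapley_interaction`, `toy_value`, the player labels) is a hypothetical chosen for this example.

```python
from itertools import combinations
from math import factorial


def shapley_interaction(players, value, i, j):
    """Exact Shapley interaction index between players i and j.

    `value` maps a frozenset of players to a real-valued payoff.
    The cost grows exponentially with the number of players.
    """
    others = [p for p in players if p not in (i, j)]
    n = len(players)
    total = 0.0
    for size in range(len(others) + 1):
        # Weight of coalitions of this size that exclude both i and j.
        weight = factorial(size) * factorial(n - size - 2) / factorial(n - 1)
        for subset in combinations(others, size):
            s = frozenset(subset)
            # Joint contribution of i and j beyond their separate contributions.
            delta = value(s | {i, j}) - value(s | {i}) - value(s | {j}) + value(s)
            total += weight * delta
    return total


# Hypothetical toy game: one image region and two words act as players.
PLAYERS = ["region_dog", "word_dog", "word_sky"]


def toy_value(coalition):
    """Assumed payoff: a matching region-word pair adds 1.0, 'word_sky' adds 0.2."""
    score = 0.0
    if "region_dog" in coalition and "word_dog" in coalition:
        score += 1.0
    if "word_sky" in coalition:
        score += 0.2
    return score


aligned = shapley_interaction(PLAYERS, toy_value, "region_dog", "word_dog")
unrelated = shapley_interaction(PLAYERS, toy_value, "region_dog", "word_sky")
print(aligned, unrelated)
```

In this toy game the matching region-word pair receives a clearly positive interaction while the unrelated pair receives roughly zero. Exact enumeration visits every coalition of the remaining players, so it does not scale, which is consistent with the abstract's stated need to compute these interactions efficiently.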
Related papers
- NEVLP: Noise-Robust Framework for Efficient Vision-Language Pre-training [6.34265125858783]
We propose a noise-robust framework for efficient vision-language pre-training that requires less pre-training data.
Specifically, we bridge the modality gap between a frozen image encoder and a large language model with a transformer.
We introduce two innovative learning strategies: noise-adaptive learning and concept-enhanced learning.
arXiv Detail & Related papers (2024-09-15T01:54:17Z)
- Exploring Interactive Semantic Alignment for Efficient HOI Detection with Vision-language Model [3.3772986620114387]
We introduce ISA-HOI, which extensively leverages knowledge from CLIP, aligning interactive semantics between visual and textual features.
Our method achieves competitive results on the HICO-DET and V-COCO benchmarks with far fewer training epochs, and outperforms the state-of-the-art under zero-shot settings.
arXiv Detail & Related papers (2024-04-19T07:24:32Z)
- Rewrite Caption Semantics: Bridging Semantic Gaps for Language-Supervised Semantic Segmentation [100.81837601210597]
We propose Concept Curation (CoCu) to bridge the gap between visual and textual semantics in pre-training data.
CoCu achieves superb zero-shot transfer performance and boosts the language-supervised segmentation baseline by a large margin.
arXiv Detail & Related papers (2023-09-24T00:05:39Z)
- Vision-Language Pre-Training for Boosting Scene Text Detectors [57.08046351495244]
We specifically adapt vision-language joint learning for scene text detection.
We propose to learn contextualized, joint representations through vision-language pre-training.
The pre-trained model is able to produce more informative representations with richer semantics.
arXiv Detail & Related papers (2022-04-29T03:53:54Z)
- Vision-Language Pre-Training with Triple Contrastive Learning [45.80365827890119]
We propose triple contrastive learning (TCL) for vision-language pre-training by leveraging both cross-modal and intra-modal self-supervision.
Ours is the first work that takes into account local structure information for multi-modality representation learning.
arXiv Detail & Related papers (2022-02-21T17:54:57Z)
- Self-Supervised Image-to-Text and Text-to-Image Synthesis [23.587581181330123]
We propose a novel self-supervised deep learning based approach towards learning the cross-modal embedding spaces.
In our approach, we first obtain dense vector representations of images using a StackGAN-based autoencoder, and sentence-level dense vector representations using an LSTM-based text autoencoder.
arXiv Detail & Related papers (2021-12-09T13:54:56Z)
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning for achieving the text-to-pixel alignment.
Our proposed framework significantly outperforms the state-of-the-art performance without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
- FILIP: Fine-grained Interactive Language-Image Pre-Training [106.19474076935363]
Fine-grained Interactive Language-Image Pre-training achieves finer-level alignment through a cross-modal late interaction mechanism.
We construct a new large-scale image-text pair dataset called FILIP300M for pre-training.
Experiments show that FILIP achieves state-of-the-art performance on multiple downstream vision-language tasks.
arXiv Detail & Related papers (2021-11-09T17:15:38Z)
- Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks [207.52609682812147]
We propose a new learning method, Oscar (Object-Semantics Aligned Pre-training).
It uses object tags detected in images as anchor points to significantly ease the learning of alignments.
We pre-train an Oscar model on the public corpus of 6.5 million text-image pairs, and fine-tune it on downstream tasks.
arXiv Detail & Related papers (2020-04-13T19:18:10Z)