Masked Visual Reconstruction in Language Semantic Space
- URL: http://arxiv.org/abs/2301.06958v1
- Date: Tue, 17 Jan 2023 15:32:59 GMT
- Title: Masked Visual Reconstruction in Language Semantic Space
- Authors: Shusheng Yang, Yixiao Ge, Kun Yi, Dian Li, Ying Shan, Xiaohu Qie,
Xinggang Wang
- Abstract summary: The masked visual Reconstruction In Language semantic Space (RILS) pre-training framework is presented.
RILS transforms vision-only signals into patch-sentence probabilities as semantically meaningful MIM reconstruction targets.
Our method exhibits advanced transferability on downstream classification, detection, and segmentation.
- Score: 38.43966132249977
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Both masked image modeling (MIM) and natural language supervision have
facilitated the progress of transferable visual pre-training. In this work, we
seek the synergy between two paradigms and study the emerging properties when
MIM meets natural language supervision. To this end, we present a novel masked
visual Reconstruction In Language semantic Space (RILS) pre-training framework,
in which sentence representations, encoded by the text encoder, serve as
prototypes to transform the vision-only signals into patch-sentence
probabilities as semantically meaningful MIM reconstruction targets. The vision
models can therefore capture useful components with structured information by
predicting the proper semantics of masked tokens. Better visual representations
could, in turn, improve the text encoder via the image-text alignment
objective, which is essential for the effective MIM target transformation.
Extensive experimental results demonstrate that our method not only enjoys the
best of previous MIM and CLIP but also achieves further improvements on various
tasks due to their mutual benefits. RILS exhibits advanced transferability on
downstream classification, detection, and segmentation, especially for low-shot
regimes. Code will be made available at https://github.com/hustvl/RILS.
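The target-transformation mechanism described above lends itself to a compact illustration. The following PyTorch-style sketch shows one plausible way to turn text-encoder sentence embeddings into patch-sentence probability targets and to compute a masked reconstruction loss against them; the function name, temperatures, and teacher/student split are illustrative assumptions rather than the authors' released implementation (see the repository linked above for the actual code).

```python
import torch
import torch.nn.functional as F

def rils_mim_targets_and_loss(
    patch_feats_student,   # (B, N, D): features predicted for masked patches
    patch_feats_teacher,   # (B, N, D): target patch features from the full (unmasked) view
    text_protos,           # (K, D): sentence embeddings from the text encoder, used as prototypes
    mask,                  # (B, N) bool: True where a patch was masked
    tau_s=0.1, tau_t=0.05, # student / target temperatures (assumed values)
):
    """Hedged sketch of masked reconstruction in language semantic space.

    Instead of regressing raw pixels, each patch feature is projected onto
    text-encoder sentence prototypes and converted into a patch-sentence
    probability distribution; the model predicts this distribution for
    masked patches via a soft cross-entropy.
    """
    # L2-normalize so dot products act as cosine similarities, CLIP-style.
    s = F.normalize(patch_feats_student, dim=-1)
    t = F.normalize(patch_feats_teacher, dim=-1)
    p = F.normalize(text_protos, dim=-1)

    # Patch-sentence similarity logits: (B, N, K)
    logits_s = s @ p.t() / tau_s
    logits_t = t @ p.t() / tau_t

    # Targets are soft distributions over sentences; no gradient flows through them.
    with torch.no_grad():
        target = logits_t.softmax(dim=-1)

    # Cross-entropy between predicted and target patch-sentence distributions,
    # averaged over masked patches only.
    log_pred = logits_s.log_softmax(dim=-1)
    loss_per_patch = -(target * log_pred).sum(dim=-1)      # (B, N)
    loss = (loss_per_patch * mask).sum() / mask.sum().clamp(min=1)
    return loss
```

In the paper, this reconstruction term is trained jointly with an image-text alignment objective, so that better visual representations in turn improve the text encoder and hence the prototypes; a generic version of that alignment term is sketched after the related-papers list below.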
Related papers
- FILS: Self-Supervised Video Feature Prediction In Semantic Language Space [11.641926922266347]
This paper demonstrates a self-supervised approach for learning semantic video representations.
We present FILS, a novel self-supervised video Feature prediction In semantic Language Space.
arXiv Detail & Related papers (2024-06-05T16:44:06Z)
- Semantics-enhanced Cross-modal Masked Image Modeling for Vision-Language Pre-training [87.69394953339238]
Masked image modeling (MIM) has recently been introduced for fine-grained cross-modal alignment.
We propose a semantics-enhanced cross-modal MIM framework (SemMIM) for vision-language representation learning.
arXiv Detail & Related papers (2024-03-01T03:25:58Z)
- MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining [138.86293836634323]
MaskCLIP incorporates a newly proposed masked self-distillation into contrastive language-image pretraining (a generic sketch of this contrastive objective follows the list).
MaskCLIP achieves superior results in linear probing, finetuning, and zero-shot performance with the guidance of the language encoder.
arXiv Detail & Related papers (2022-08-25T17:59:58Z)
- Vision-Language Pre-Training for Boosting Scene Text Detectors [57.08046351495244]
We specifically adapt vision-language joint learning for scene text detection.
We propose to learn contextualized, joint representations through vision-language pre-training.
The pre-trained model is able to produce more informative representations with richer semantics.
arXiv Detail & Related papers (2022-04-29T03:53:54Z)
- Single-Stream Multi-Level Alignment for Vision-Language Pretraining [103.09776737512078]
We propose a single-stream model that aligns the modalities at multiple levels.
We achieve this using two novel tasks: symmetric cross-modality reconstruction and pseudo-labeled keyword prediction.
We demonstrate top performance on a set of vision-language downstream tasks such as zero-shot/fine-tuned image/text retrieval, referring expression, and VQA.
arXiv Detail & Related papers (2022-03-27T21:16:10Z)
- Enhanced Modality Transition for Image Captioning [51.72997126838352]
We build a Modality Transition Module (MTM) to transfer visual features into semantic representations before forwarding them to the language model.
During the training phase, the modality transition network is optimised by the proposed modality loss.
Experiments have been conducted on the MS-COCO dataset, demonstrating the effectiveness of the proposed framework.
arXiv Detail & Related papers (2021-02-23T07:20:12Z)
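Several entries above, as well as RILS itself, pair their masked or reconstruction objectives with a CLIP-style contrastive image-text alignment loss. The snippet below is a minimal, generic sketch of that symmetric InfoNCE term; the variable names and temperature value are assumptions for illustration and are not taken from any specific paper listed here.

```python
import torch
import torch.nn.functional as F

def clip_style_alignment_loss(image_embeds, text_embeds, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched (image, text) pairs.

    image_embeds, text_embeds: (B, D) pooled embeddings for B matched pairs.
    Matched pairs sit on the diagonal of the similarity matrix; all other
    in-batch pairs serve as negatives. The temperature is an assumed typical
    value, not one quoted from the papers above.
    """
    img = F.normalize(image_embeds, dim=-1)
    txt = F.normalize(text_embeds, dim=-1)

    logits = img @ txt.t() / temperature           # (B, B) cosine-similarity logits
    labels = torch.arange(logits.size(0), device=logits.device)

    # Image-to-text and text-to-image cross-entropy, averaged.
    loss_i2t = F.cross_entropy(logits, labels)
    loss_t2i = F.cross_entropy(logits.t(), labels)
    return 0.5 * (loss_i2t + loss_t2i)
```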
This list is automatically generated from the titles and abstracts of the papers in this site.