MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image
Pretraining
- URL: http://arxiv.org/abs/2208.12262v2
- Date: Sun, 9 Apr 2023 15:59:26 GMT
- Title: MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image
Pretraining
- Authors: Xiaoyi Dong and Jianmin Bao and Yinglin Zheng and Ting Zhang and
Dongdong Chen and Hao Yang and Ming Zeng and Weiming Zhang and Lu Yuan and
Dong Chen and Fang Wen and Nenghai Yu
- Abstract summary: MaskCLIP incorporates a newly proposed masked self-distillation into contrastive language-image pretraining.
MaskCLIP achieves superior results in linear probing, finetuning, and zero-shot performance with the guidance of the language encoder.
- Score: 138.86293836634323
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents MaskCLIP, a simple yet effective framework that
incorporates a newly proposed masked self-distillation objective into
contrastive language-image pretraining. The core idea of masked
self-distillation is to distill the representation of a full image into the
representation predicted from a masked image. This incorporation enjoys two
vital benefits. First, masked self-distillation targets local patch
representation learning, which is complementary to the vision-language
contrastive objective that focuses on text-related representations. Second,
masked self-distillation is also consistent with the vision-language
contrastive objective from the perspective of the training target, as both
use the visual encoder for feature alignment; the masked objective is thus
able to learn local semantics with indirect supervision from the language. We
provide specially designed experiments with a comprehensive analysis to
validate these two benefits. Symmetrically, we also introduce local semantic
supervision into the text branch, which further improves the pretraining
performance. With extensive experiments, we show that MaskCLIP, when applied
to various challenging downstream tasks, achieves superior results in linear
probing, finetuning, and zero-shot performance with the guidance of the
language encoder. Code will be released at
\url{https://github.com/LightDXY/MaskCLIP}.
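
To make the two training objectives concrete, below is a minimal PyTorch-style
sketch of how a contrastive image-text loss and a masked self-distillation
loss could be combined. The encoder architecture, mask token, mask ratio, EMA
momentum, and cosine-style distillation loss are all illustrative assumptions
for exposition, not the authors' exact implementation (see the released code
for that).

```python
import copy

import torch
import torch.nn as nn
import torch.nn.functional as F


def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # Symmetric InfoNCE: matched image-text pairs are positives,
    # all other pairs in the batch serve as negatives.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))


def masked_self_distillation_loss(student, teacher, mask_token, patches,
                                  mask_ratio=0.75):
    # Teacher encodes the full image; the student sees a masked view and
    # must predict the teacher's patch features at the masked positions.
    B, N, D = patches.shape
    mask = torch.rand(B, N, device=patches.device) < mask_ratio
    masked_in = torch.where(mask.unsqueeze(-1),
                            mask_token.expand(B, N, D), patches)
    with torch.no_grad():
        target = F.normalize(teacher(patches), dim=-1)
    pred = F.normalize(student(masked_in), dim=-1)
    # Cosine distillation on masked patches only (an assumed simplification).
    return (1 - (pred * target).sum(-1))[mask].mean()


@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    # The teacher is an exponential moving average of the student.
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)


# Toy usage: a tiny transformer stands in for the visual encoder, and random
# tensors stand in for patch embeddings and pooled text features.
dim = 64
student = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
    num_layers=2)
teacher = copy.deepcopy(student).eval()
for p in teacher.parameters():
    p.requires_grad_(False)
mask_token = torch.zeros(1, 1, dim)

patches = torch.randn(8, 49, dim)       # B x N x D patch embeddings
txt_emb = torch.randn(8, dim)           # pooled features from the text branch
img_emb = student(patches).mean(dim=1)  # pooled image features

loss = (clip_contrastive_loss(img_emb, txt_emb)
        + masked_self_distillation_loss(student, teacher, mask_token, patches))
loss.backward()
ema_update(teacher, student)
```

Note that in this sketch the same visual encoder produces both the pooled
embedding for the contrastive loss and the patch features for the distillation
loss; this shared pathway is what lets the masked objective pick up indirect
supervision from the language side, as argued in the abstract.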
Related papers
- Masked Visual Reconstruction in Language Semantic Space [38.43966132249977]
The Masked visual Reconstruction In Language semantic Space (RILS) pre-training framework is presented.
RILS transforms vision-only signals into patch-sentence probabilities that serve as semantically meaningful MIM reconstruction targets.
Our method exhibits strong transferability on downstream classification, detection, and segmentation.
arXiv Detail & Related papers (2023-01-17T15:32:59Z)
- Leveraging per Image-Token Consistency for Vision-Language Pre-training [52.825150269820696]
Cross-modal masked language modeling (CMLM) is insufficient for vision-language pre-training.
We propose EPIC (lEveraging Per Image-Token Consistency for vision-language pre-training).
The proposed EPIC method is easily combined with existing pre-training methods.
arXiv Detail & Related papers (2022-11-20T12:10:53Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
- Exploring Visual Interpretability for Contrastive Language-Image Pre-training [23.569964756096986]
Contrastive language-image pre-training learns rich representations via readily available supervision from natural language.
The visual interpretability of CLIP has not yet been studied.
We integrate the above methods as Interpretable Contrastive Language-Image pre-training (ICLIP).
arXiv Detail & Related papers (2022-09-15T05:01:03Z)
- MaskOCR: Text Recognition with Masked Encoder-Decoder Pretraining [68.05105411320842]
We propose MaskOCR, a novel approach that unifies vision and language pre-training in the classical encoder-decoder recognition framework.
We adopt masked image modeling to pre-train the feature encoder on a large set of unlabeled real text images.
We transform text data into synthesized text images to unify the data modalities of vision and language, enhancing the language modeling capability of the sequence decoder.
arXiv Detail & Related papers (2022-06-01T08:27:19Z)
- Open-Vocabulary Instance Segmentation via Robust Cross-Modal Pseudo-Labeling [61.03262873980619]
Open-vocabulary instance segmentation aims at segmenting novel classes without mask annotations.
We propose a cross-modal pseudo-labeling framework, which generates training pseudo masks by aligning word semantics in captions with the visual features of object masks in images.
Our framework can label novel classes in captions via their word semantics to self-train a student model.
arXiv Detail & Related papers (2021-11-24T18:50:47Z)
- Self-Supervised Visual Representations Learning by Contrastive Mask Prediction [129.25459808288025]
We propose a novel contrastive mask prediction (CMP) task for visual representation learning.
MaskCo contrasts region-level features instead of view-level features, which makes it possible to identify positive samples without any assumptions.
We evaluate MaskCo on training datasets beyond ImageNet and compare its performance with MoCo V2.
arXiv Detail & Related papers (2021-08-18T02:50:33Z)