DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting
- URL: http://arxiv.org/abs/2112.01518v1
- Date: Thu, 2 Dec 2021 18:59:32 GMT
- Title: DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting
- Authors: Yongming Rao, Wenliang Zhao, Guangyi Chen, Yansong Tang, Zheng Zhu,
Guan Huang, Jie Zhou, Jiwen Lu
- Abstract summary: We present a new framework for dense prediction by implicitly and explicitly leveraging the pre-trained knowledge from CLIP.
Specifically, we convert the original image-text matching problem in CLIP to a pixel-text matching problem and use the pixel-text score maps to guide the learning of dense prediction models.
Our method is model-agnostic and can be applied to arbitrary dense prediction systems and various pre-trained visual backbones.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent progress has shown that large-scale pre-training using contrastive
image-text pairs can be a promising alternative for high-quality visual
representation learning from natural language supervision. Benefiting from a
broader source of supervision, this new paradigm exhibits impressive
transferability to downstream classification tasks and datasets. However, the
problem of transferring the knowledge learned from image-text pairs to more
complex dense prediction tasks has barely been explored. In this work, we
present a new framework for dense prediction by implicitly and explicitly
leveraging the pre-trained knowledge from CLIP. Specifically, we convert the
original image-text matching problem in CLIP to a pixel-text matching problem
and use the pixel-text score maps to guide the learning of dense prediction
models. By further using the contextual information from the image to prompt
the language model, we enable the model to better exploit the pre-trained
knowledge. Our method is model-agnostic and can be applied to arbitrary dense
prediction systems and various pre-trained visual backbones, including both
CLIP models and ImageNet pre-trained models. Extensive experiments demonstrate
the superior performance of our method on semantic segmentation, object
detection, and instance segmentation tasks. Code is available at
https://github.com/raoyongming/DenseCLIP.
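To make the pixel-text matching idea concrete, here is a minimal sketch of how pixel-text score maps could be computed from CLIP-style embeddings. This illustrates the idea described in the abstract rather than the official DenseCLIP code; the function name, tensor shapes, and temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def pixel_text_score_maps(dense_features, text_embeddings, temperature=0.07):
    # dense_features:  (B, C, H, W) per-pixel embeddings from the image encoder
    # text_embeddings: (K, C) one embedding per class prompt from the text encoder
    B, C, H, W = dense_features.shape
    pixels = F.normalize(dense_features.flatten(2), dim=1)   # (B, C, H*W), unit norm over channels
    texts = F.normalize(text_embeddings, dim=-1)             # (K, C), unit norm over channels
    # cosine similarity between every pixel and every class prompt, scaled by a temperature
    scores = torch.einsum('kc,bcn->bkn', texts, pixels) / temperature
    return scores.view(B, -1, H, W)                          # (B, K, H, W) pixel-text score maps
```

One plausible use of such score maps, consistent with the abstract, is as auxiliary per-pixel supervision and/or as extra guidance for the dense prediction decoder; the context-aware prompting step, in which contextual information from the image conditions the text prompts, is omitted from this sketch.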
Related papers
- Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning (2024-06-11)
We propose a novel vision model pre-training method called Latent Compression Learning (LCL) for interleaved image-text data.
The training objective can be decomposed into two basic tasks: 1) contrastive learning between visual representation and preceding context, and 2) generating subsequent text based on visual representation.
Our experiments demonstrate that our method not only matches the performance of CLIP on paired pre-training datasets, but can also leverage interleaved pre-training data.
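As a rough illustration of the two-part objective summarized above, the sketch below combines a symmetric contrastive term between the visual representation and the preceding context with a next-token generation term for the subsequent text. The function name, tensor shapes, temperature, and weighting factor are assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def lcl_style_loss(visual_emb, prev_context_emb, text_logits, next_text_ids, alpha=1.0):
    # visual_emb:       (B, D) visual representation of each image
    # prev_context_emb: (B, D) embedding of the text preceding that image
    # text_logits:      (B, T, V) decoder logits for the text following the image
    # next_text_ids:    (B, T) token ids of that subsequent text
    v = F.normalize(visual_emb, dim=-1)
    c = F.normalize(prev_context_emb, dim=-1)
    sim = v @ c.t() / 0.07                                    # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    # task 1: symmetric contrastive loss between vision and preceding context
    contrastive = (F.cross_entropy(sim, targets) + F.cross_entropy(sim.t(), targets)) / 2
    # task 2: autoregressive generation of the subsequent text
    generation = F.cross_entropy(text_logits.flatten(0, 1), next_text_ids.flatten())
    return contrastive + alpha * generation
```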
- Make Prompts Adaptable: Bayesian Modeling for Vision-Language Prompt Learning with Data-Dependent Prior (2024-01-09)
Recent Vision-Language Pretrained models have become the backbone for many downstream tasks.
MLE training can lead the context vector to over-fit dominant image features in the training data.
This paper presents a Bayesian framework for prompt learning that can alleviate overfitting in few-shot learning applications.
- SILC: Improving Vision Language Pretraining with Self-Distillation (2023-10-20)
We introduce SILC, a novel framework for vision language pretraining.
SILC improves image-text contrastive learning with the simple addition of local-to-global correspondence learning by self-distillation.
We show that distilling local image features from an exponential moving average (EMA) teacher model significantly improves model performance on dense prediction tasks such as detection and segmentation.
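Below is a hedged sketch of the self-distillation ingredient described above: a teacher kept as an exponential moving average (EMA) of the student, and a local-to-global loss that matches student features of local crops to teacher features of the full image. The projection dimensions, temperatures, and function names are assumptions rather than SILC's exact recipe.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    # keep the teacher as an exponential moving average of the student's weights
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(momentum).add_(s_param.detach(), alpha=1.0 - momentum)

def local_to_global_distillation(student_local, teacher_global, tau_s=0.1, tau_t=0.04):
    # student_local:  (B, N, D) projected features of N local crops from the student
    # teacher_global: (B, D) projected features of the full image from the EMA teacher
    target = F.softmax(teacher_global / tau_t, dim=-1).unsqueeze(1)   # (B, 1, D) soft targets
    log_pred = F.log_softmax(student_local / tau_s, dim=-1)           # (B, N, D)
    # cross-entropy between the teacher's global view and each student local view
    return -(target * log_pred).sum(dim=-1).mean()
```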
- LPN: Language-guided Prototypical Network for few-shot classification (2023-07-04)
Few-shot classification aims to adapt to new tasks with limited labeled examples.
Recent methods explore suitable measures for the similarity between the query and support images.
We propose a Language-guided Prototypical Network (LPN) for few-shot classification.
- CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment (2022-09-14)
We propose an Omni Crossmodal Learning method equipped with a Video Proxy mechanism built on top of CLIP, namely CLIP-ViP.
Our approach improves the performance of CLIP on video-text retrieval by a large margin.
Our model also achieves SOTA results on a variety of datasets, including MSR-VTT, DiDeMo, LSMDC, and ActivityNet.
- Vision-Language Pre-Training for Boosting Scene Text Detectors (2022-04-29)
We specifically adapt vision-language joint learning for scene text detection.
We propose to learn contextualized, joint representations through vision-language pre-training.
The pre-trained model is able to produce more informative representations with richer semantics.
- CRIS: CLIP-Driven Referring Image Segmentation (2021-11-30)
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning to achieve text-to-pixel alignment.
Our proposed framework significantly outperforms prior state-of-the-art methods without any post-processing.
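As a sketch of what contrastive text-to-pixel alignment can look like, the snippet below scores every pixel feature against the sentence embedding of the referring expression and treats pixels inside the ground-truth mask as positives. The shapes and the specific loss form are assumptions for illustration, not the official CRIS objective.

```python
import torch
import torch.nn.functional as F

def text_to_pixel_contrastive_loss(pixel_feats, text_feat, target_mask, temperature=0.07):
    # pixel_feats: (B, C, H, W) per-pixel features after vision-language decoding
    # text_feat:   (B, C) one sentence-level embedding per referring expression
    # target_mask: (B, H, W) binary ground-truth mask of the referred object
    B, C, H, W = pixel_feats.shape
    pixels = F.normalize(pixel_feats.flatten(2), dim=1)           # (B, C, H*W)
    text = F.normalize(text_feat, dim=-1).unsqueeze(1)            # (B, 1, C)
    logits = torch.bmm(text, pixels).squeeze(1) / temperature     # (B, H*W) pixel-text scores
    labels = target_mask.flatten(1).float()                       # 1 = pixel of the referred object
    # pixels inside the mask are positives for the text, all others are negatives
    return F.binary_cross_entropy_with_logits(logits, labels)
```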
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.