CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement
- URL: http://arxiv.org/abs/2310.14108v1
- Date: Sat, 21 Oct 2023 20:20:13 GMT
- Title: CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement
- Authors: Mohammadreza Salehi, Mehrdad Farajtabar, Maxwell Horton, Fartash
Faghri, Hadi Pouransari, Raviteja Vemulapalli, Oncel Tuzel, Ali Farhadi,
Mohammad Rastegari, Sachin Mehta
- Abstract summary: Contrastive language image pretraining (CLIP) is a standard method for training vision-language models.
We augment CLIP training with task-specific vision models from model zoos to improve its visual representations.
This simple setup shows substantial improvements of up to 16.3% across different vision tasks.
- Score: 65.47237619200442
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contrastive language image pretraining (CLIP) is a standard method for
training vision-language models. While CLIP is scalable, promptable, and robust
to distribution shifts on image classification tasks, it lacks object
localization capabilities. This paper studies the following question: Can we
augment CLIP training with task-specific vision models from model zoos to
improve its visual representations? Towards this end, we leverage open-source
task-specific vision models to generate pseudo-labels for an uncurated and
noisy image-text dataset. Subsequently, we train CLIP models on these
pseudo-labels in addition to the contrastive training on image and text pairs.
This simple setup shows substantial improvements of up to 16.3% across
different vision tasks, including segmentation, detection, depth estimation,
and surface normal estimation. Importantly, these enhancements are achieved
without compromising CLIP's existing capabilities, including its proficiency in
promptable zero-shot classification.
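Concretely, the recipe keeps the standard image-text contrastive objective and adds auxiliary task heads on the CLIP image encoder that are trained against pseudo-labels produced by frozen, off-the-shelf expert models. The following PyTorch-style sketch illustrates one training step under assumed module names (image_encoder, text_encoder, task_heads, experts); it is a simplified illustration of the idea, not the authors' implementation.

import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # Standard symmetric InfoNCE over matching image-text pairs in a batch.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def training_step(images, texts, image_encoder, text_encoder, task_heads, experts, alpha=1.0):
    img_emb, img_features = image_encoder(images)   # pooled embedding + dense feature map (assumed API)
    txt_emb = text_encoder(texts)
    loss = clip_contrastive_loss(img_emb, txt_emb)
    # Pseudo-supervision: each frozen expert (e.g. segmentation, depth, surface normals)
    # labels the uncurated images; a lightweight head on the CLIP features learns to match it.
    for task, expert in experts.items():
        with torch.no_grad():
            pseudo_label = expert(images)            # e.g. a segmentation map or depth map
        prediction = task_heads[task](img_features)
        loss = loss + alpha * task_heads[task].loss(prediction, pseudo_label)  # head-specific loss (assumed)
    return loss

In this sketch the experts only label the batch; gradients flow through the CLIP encoder and the lightweight heads, so the contrastive capability is trained jointly with the dense prediction tasks.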
Related papers
- TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives [65.82577305915643]
Contrastive Language-Image Pretraining (CLIP) models maximize the mutual information between text and visual modalities to learn representations.
We show that generating "hard" negative captions via in-context learning and corresponding negative images with text-to-image generators offers a solution.
We demonstrate that our method, named TripletCLIP, enhances the compositional capabilities of CLIP, resulting in an absolute improvement of over 9% on the SugarCrepe benchmark.
arXiv Detail & Related papers (2024-11-04T19:24:59Z)
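A rough way to picture the TripletCLIP recipe above is a contrastive objective in which each sample's synthetic hard-negative caption and hard-negative image are appended as extra candidates. The sketch below is a hedged illustration under assumed encoder objects and inputs (neg_texts, neg_images), not the paper's exact loss.

import torch
import torch.nn.functional as F

def hard_negative_contrastive_loss(image_encoder, text_encoder,
                                   images, texts, neg_images, neg_texts,
                                   temperature=0.07):
    img = F.normalize(image_encoder(images), dim=-1)
    txt = F.normalize(text_encoder(texts), dim=-1)
    n_img = F.normalize(image_encoder(neg_images), dim=-1)   # renders of the negative captions
    n_txt = F.normalize(text_encoder(neg_texts), dim=-1)     # LLM-generated "hard" negative captions

    # Image-to-text: the matching caption competes with all in-batch captions
    # plus the sample's own hard-negative caption, appended as one extra logit column.
    logits_i2t = torch.cat([img @ txt.t(), (img * n_txt).sum(-1, keepdim=True)], dim=1) / temperature
    # Text-to-image: symmetric, with the synthetic negative image appended.
    logits_t2i = torch.cat([txt @ img.t(), (txt * n_img).sum(-1, keepdim=True)], dim=1) / temperature

    targets = torch.arange(images.size(0), device=images.device)
    return 0.5 * (F.cross_entropy(logits_i2t, targets) + F.cross_entropy(logits_t2i, targets))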
- Diffusion Feedback Helps CLIP See Better [40.125318318373715]
Contrastive Language-Image Pre-training (CLIP) excels at abstracting open-world representations across domains and modalities.
CLIP has severe visual shortcomings: it can hardly distinguish orientation, quantity, color, and structure.
We present a post-training approach for CLIP models, which largely overcomes its visual shortcomings via a self-supervised diffusion process.
arXiv Detail & Related papers (2024-07-29T17:00:09Z)
- Open-Vocabulary Semantic Segmentation with Image Embedding Balancing [33.69721994194684]
We propose a novel framework for open-vocabulary semantic segmentation called EBSeg.
AdaB Decoder is designed to generate different image embeddings for both training and new classes.
SSC Loss aligns the inter-class affinity in the image feature space with that in the text feature space of CLIP.
arXiv Detail & Related papers (2024-06-14T08:34:20Z)
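The SSC Loss described in the EBSeg entry above can be pictured as matching two class-affinity matrices. The sketch below is an assumed formulation (KL divergence between row-wise softmaxed cosine affinities), not EBSeg's actual code.

import torch.nn.functional as F

def affinity_alignment_loss(image_class_embs, text_class_embs, temperature=0.1):
    # image_class_embs, text_class_embs: (num_classes, dim) tensors (assumed shapes).
    img = F.normalize(image_class_embs, dim=-1)
    txt = F.normalize(text_class_embs, dim=-1)
    img_affinity = img @ img.t() / temperature   # inter-class affinity in the image feature space
    txt_affinity = txt @ txt.t() / temperature   # inter-class affinity in CLIP's text feature space
    # Align the row-wise affinity distributions of the image space to those of the text space.
    return F.kl_div(F.log_softmax(img_affinity, dim=-1),
                    F.softmax(txt_affinity, dim=-1),
                    reduction="batchmean")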
- SILC: Improving Vision Language Pretraining with Self-Distillation [113.50400246862056]
We introduce SILC, a novel framework for vision-language pretraining.
SILC improves image-text contrastive learning with the simple addition of local-to-global correspondence learning by self-distillation.
We show that distilling local image features from an exponential moving average (EMA) teacher model significantly improves model performance on dense prediction tasks like detection and segmentation.
arXiv Detail & Related papers (2023-10-20T08:44:47Z)
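The EMA-teacher self-distillation mentioned in the SILC entry above could be sketched as follows, with hypothetical student/teacher encoders and pre-cropped global/local views; this is a simplified illustration, not SILC's implementation.

import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, momentum=0.996):
    # Teacher weights follow an exponential moving average of the student's weights.
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(momentum).add_(s_param.detach(), alpha=1.0 - momentum)

def local_to_global_distillation_loss(student, teacher, global_view, local_views, temperature=0.1):
    with torch.no_grad():
        target = F.softmax(teacher(global_view) / temperature, dim=-1)   # teacher sees the full image
    loss = 0.0
    for crop in local_views:
        # Local crops fed to the student should predict the teacher's global representation,
        # which encourages local features useful for detection and segmentation.
        log_pred = F.log_softmax(student(crop) / temperature, dim=-1)
        loss = loss - (target * log_pred).sum(-1).mean()
    return loss / len(local_views)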
- SgVA-CLIP: Semantic-guided Visual Adapting of Vision-Language Models for Few-shot Image Classification [84.05253637260743]
We propose a new framework, named Semantic-guided Visual Adapting (SgVA), to extend vision-language pre-trained models.
SgVA produces discriminative task-specific visual features by comprehensively using a vision-specific contrastive loss, a cross-modal contrastive loss, and an implicit knowledge distillation.
State-of-the-art results on 13 datasets demonstrate that the adapted visual features can well complement the cross-modal features to improve few-shot image classification.
arXiv Detail & Related papers (2022-11-28T14:58:15Z)
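The three objectives named in the SgVA-CLIP entry above can be pictured as a weighted sum; in the sketch below the contrastive losses are passed in as callables and the "implicit knowledge distillation" is approximated by an MSE term toward the frozen CLIP features, both of which are assumptions for illustration rather than the paper's formulation.

import torch.nn.functional as F

def sgva_style_loss(adapted_feats, frozen_clip_feats, text_feats, labels,
                    vision_contrastive_loss, cross_modal_contrastive_loss,
                    w_vision=1.0, w_cross=1.0, w_distill=1.0):
    l_vision = vision_contrastive_loss(adapted_feats, labels)          # image-image, label-aware term
    l_cross = cross_modal_contrastive_loss(adapted_feats, text_feats)  # image-text contrastive term
    l_distill = F.mse_loss(adapted_feats, frozen_clip_feats.detach())  # stay close to frozen CLIP features
    return w_vision * l_vision + w_cross * l_cross + w_distill * l_distill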
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
- A Simple Baseline for Zero-shot Semantic Segmentation with Pre-trained Vision-language Model [61.58071099082296]
It is unclear how to make zero-shot recognition work well on broader vision problems, such as object detection and semantic segmentation.
In this paper, we target zero-shot semantic segmentation by building on an off-the-shelf pre-trained vision-language model, i.e., CLIP.
Our experimental results show that this simple framework surpasses previous state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2021-12-29T18:56:18Z)