CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement
- URL: http://arxiv.org/abs/2310.14108v1
- Date: Sat, 21 Oct 2023 20:20:13 GMT
- Title: CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement
- Authors: Mohammadreza Salehi, Mehrdad Farajtabar, Maxwell Horton, Fartash
Faghri, Hadi Pouransari, Raviteja Vemulapalli, Oncel Tuzel, Ali Farhadi,
Mohammad Rastegari, Sachin Mehta
- Abstract summary: Contrastive language image pretraining (CLIP) is a standard method for training vision-language models.
We augment CLIP training with task-specific vision models from model zoos to improve its visual representations.
This simple setup shows substantial improvements of up to 16.3% across different vision tasks.
- Score: 65.47237619200442
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contrastive language image pretraining (CLIP) is a standard method for
training vision-language models. While CLIP is scalable, promptable, and robust
to distribution shifts on image classification tasks, it lacks object
localization capabilities. This paper studies the following question: Can we
augment CLIP training with task-specific vision models from model zoos to
improve its visual representations? Towards this end, we leverage open-source
task-specific vision models to generate pseudo-labels for an uncurated and
noisy image-text dataset. Subsequently, we train CLIP models on these
pseudo-labels in addition to the contrastive training on image and text pairs.
This simple setup shows substantial improvements of up to 16.3% across
different vision tasks, including segmentation, detection, depth estimation,
and surface normal estimation. Importantly, these enhancements are achieved
without compromising CLIP's existing capabilities, including its proficiency in
promptable zero-shot classification.
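The training setup described in the abstract (contrastive training on image-text pairs plus supervision from expert-generated pseudo-labels) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function names, the L1 form of the pseudo-label loss, the temperature value, and the single `task_weight` are hypothetical and are not taken from the paper, which may use different per-task losses and weighting.

```python
import math


def _log_softmax_at(row, idx):
    """Log-softmax of `row`, evaluated at position `idx` (numerically stable)."""
    m = max(row)
    lse = m + math.log(sum(math.exp(v - m) for v in row))
    return row[idx] - lse


def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched (image, text) pairs.

    Embeddings are lists of float vectors, assumed to be L2-normalized.
    Pair k of img_emb is assumed to match pair k of txt_emb.
    """
    n = len(img_emb)
    logits = [
        [sum(a * b for a, b in zip(i, t)) / temperature for t in txt_emb]
        for i in img_emb
    ]
    # Image-to-text direction: each row should peak on the diagonal.
    i2t = -sum(_log_softmax_at(logits[k], k) for k in range(n)) / n
    # Text-to-image direction: same computation on the transposed logits.
    cols = [[logits[r][c] for r in range(n)] for c in range(n)]
    t2i = -sum(_log_softmax_at(cols[k], k) for k in range(n)) / n
    return 0.5 * (i2t + t2i)


def pseudo_label_l1_loss(pred, pseudo):
    """Mean absolute error between a task head's output and an expert's pseudo-label.

    Hypothetical stand-in for a dense-task loss (e.g. depth); not the paper's loss.
    """
    return sum(abs(p - q) for p, q in zip(pred, pseudo)) / len(pred)


def total_loss(img_emb, txt_emb, task_preds, task_pseudo, task_weight=1.0):
    """Contrastive loss plus a weighted sum of pseudo-label losses, one per task."""
    loss = clip_contrastive_loss(img_emb, txt_emb)
    for name, pred in task_preds.items():
        loss += task_weight * pseudo_label_l1_loss(pred, task_pseudo[name])
    return loss
```

In this sketch, `task_preds` and `task_pseudo` are dictionaries keyed by task name (for example `"depth"`), holding flattened predictions and the corresponding pseudo-labels produced offline by a model-zoo expert. Keeping the contrastive term unchanged and adding the task terms on top mirrors the abstract's claim that the extra supervision is applied in addition to, not instead of, the image-text objective.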
Related papers
- Open-Vocabulary Semantic Segmentation with Image Embedding Balancing [33.69721994194684]
We propose a novel framework for open-vocabulary semantic segmentation called EBSeg.
The AdaB Decoder is designed to generate different image embeddings for both training and new classes.
The SSC Loss aligns the inter-class affinity in the image feature space with that in the text feature space of CLIP.
arXiv Detail & Related papers (2024-06-14T08:34:20Z)
- UMG-CLIP: A Unified Multi-Granularity Vision Generalist for Open-World Understanding [93.45067274442881]
This paper extends Contrastive language-image pre-training (CLIP) with multi-granularity alignment.
We develop a unified multi-granularity learning framework, named UMG-CLIP, that simultaneously empowers the model with versatile perception abilities across different levels of detail.
arXiv Detail & Related papers (2024-01-12T06:35:09Z)
- SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference [12.872308743619403]
We enhance contrastive language-image pretraining's potential for semantic segmentation.
By rethinking self-attention, we find that CLIP can adapt to dense prediction tasks.
We replace the traditional self-attention block in the last layer of CLIP's vision encoder with our CSA module.
arXiv Detail & Related papers (2023-12-04T03:18:46Z)
- SILC: Improving Vision Language Pretraining with Self-Distillation [113.50400246862056]
We introduce SILC, a novel framework for vision language pretraining.
SILC improves image-text contrastive learning with the simple addition of local-to-global correspondence learning by self-distillation.
We show that distilling local image features from an exponential moving average (EMA) teacher model significantly improves model performance on dense prediction tasks like detection and segmentation.
arXiv Detail & Related papers (2023-10-20T08:44:47Z)
- SgVA-CLIP: Semantic-guided Visual Adapting of Vision-Language Models for Few-shot Image Classification [84.05253637260743]
We propose a new framework, named Semantic-guided Visual Adapting (SgVA), to extend vision-language pre-trained models.
SgVA produces discriminative task-specific visual features by comprehensively using a vision-specific contrastive loss, a cross-modal contrastive loss, and an implicit knowledge distillation.
State-of-the-art results on 13 datasets demonstrate that the adapted visual features can well complement the cross-modal features to improve few-shot image classification.
arXiv Detail & Related papers (2022-11-28T14:58:15Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
- A Simple Baseline for Zero-shot Semantic Segmentation with Pre-trained Vision-language Model [61.58071099082296]
It is unclear how to make zero-shot recognition work well on broader vision problems, such as object detection and semantic segmentation.
In this paper, we target zero-shot semantic segmentation by building on an off-the-shelf pre-trained vision-language model, i.e., CLIP.
Our experimental results show that this simple framework surpasses previous state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2021-12-29T18:56:18Z)