GOPro: Generate and Optimize Prompts in CLIP using Self-Supervised
Learning
- URL: http://arxiv.org/abs/2308.11605v1
- Date: Tue, 22 Aug 2023 17:53:26 GMT
- Title: GOPro: Generate and Optimize Prompts in CLIP using Self-Supervised
Learning
- Authors: Mainak Singha, Ankit Jha, Biplab Banerjee
- Abstract summary: We propose a prompt learning-based model called GOPro to overcome the challenges of combining CLIP's contrastive loss with SSL's loss.
GOPro is trained end-to-end on all three loss objectives, combining the strengths of CLIP and SSL in a principled manner.
- Score: 14.532939492926406
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale foundation models, such as CLIP, have demonstrated remarkable
success in visual recognition tasks by embedding images in a semantically rich
space. Self-supervised learning (SSL) has also shown promise in improving
visual recognition by learning invariant features. However, the combination of
CLIP with SSL is found to face challenges due to the multi-task framework that
blends CLIP's contrastive loss and SSL's loss, including difficulties with loss
weighting and inconsistency among different views of images in CLIP's output
space. To overcome these challenges, we propose a prompt learning-based model
called GOPro, which is a unified framework that ensures similarity between
various augmented views of input images in a shared image-text embedding space,
using a pair of learnable image and text projectors atop CLIP, to promote
invariance and generalizability. To automatically learn such prompts, we
leverage the visual content and style primitives extracted from pre-trained
CLIP and adapt them to the target task. In addition to CLIP's cross-domain
contrastive loss, we introduce a visual contrastive loss and a novel prompt
consistency loss, considering the different views of the images. GOPro is
trained end-to-end on all three loss objectives, combining the strengths of
CLIP and SSL in a principled manner. Empirical evaluations demonstrate that
GOPro outperforms the state-of-the-art prompting techniques on three
challenging domain generalization tasks across multiple benchmarks by a
significant margin. Our code is available at
https://github.com/mainaksingha01/GOPro.
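The abstract describes three objectives trained end-to-end atop CLIP: the cross-domain (image-text) contrastive loss, a visual contrastive loss over augmented views, and a prompt consistency loss, all computed through a pair of learnable image and text projectors. The sketch below is only a rough illustration of how such a combined objective could look in PyTorch; the projector shapes, temperature, loss forms, and equal weighting are assumptions made for illustration, not details taken from the paper or its released code.

```python
# Minimal sketch (not the authors' implementation) of a GOPro-style combined objective.
# All hyperparameters and layer sizes below are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GOProStyleLosses(nn.Module):
    def __init__(self, embed_dim=512, proj_dim=256, temperature=0.07):
        super().__init__()
        # Learnable image and text projectors placed atop CLIP's (frozen) encoders.
        self.image_proj = nn.Linear(embed_dim, proj_dim)
        self.text_proj = nn.Linear(embed_dim, proj_dim)
        self.temperature = temperature

    def info_nce(self, a, b):
        # Symmetric InfoNCE between two batches of L2-normalized embeddings.
        a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
        logits = a @ b.t() / self.temperature
        labels = torch.arange(a.size(0), device=a.device)
        return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

    def forward(self, img_feat_v1, img_feat_v2, txt_feat_v1, txt_feat_v2):
        # img_feat_v1/v2: CLIP image features of two augmented views of the same batch.
        # txt_feat_v1/v2: CLIP text features of the prompts produced for each view.
        z1, z2 = self.image_proj(img_feat_v1), self.image_proj(img_feat_v2)
        t1, t2 = self.text_proj(txt_feat_v1), self.text_proj(txt_feat_v2)

        # 1) CLIP-style cross-modal contrastive loss (image vs. text).
        loss_clip = self.info_nce(z1, t1)
        # 2) Visual contrastive loss enforcing agreement between augmented views.
        loss_visual = self.info_nce(z1, z2)
        # 3) Prompt consistency loss keeping the prompts of different views close.
        loss_prompt = F.mse_loss(F.normalize(t1, dim=-1), F.normalize(t2, dim=-1))

        # End-to-end objective; equal weighting is an assumption, not the paper's choice.
        return loss_clip + loss_visual + loss_prompt
```

In the actual method, the text features come from prompts learned automatically from CLIP's visual content and style primitives; here they are simply passed in as precomputed tensors to keep the sketch self-contained.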
Related papers
- TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives [65.82577305915643]
Contrastive Language-Image Pretraining (CLIP) models maximize the mutual information between text and visual modalities to learn representations.
We show that generating "hard" negative captions via in-context learning and corresponding negative images with text-to-image generators offers a solution.
We demonstrate that our method, named TripletCLIP, enhances the compositional capabilities of CLIP, resulting in an absolute improvement of over 9% on the SugarCrepe benchmark.
arXiv Detail & Related papers (2024-11-04T19:24:59Z)
- Diffusion Feedback Helps CLIP See Better [40.125318318373715]
Contrastive Language-Image Pre-training (CLIP) excels at abstracting open-world representations across domains and modalities.
CLIP has severe visual shortcomings; for example, it can hardly distinguish orientation, quantity, color, and structure.
We present a post-training approach for CLIP models, which largely overcomes its visual shortcomings via a self-supervised diffusion process.
arXiv Detail & Related papers (2024-07-29T17:00:09Z)
- Open-Vocabulary Semantic Segmentation with Image Embedding Balancing [33.69721994194684]
We propose a novel framework for open-vocabulary semantic segmentation called EBSeg.
AdaB Decoder is designed to generate different image embeddings for both training and new classes.
SSC Loss aligns the inter-classes affinity in the image feature space with that in the text feature space of CLIP.
arXiv Detail & Related papers (2024-06-14T08:34:20Z)
- UMG-CLIP: A Unified Multi-Granularity Vision Generalist for Open-World Understanding [90.74967596080982]
This paper extends Contrastive Language-Image Pre-training (CLIP) with multi-granularity alignment.
We develop a Unified Multi-Granularity learning framework, termed UMG-CLIP, which simultaneously empowers the model with versatile perception abilities.
With parameter efficient tuning, UMG-CLIP surpasses current widely used CLIP variants and achieves state-of-the-art performance on diverse image understanding benchmarks.
arXiv Detail & Related papers (2024-01-12T06:35:09Z)
- C-SAW: Self-Supervised Prompt Learning for Image Generalization in Remote Sensing [12.930814370829893]
We focus on domain and class generalization problems in analyzing optical remote sensing images, using the large-scale pre-trained vision-language model (VLM), CLIP.
Existing prompt learning techniques overlook the importance of incorporating domain and content information into the prompts.
We propose a solution that ensures domain-invariant prompt learning while enhancing the expressiveness of visual features.
arXiv Detail & Related papers (2023-11-27T13:35:20Z)
- CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement [65.47237619200442]
Contrastive language image pretraining (CLIP) is a standard method for training vision-language models.
We augment CLIP training with task-specific vision models from model zoos to improve its visual representations.
This simple setup shows substantial improvements of up to 16.3% across different vision tasks.
arXiv Detail & Related papers (2023-10-21T20:20:13Z)
- Composed Image Retrieval using Contrastive Learning and Task-oriented CLIP-based Features [32.138956674478116]
Given a query composed of a reference image and a relative caption, the Composed Image Retrieval goal is to retrieve images visually similar to the reference one.
We use features from the OpenAI CLIP model to tackle the considered task.
We train a Combiner network that learns to combine the image-text features integrating the bimodal information.
arXiv Detail & Related papers (2023-08-22T15:03:16Z)
- S-CLIP: Semi-supervised Vision-Language Learning using Few Specialist Captions [69.01985134519244]
Vision-language models, such as contrastive language-image pre-training (CLIP), have demonstrated impressive results in natural image domains.
We propose S-CLIP, a semi-supervised learning method for training CLIP that utilizes additional unpaired images.
S-CLIP improves CLIP by 10% for zero-shot classification and 4% for image-text retrieval on the remote sensing benchmark.
arXiv Detail & Related papers (2023-05-23T14:18:11Z)
- Iterative Prompt Learning for Unsupervised Backlit Image Enhancement [86.90993077000789]
We propose a novel unsupervised backlit image enhancement method, abbreviated as CLIP-LIT.
We show that the open-world CLIP prior aids in distinguishing between backlit and well-lit images.
Our method alternates between updating the prompt learning framework and enhancement network until visually pleasing results are achieved.
arXiv Detail & Related papers (2023-03-30T17:37:14Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)