LatteCLIP: Unsupervised CLIP Fine-Tuning via LMM-Synthetic Texts
- URL: http://arxiv.org/abs/2410.08211v1
- Date: Thu, 10 Oct 2024 17:59:59 GMT
- Title: LatteCLIP: Unsupervised CLIP Fine-Tuning via LMM-Synthetic Texts
- Authors: Anh-Quan Cao, Maximilian Jaritz, Matthieu Guillaumin, Raoul de Charette, Loris Bazzani
- Abstract summary: We propose LatteCLIP, an unsupervised method for fine-tuning CLIP models on classification with known class names in custom domains.
Our method leverages Large Multimodal Models (LMMs) to generate expressive textual descriptions for both individual images and groups of images.
Our experiments on 10 domain-specific datasets show that LatteCLIP outperforms pre-trained zero-shot methods by an average improvement of +4.74 points in top-1 accuracy.
- Score: 17.925954549422567
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale vision-language pre-trained (VLP) models (e.g., CLIP) are renowned for their versatility, as they can be applied to diverse applications in a zero-shot setup. However, when these models are used in specific domains, their performance often falls short due to domain gaps or the under-representation of these domains in the training data. While fine-tuning VLP models on custom datasets with human-annotated labels can address this issue, annotating even a small-scale dataset (e.g., 100k samples) can be an expensive endeavor, often requiring expert annotators if the task is complex. To address these challenges, we propose LatteCLIP, an unsupervised method for fine-tuning CLIP models on classification with known class names in custom domains, without relying on human annotations. Our method leverages Large Multimodal Models (LMMs) to generate expressive textual descriptions for both individual images and groups of images. These provide additional contextual information to guide the fine-tuning process in the custom domains. Since LMM-generated descriptions are prone to hallucination or missing details, we introduce a novel strategy to distill only the useful information and stabilize the training. Specifically, we learn rich per-class prototype representations from noisy generated texts and dual pseudo-labels. Our experiments on 10 domain-specific datasets show that LatteCLIP outperforms pre-trained zero-shot methods by an average improvement of +4.74 points in top-1 accuracy and other state-of-the-art unsupervised methods by +3.45 points.
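For context, the zero-shot baseline that LatteCLIP improves upon classifies an image by matching its CLIP embedding against text prompts built from the known class names, and the resulting predictions can also serve as pseudo-labels for unsupervised fine-tuning. Below is a minimal illustrative sketch of that baseline using the Hugging Face transformers CLIP API; the checkpoint, prompt template, class names, and image path are assumptions for illustration and are not taken from the paper.

```python
# Minimal sketch: zero-shot CLIP classification with known class names,
# the starting point of unsupervised fine-tuning methods like LatteCLIP.
# Checkpoint, prompt template, class names, and image path are placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

# Known class names of the custom domain (illustrative examples only).
class_names = ["airport", "forest", "harbor", "residential area"]
prompts = [f"a photo of a {c}" for c in class_names]

# Any image from the unlabeled target dataset.
image = Image.open("example.jpg")

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image: (num_images, num_prompts) image-text similarities
# scaled by CLIP's learned temperature.
probs = outputs.logits_per_image.softmax(dim=-1)
pseudo_label = probs.argmax(dim=-1)
print(class_names[pseudo_label.item()], probs.max().item())
```

LatteCLIP itself goes further, combining such pseudo-labels with LMM-generated descriptions distilled into per-class prototypes; the sketch only shows the zero-shot starting point against which the reported +4.74-point gain is measured.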
Related papers
- CLIP meets DINO for Tuning Zero-Shot Classifier using Unlabeled Image Collections [22.32157080294386]
We propose a label-free prompt-tuning method to enhance CLIP-based image classification performance using unlabeled images.
Our framework, NoLA (No Labels Attached), achieves an average absolute gain of 3.6% over the state-of-the-art LaFTer.
arXiv Detail & Related papers (2024-11-28T19:48:54Z)
- UMFC: Unsupervised Multi-Domain Feature Calibration for Vision-Language Models [75.77651291095565]
We leverage unlabeled data that naturally spans multiple domains to enhance the transferability of vision-language models.
Under this unsupervised multi-domain setting, we have identified inherent model bias within CLIP.
We propose Unsupervised Multi-domain Feature Calibration (UMFC) to mitigate this model bias.
arXiv Detail & Related papers (2024-11-11T12:25:02Z)
- Improved Zero-Shot Classification by Adapting VLMs with Text Descriptions [24.596929878045568]
We develop methods to train vision-language models (VLMs) with "bag-level" image-text supervision.
We use descriptions of categories generated by large language models (LLMs) and abundant, fine-grained image classification datasets.
Our findings suggest that geographic priors can be just as effective and are complementary to visual appearance.
arXiv Detail & Related papers (2024-01-04T08:39:13Z)
- LAMM: Label Alignment for Multi-Modal Prompt Learning [17.478967970736115]
We introduce an innovative label alignment method named LAMM, which can adjust the category embeddings of downstream datasets through end-to-end training.
Our method significantly improves the performance of existing multi-modal prompt learning models in few-shot scenarios.
Our method also outperforms other prompt-tuning methods in continual learning.
arXiv Detail & Related papers (2023-12-13T15:29:52Z)
- CLAMP: Contrastive LAnguage Model Prompt-tuning [89.96914454453791]
We show that large language models can achieve good image classification performance when adapted this way.
Our approach beats state-of-the-art mLLMs by 13% and slightly outperforms contrastive learning with a custom text model.
arXiv Detail & Related papers (2023-12-04T05:13:59Z)
- VeCLIP: Improving CLIP Training via Visual-enriched Captions [63.547204530720705]
This study introduces a scalable pipeline for noisy caption rewriting.
We emphasize the incorporation of visual concepts into captions, termed Visual-enriched Captions (VeCap).
We showcase the adaptation of this method for training CLIP on large-scale web-crawled datasets, termed VeCLIP.
arXiv Detail & Related papers (2023-10-11T17:49:13Z)
- StyLIP: Multi-Scale Style-Conditioned Prompt Learning for CLIP-based Domain Generalization [26.08922351077744]
StyLIP is a novel approach for Domain Generalization that enhances CLIP's classification performance across domains.
Our method focuses on a domain-agnostic prompt learning strategy, aiming to disentangle the visual style and content information embedded in CLIP's pre-trained vision encoder.
arXiv Detail & Related papers (2023-02-18T07:36:16Z)
- Masked Unsupervised Self-training for Zero-shot Image Classification [98.23094305347709]
Masked Unsupervised Self-Training (MUST) is a new approach that leverages two different and complementary sources of supervision: pseudo-labels and raw images.
MUST improves upon CLIP by a large margin and narrows the performance gap between unsupervised and supervised classification.
arXiv Detail & Related papers (2022-06-07T02:03:06Z)
- CHALLENGER: Training with Attribution Maps [63.736435657236505]
We show that utilizing attribution maps for training neural networks can improve regularization of models and thus increase performance.
In particular, we show that our generic domain-independent approach yields state-of-the-art results on vision, natural language processing, and time-series tasks.
arXiv Detail & Related papers (2022-05-30T13:34:46Z)
- Generate, Annotate, and Learn: Generative Models Advance Self-Training and Knowledge Distillation [58.64720318755764]
Semi-Supervised Learning (SSL) has seen success in many application domains, but this success often hinges on the availability of task-specific unlabeled data.
Knowledge distillation (KD) has enabled compressing deep networks and ensembles, achieving the best results when distilling knowledge on fresh task-specific unlabeled examples.
We present a general framework called "generate, annotate, and learn (GAL)" that uses unconditional generative models to synthesize in-domain unlabeled data.
arXiv Detail & Related papers (2021-06-11T05:01:24Z)
- Personalizing Pre-trained Models [23.145974171912414]
We consider how upstream pretrained models can be leveraged for downstream few-shot, multilabel, and continual learning tasks.
Our model CLIPPER (CLIP PERsonalized) uses image representations from CLIP, a large-scale image representation learning model trained using weak natural language supervision.
arXiv Detail & Related papers (2021-06-02T22:58:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.