Domain Aligned CLIP for Few-shot Classification
- URL: http://arxiv.org/abs/2311.09191v1
- Date: Wed, 15 Nov 2023 18:34:26 GMT
- Title: Domain Aligned CLIP for Few-shot Classification
- Authors: Muhammad Waleed Gondal, Jochen Gast, Inigo Alonso Ruiz, Richard Droste, Tommaso Macri, Suren Kumar, Luitpold Staudigl
- Abstract summary: Domain Aligned CLIP (DAC) improves both intra-modal (image-image) and inter-modal alignment on target distributions without fine-tuning the main model.
We study the effectiveness of DAC by benchmarking on 11 widely used image classification tasks, where it consistently improves 16-shot classification over strong baselines by about 2.3%.
- Score: 3.5326413171911555
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large vision-language representation learning models like CLIP have
demonstrated impressive performance for zero-shot transfer to downstream tasks
while largely benefiting from inter-modal (image-text) alignment via
contrastive objectives. This downstream performance can further be enhanced by
full-scale fine-tuning which is often compute intensive, requires large
labelled data, and can reduce out-of-distribution (OOD) robustness.
Furthermore, sole reliance on inter-modal alignment might overlook the rich
information embedded within each individual modality. In this work, we
introduce a sample-efficient domain adaptation strategy for CLIP, termed Domain
Aligned CLIP (DAC), which improves both intra-modal (image-image) and
inter-modal alignment on target distributions without fine-tuning the main
model. For intra-modal alignment, we introduce a lightweight adapter that is
specifically trained with an intra-modal contrastive objective. To improve
inter-modal alignment, we introduce a simple framework to modulate the
precomputed class text embeddings. The proposed few-shot fine-tuning framework
is computationally efficient, robust to distribution shifts, and does not alter
CLIP's parameters. We study the effectiveness of DAC by benchmarking on 11
widely used image classification tasks, where it consistently improves 16-shot
classification over strong baselines by about 2.3%, and we demonstrate
competitive performance on 4 OOD robustness benchmarks.
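To make the two components above concrete, here is a minimal sketch of (1) a lightweight adapter over frozen CLIP image features trained with a supervised image-image contrastive objective, and (2) a learnable residual modulation of precomputed class text embeddings. The module names, the residual designs, and the exact loss are illustrative assumptions, not the authors' implementation:

```python
# Hedged sketch of DAC-style adaptation over a frozen CLIP backbone.
# Names, residual design, and loss formulation are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class IntraModalAdapter(nn.Module):
    """Lightweight bottleneck adapter; CLIP's own weights stay frozen."""
    def __init__(self, dim: int = 512, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, dim))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # Residual connection preserves the pretrained representation.
        return F.normalize(feats + self.net(feats), dim=-1)

def intra_modal_contrastive_loss(feats: torch.Tensor,
                                 labels: torch.Tensor,
                                 tau: float = 0.07) -> torch.Tensor:
    """Supervised image-image contrastive loss over a few-shot batch:
    same-class features are pulled together, other classes pushed apart."""
    sim = feats @ feats.t() / tau
    diag = torch.eye(len(feats), dtype=torch.bool, device=feats.device)
    sim = sim.masked_fill(diag, -1e9)   # exclude self-pairs from the softmax
    pos = (labels[:, None] == labels[None, :]) & ~diag
    log_prob = sim.log_softmax(dim=1)
    # Mean log-likelihood of each anchor's positives (clamp avoids 0-division).
    return -(log_prob * pos).sum(1).div(pos.sum(1).clamp(min=1)).mean()

class TextModulator(nn.Module):
    """Modulates precomputed class text embeddings with a learnable residual."""
    def __init__(self, class_text_emb: torch.Tensor):
        super().__init__()
        self.register_buffer("base", class_text_emb)   # (num_classes, dim)
        self.delta = nn.Parameter(torch.zeros_like(class_text_emb))

    def forward(self) -> torch.Tensor:
        return F.normalize(self.base + self.delta, dim=-1)
```

During few-shot adaptation only the adapter and the `delta` residual would receive gradients; a test image is then classified by cosine similarity between its adapted feature and the modulated class text embeddings, leaving CLIP's parameters untouched.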
Related papers
- CLIP Adaptation by Intra-modal Overlap Reduction [1.2277343096128712]
We analyse intra-modal overlap in the image embedding space.
We train a lightweight adapter on a generic set of samples from the Google Open Images dataset.
arXiv Detail & Related papers (2024-09-17T16:40:58Z)
- Towards Self-Supervised FG-SBIR with Unified Sample Feature Alignment and Multi-Scale Token Recycling [11.129453244307369]
FG-SBIR (fine-grained sketch-based image retrieval) aims to minimize the distance between sketches and corresponding images in the embedding space.
We propose an effective approach to narrow the gap between the two domains.
It mainly facilitates unified mutual-information sharing both within and across samples.
arXiv Detail & Related papers (2024-06-17T13:49:12Z)
- VeCAF: Vision-language Collaborative Active Finetuning with Training Objective Awareness [56.87603097348203]
VeCAF uses labels and natural language annotations to perform parametric data selection for pretrained vision model (PVM) finetuning.
VeCAF incorporates the finetuning objective to select significant data points that effectively guide the PVM towards faster convergence.
On ImageNet, VeCAF needs up to 3.3x fewer training batches to reach the target performance than full finetuning.
arXiv Detail & Related papers (2024-01-15T17:28:37Z)
- Can SAM Boost Video Super-Resolution? [78.29033914169025]
We propose a simple yet effective module -- SAM-guidEd refinEment Module (SEEM).
This lightweight plug-in module is specifically designed to leverage the attention mechanism to generate semantic-aware features.
We apply our SEEM to two representative methods, EDVR and BasicVSR, resulting in consistently improved performance with minimal implementation effort.
arXiv Detail & Related papers (2023-05-11T02:02:53Z)
- SoftCLIP: Softer Cross-modal Alignment Makes CLIP Stronger [30.758184720183106]
We propose SoftCLIP, a novel approach that relaxes the strict one-to-one constraint and achieves a soft cross-modal alignment (a hedged sketch of this idea appears after this list).
In particular, on ImageNet zero-shot classification task, using CC3M/CC12M as pre-training dataset, SoftCLIP brings a top-1 accuracy improvement of 6.8%/7.2%.
arXiv Detail & Related papers (2023-03-30T17:27:22Z)
- Understanding and Constructing Latent Modality Structures in Multi-modal Representation Learning [53.68371566336254]
We argue that the key to better performance lies in meaningful latent modality structures instead of perfect modality alignment.
Specifically, we design 1) a deep feature separation loss for intra-modality regularization; 2) a Brownian-bridge loss for inter-modality regularization; and 3) a geometric consistency loss for both intra- and inter-modality regularization.
arXiv Detail & Related papers (2023-03-10T14:38:49Z)
- CLIPood: Generalizing CLIP to Out-of-Distributions [73.86353105017076]
Contrastive language-image pre-training (CLIP) models have shown impressive zero-shot ability, but further adaptation of CLIP on downstream tasks undesirably degrades OOD performance.
We propose CLIPood, a fine-tuning method that can adapt CLIP models to OOD situations where both domain shifts and open classes may occur on unseen test data.
Experiments on diverse datasets with different OOD scenarios show that CLIPood consistently outperforms existing generalization techniques.
arXiv Detail & Related papers (2023-02-02T04:27:54Z)
- Multi-level Consistency Learning for Semi-supervised Domain Adaptation [85.90600060675632]
Semi-supervised domain adaptation (SSDA) aims to apply knowledge learned from a fully labeled source domain to a scarcely labeled target domain.
We propose a Multi-level Consistency Learning framework for SSDA.
arXiv Detail & Related papers (2022-05-09T06:41:18Z)
- Robust Cross-Modal Representation Learning with Progressive Self-Distillation [7.676408770854477]
CLIP's vision-language learning objective does not effectively account for the noisy many-to-many correspondences found in web-harvested image captioning datasets.
We introduce a novel training framework based on cross-modal contrastive learning that uses progressive self-distillation and soft image-text alignments to more efficiently learn robust representations from noisy data.
arXiv Detail & Related papers (2022-04-10T03:28:18Z)
- Semi-supervised Domain Adaptive Structure Learning [72.01544419893628]
Semi-supervised domain adaptation (SSDA) is a challenging problem requiring methods to overcome both 1) overfitting towards poorly annotated data and 2) distribution shift across domains.
We introduce an adaptive structure learning method to regularize the cooperation of semi-supervised learning (SSL) and domain adaptation (DA).
arXiv Detail & Related papers (2021-12-12T06:11:16Z)
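As a concrete reading of the soft cross-modal alignment described in the SoftCLIP entry above, the sketch below replaces the one-hot InfoNCE targets with softened targets derived from intra-modal self-similarity. The temperatures and target construction are assumptions in the spirit of the abstract, not the paper's exact formulation:

```python
# Hedged sketch of soft cross-modal alignment (SoftCLIP-style idea).
import torch
import torch.nn.functional as F

def soft_clip_loss(img: torch.Tensor, txt: torch.Tensor,
                   tau: float = 0.07, tau_soft: float = 0.07) -> torch.Tensor:
    """Cross-modal contrastive loss with softened (non one-hot) targets."""
    img = F.normalize(img, dim=-1)
    txt = F.normalize(txt, dim=-1)
    logits = img @ txt.t() / tau                    # (B, B) image-text similarities
    # Soft targets from intra-modal self-similarity (assumed construction).
    targets_i = (img @ img.t() / tau_soft).softmax(dim=1)
    targets_t = (txt @ txt.t() / tau_soft).softmax(dim=1)
    loss_i = F.cross_entropy(logits, targets_i)     # image -> text direction
    loss_t = F.cross_entropy(logits.t(), targets_t) # text -> image direction
    return 0.5 * (loss_i + loss_t)
```

Because the targets are full distributions rather than one-hot labels, partially matching image-text pairs contribute learning signal instead of being treated as pure negatives, which is the relaxation of the strict one-to-one constraint that the entry describes.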