FLEX-CLIP: Feature-Level GEneration Network Enhanced CLIP for X-shot Cross-modal Retrieval
- URL: http://arxiv.org/abs/2411.17454v1
- Date: Tue, 26 Nov 2024 14:12:14 GMT
- Title: FLEX-CLIP: Feature-Level GEneration Network Enhanced CLIP for X-shot Cross-modal Retrieval
- Authors: Jingyou Xie, Jiayi Kuang, Zhenzhou Lin, Jiarui Ouyang, Zishuo Zhao, Ying Shen
- Abstract summary: Few-shot cross-modal retrieval (CMR) retrieves semantically similar instances in another modality, where the target domain contains classes disjoint from the source domain.
Vision-language pretraining methods such as CLIP have shown strong few-shot and zero-shot learning performance, but they still suffer from feature degradation in the target domain and extreme data imbalance.
To tackle these issues, we propose FLEX-CLIP, a novel Feature-level Generation Network Enhanced CLIP.
- Score: 10.26297663751352
- License:
- Abstract: Given a query from one modality, few-shot cross-modal retrieval (CMR) retrieves semantically similar instances in another modality, where the target domain includes classes that are disjoint from the source domain. Compared with classical few-shot CMR methods, vision-language pretraining methods like CLIP have shown strong few-shot or zero-shot learning performance. However, they still face challenges due to (1) the feature degradation encountered in the target domain and (2) the extreme data imbalance. To tackle these issues, we propose FLEX-CLIP, a novel Feature-level Generation Network Enhanced CLIP. FLEX-CLIP includes two training stages. In multimodal feature generation, we propose a composite multimodal VAE-GAN network to capture real feature distribution patterns and generate pseudo samples based on CLIP features, addressing data imbalance. For common space projection, we develop a gate residual network to fuse CLIP features with projected features, reducing feature degradation in X-shot scenarios. Experimental results on four benchmark datasets show a 7%-15% improvement over state-of-the-art methods, with ablation studies demonstrating the enhancement of CLIP features.
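The abstract names a gate residual network that fuses frozen CLIP features with common-space projections but gives no implementation details here. The snippet below is a minimal sketch of that general idea only; the module name (GatedResidualFusion), layer sizes, sigmoid gate design, and the extra alignment layer are illustrative assumptions, not FLEX-CLIP's released code.

```python
# Hedged sketch: gated residual fusion between a frozen CLIP feature and its
# common-space projection. All dimensions and module names are assumptions.
import torch
import torch.nn as nn


class GatedResidualFusion(nn.Module):
    """Fuse a frozen CLIP feature with its common-space projection via a learned gate."""

    def __init__(self, clip_dim: int = 512, common_dim: int = 512):
        super().__init__()
        # Projection into the common retrieval space.
        self.project = nn.Sequential(
            nn.Linear(clip_dim, common_dim),
            nn.ReLU(),
            nn.Linear(common_dim, common_dim),
        )
        # Per-dimension gate in [0, 1], conditioned on both features.
        self.gate = nn.Sequential(
            nn.Linear(clip_dim + common_dim, common_dim),
            nn.Sigmoid(),
        )
        # Residual path that aligns the raw CLIP feature to the common space.
        self.align = nn.Linear(clip_dim, common_dim)

    def forward(self, clip_feat: torch.Tensor) -> torch.Tensor:
        projected = self.project(clip_feat)
        g = self.gate(torch.cat([clip_feat, projected], dim=-1))
        # The gate decides, per dimension, how much projected signal to keep
        # versus how much of the (aligned) original CLIP feature to pass through,
        # which is one way to limit feature degradation.
        return g * projected + (1.0 - g) * self.align(clip_feat)


if __name__ == "__main__":
    fusion = GatedResidualFusion()
    dummy_clip_features = torch.randn(4, 512)  # stand-in for CLIP image or text embeddings
    fused = fusion(dummy_clip_features)
    print(fused.shape)  # torch.Size([4, 512])
```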
Related papers
- Self-Calibrated CLIP for Training-Free Open-Vocabulary Segmentation [19.749490092520006]
Self-Calibrated CLIP (SC-CLIP) is a training-free method that calibrates CLIP to produce finer-grained language representations.
SC-CLIP boosts the performance of vanilla CLIP ViT-L/14 by 6.8 times.
arXiv Detail & Related papers (2024-11-24T15:14:05Z)
- ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference [32.852004564832455]
We re-investigate the architecture of CLIP, and identify residual connections as the primary source of noise that degrades segmentation quality.
We propose ClearCLIP, a novel approach that decomposes CLIP's representations to enhance open-vocabulary semantic segmentation.
arXiv Detail & Related papers (2024-07-17T09:52:20Z)
- CapS-Adapter: Caption-based MultiModal Adapter in Zero-Shot Classification [3.594351309950969]
CapS-Adapter is an innovative method that harnesses both image and caption features to exceed existing state-of-the-art techniques in training-free scenarios.
Our method achieves outstanding zero-shot classification results across 19 benchmark datasets, improving accuracy by 2.19% over the previous leading method.
arXiv Detail & Related papers (2024-05-26T14:50:40Z)
- AMU-Tuning: Effective Logit Bias for CLIP-based Few-shot Learning [50.78033979438031]
We first introduce a unified formulation to analyze CLIP-based few-shot learning methods from a perspective of logit bias.
Based on an analysis of key components, this paper proposes a novel AMU-Tuning method to learn an effective logit bias for CLIP-based few-shot classification (see the sketch after this list).
arXiv Detail & Related papers (2024-04-13T10:46:11Z)
- FairerCLIP: Debiasing CLIP's Zero-Shot Predictions using Functions in RKHSs [24.991684983495542]
This paper proposes FairerCLIP, a general approach for making zero-shot predictions of CLIP more fair and robust to spurious correlations.
We formulate the problem of jointly debiasing CLIP's image and text representations in reproducing kernel Hilbert spaces (RKHSs).
arXiv Detail & Related papers (2024-03-22T19:41:26Z)
- VadCLIP: Adapting Vision-Language Models for Weakly Supervised Video Anomaly Detection [58.47940430618352]
We propose VadCLIP, a new paradigm for weakly supervised video anomaly detection (WSVAD).
VadCLIP makes full use of fine-grained associations between vision and language on the strength of CLIP.
We conduct extensive experiments on two commonly-used benchmarks, demonstrating that VadCLIP achieves the best performance on both coarse-grained and fine-grained WSVAD.
arXiv Detail & Related papers (2023-08-22T14:58:36Z)
- CLIP-guided Prototype Modulating for Few-shot Action Recognition [49.11385095278407]
This work aims to transfer the powerful multimodal knowledge of CLIP to alleviate the inaccurate prototype estimation issue.
We present a CLIP-guided prototype modulating framework called CLIP-FSAR, which consists of a video-text contrastive objective and a prototype modulation.
arXiv Detail & Related papers (2023-03-06T09:17:47Z)
- CLIPood: Generalizing CLIP to Out-of-Distributions [73.86353105017076]
Contrastive language-image pre-training (CLIP) models have shown impressive zero-shot ability, but the further adaptation of CLIP on downstream tasks undesirably degrades OOD performances.
We propose CLIPood, a fine-tuning method that can adapt CLIP models to OOD situations where both domain shifts and open classes may occur on unseen test data.
Experiments on diverse datasets with different OOD scenarios show that CLIPood consistently outperforms existing generalization techniques.
arXiv Detail & Related papers (2023-02-02T04:27:54Z)
- ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension [114.85628613911713]
Large-scale pre-trained models are useful for image classification across domains.
We present ReCLIP, a simple but strong zero-shot baseline that repurposes CLIP, a state-of-the-art large-scale model, for referring expression comprehension (ReC).
arXiv Detail & Related papers (2022-04-12T17:55:38Z)
- One-Shot Adaptation of GAN in Just One CLIP [51.188396199083336]
We present a novel single-shot GAN adaptation method through unified CLIP space manipulations.
Specifically, our model employs a two-step training strategy, the first step of which is a reference image search in the source generator using CLIP-guided latent optimization.
We show that our model generates diverse outputs with the target texture and outperforms the baseline models both qualitatively and quantitatively.
arXiv Detail & Related papers (2022-03-17T13:03:06Z)
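The AMU-Tuning entry above views CLIP-based few-shot classification through a logit-bias lens: frozen zero-shot CLIP logits plus a learned bias term. The sketch below illustrates only that general idea under stated assumptions; the LogitBiasClassifier module, the alpha scaling factor, and all dimensions are hypothetical and do not reproduce AMU-Tuning's actual method.

```python
# Hedged sketch of the general "logit bias" formulation for CLIP few-shot
# classification: final logits = frozen zero-shot CLIP logits + learned bias.
# Module names, dimensions, and the alpha scale are illustrative assumptions.
import torch
import torch.nn as nn


class LogitBiasClassifier(nn.Module):
    def __init__(self, feat_dim: int, num_classes: int, text_weights: torch.Tensor, alpha: float = 1.0):
        super().__init__()
        # text_weights: (num_classes, feat_dim) class embeddings from CLIP's text encoder (kept frozen).
        self.register_buffer("text_weights", text_weights)
        # Bias head learned from the few-shot training samples.
        self.bias_head = nn.Linear(feat_dim, num_classes)
        self.alpha = alpha

    def forward(self, image_feat: torch.Tensor) -> torch.Tensor:
        zero_shot_logits = image_feat @ self.text_weights.t()  # frozen CLIP similarity logits
        logit_bias = self.bias_head(image_feat)                # few-shot correction term
        return zero_shot_logits + self.alpha * logit_bias


if __name__ == "__main__":
    feat_dim, num_classes = 512, 10
    clip_text_weights = torch.randn(num_classes, feat_dim)  # stand-in for CLIP text embeddings
    model = LogitBiasClassifier(feat_dim, num_classes, clip_text_weights)
    logits = model(torch.randn(4, feat_dim))
    print(logits.shape)  # torch.Size([4, 10])
```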