CLIP-guided Prototype Modulating for Few-shot Action Recognition
- URL: http://arxiv.org/abs/2303.02982v1
- Date: Mon, 6 Mar 2023 09:17:47 GMT
- Title: CLIP-guided Prototype Modulating for Few-shot Action Recognition
- Authors: Xiang Wang, Shiwei Zhang, Jun Cen, Changxin Gao, Yingya Zhang, Deli
Zhao, Nong Sang
- Abstract summary: This work aims to transfer the powerful multimodal knowledge of CLIP to alleviate the inaccurate prototype estimation issue.
We present a CLIP-guided prototype modulating framework called CLIP-FSAR, which consists of a video-text contrastive objective and a prototype modulation.
- Score: 49.11385095278407
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning from large-scale contrastive language-image pre-training like CLIP
has shown remarkable success in a wide range of downstream tasks recently, but
it is still under-explored on the challenging few-shot action recognition
(FSAR) task. In this work, we aim to transfer the powerful multimodal knowledge
of CLIP to alleviate the inaccurate prototype estimation issue due to data
scarcity, which is a critical problem in low-shot regimes. To this end, we
present a CLIP-guided prototype modulating framework called CLIP-FSAR, which
consists of two key components: a video-text contrastive objective and a
prototype modulation. Specifically, the former bridges the task discrepancy
between CLIP and the few-shot video task by contrasting videos and
corresponding class text descriptions. The latter leverages the transferable
textual concepts from CLIP to adaptively refine visual prototypes with a
temporal Transformer. By this means, CLIP-FSAR can take full advantage of the
rich semantic priors in CLIP to obtain reliable prototypes and achieve accurate
few-shot classification. Extensive experiments on five commonly used benchmarks
demonstrate the effectiveness of our proposed method, and CLIP-FSAR
significantly outperforms existing state-of-the-art methods under various
settings. The source code and models will be publicly available at
https://github.com/alibaba-mmai-research/CLIP-FSAR.
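The prototype-modulation idea in the abstract (refining visual support prototypes with CLIP text features via a temporal Transformer) can be sketched as follows. This is a minimal illustration, not the authors' implementation; the class name, feature dimensions, and the choice of prepending the text embedding as an extra token are assumptions.

```python
# Hypothetical sketch of CLIP-FSAR-style prototype modulation.
# Shapes and design details are assumptions, not the paper's code.
import torch
import torch.nn as nn

class PrototypeModulator(nn.Module):
    """Refine frame-level visual prototypes with CLIP class-text
    embeddings using a small temporal Transformer encoder."""
    def __init__(self, dim=512, heads=8, layers=1):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)

    def forward(self, visual_proto, text_feat):
        # visual_proto: (n_way, n_frames, dim) frame-level support prototypes
        # text_feat:    (n_way, dim) CLIP text embeddings of class names
        tokens = torch.cat([text_feat.unsqueeze(1), visual_proto], dim=1)
        refined = self.encoder(tokens)
        # Drop the text token; keep the text-modulated frame prototypes.
        return refined[:, 1:, :]

n_way, n_frames, dim = 5, 8, 512
mod = PrototypeModulator(dim=dim)
protos = mod(torch.randn(n_way, n_frames, dim), torch.randn(n_way, dim))
print(tuple(protos.shape))
```

Under this sketch, query videos would then be matched against the refined prototypes with a temporal alignment metric, while the video-text contrastive objective would be a separate InfoNCE-style loss between video features and class-text embeddings.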
Related papers
- AMU-Tuning: Effective Logit Bias for CLIP-based Few-shot Learning [50.78033979438031]
We first introduce a unified formulation to analyze CLIP-based few-shot learning methods from a perspective of logit bias.
Based on analysis of key components, this paper proposes a novel AMU-Tuning method to learn effective logit bias for CLIP-based few-shot classification.
arXiv Detail & Related papers (2024-04-13T10:46:11Z)
- Semantic Residual Prompts for Continual Learning [21.986800282078498]
We show that our method significantly outperforms both state-of-the-art CL approaches and the zero-shot CLIP test.
Our findings hold true even for datasets with a substantial domain gap w.r.t. the pre-training knowledge of the backbone model.
arXiv Detail & Related papers (2024-03-11T16:23:38Z)
- Enhancing Few-shot CLIP with Semantic-Aware Fine-Tuning [61.902254546858465]
Methods based on Contrastive Language-Image Pre-training have exhibited promising performance in few-shot adaptation tasks.
We propose fine-tuning the parameters of the attention pooling layer during the training process to encourage the model to focus on task-specific semantics.
arXiv Detail & Related papers (2023-11-08T05:18:57Z)
- Prototypical Contrastive Learning-based CLIP Fine-tuning for Object Re-identification [13.090873217313732]
This work aims to adapt large-scale pre-trained vision-language models, such as contrastive language-image pretraining (CLIP), to enhance the performance of object re-identification (Re-ID).
We first analyze the role of prompt learning in CLIP-ReID and identify its limitations.
Our approach directly fine-tunes the image encoder of CLIP using a prototypical contrastive learning (PCL) loss, eliminating the need for prompt learning.
arXiv Detail & Related papers (2023-10-26T08:12:53Z)
- VadCLIP: Adapting Vision-Language Models for Weakly Supervised Video Anomaly Detection [58.47940430618352]
We propose VadCLIP, a new paradigm for weakly supervised video anomaly detection (WSVAD).
VadCLIP makes full use of fine-grained associations between vision and language on the strength of CLIP.
We conduct extensive experiments on two commonly-used benchmarks, demonstrating that VadCLIP achieves the best performance on both coarse-grained and fine-grained WSVAD.
arXiv Detail & Related papers (2023-08-22T14:58:36Z)
- Multimodal Adaptation of CLIP for Few-Shot Action Recognition [42.88862774719768]
This paper proposes a novel method called Multimodal Adaptation of CLIP (MA-CLIP) to address these issues.
The adapters we design can combine information from video-text sources for task-oriented temporal modeling.
Our MA-CLIP is plug-and-play and can be used with any few-shot action recognition temporal alignment metric.
arXiv Detail & Related papers (2023-08-03T04:17:25Z)
- Proto-CLIP: Vision-Language Prototypical Network for Few-Shot Learning [16.613744920566436]
Proto-CLIP is a framework for few-shot learning based on large-scale vision-language models such as CLIP.
Proto-CLIP adapts the image and text encoder embeddings from CLIP in a joint fashion using few-shot examples.
Proto-CLIP has both training-free and fine-tuned variants.
arXiv Detail & Related papers (2023-07-06T15:41:53Z)
- Turning a CLIP Model into a Scene Text Detector [56.86413150091367]
Recently, pretraining approaches based on vision-language models have made effective progress in the field of text detection.
This paper proposes a new method, termed TCM, focused on Turning the CLIP Model directly into a text detector without a pretraining process.
arXiv Detail & Related papers (2023-02-28T06:06:12Z)
- CLIPood: Generalizing CLIP to Out-of-Distributions [73.86353105017076]
Contrastive language-image pre-training (CLIP) models have shown impressive zero-shot ability, but the further adaptation of CLIP on downstream tasks undesirably degrades OOD performances.
We propose CLIPood, a fine-tuning method that can adapt CLIP models to OOD situations where both domain shifts and open classes may occur on unseen test data.
Experiments on diverse datasets with different OOD scenarios show that CLIPood consistently outperforms existing generalization techniques.
arXiv Detail & Related papers (2023-02-02T04:27:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.