CLIP-guided Prototype Modulating for Few-shot Action Recognition
- URL: http://arxiv.org/abs/2303.02982v1
- Date: Mon, 6 Mar 2023 09:17:47 GMT
- Title: CLIP-guided Prototype Modulating for Few-shot Action Recognition
- Authors: Xiang Wang, Shiwei Zhang, Jun Cen, Changxin Gao, Yingya Zhang, Deli
Zhao, Nong Sang
- Abstract summary: This work aims to transfer the powerful multimodal knowledge of CLIP to alleviate the inaccurate prototype estimation issue.
We present a CLIP-guided prototype modulating framework called CLIP-FSAR, which consists of a video-text contrastive objective and a prototype modulation.
- Score: 49.11385095278407
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning from large-scale contrastive language-image pre-training like CLIP
has shown remarkable success in a wide range of downstream tasks recently, but
it is still under-explored on the challenging few-shot action recognition
(FSAR) task. In this work, we aim to transfer the powerful multimodal knowledge
of CLIP to alleviate the inaccurate prototype estimation issue due to data
scarcity, which is a critical problem in low-shot regimes. To this end, we
present a CLIP-guided prototype modulating framework called CLIP-FSAR, which
consists of two key components: a video-text contrastive objective and a
prototype modulation. Specifically, the former bridges the task discrepancy
between CLIP and the few-shot video task by contrasting videos and
corresponding class text descriptions. The latter leverages the transferable
textual concepts from CLIP to adaptively refine visual prototypes with a
temporal Transformer. By this means, CLIP-FSAR can take full advantage of the
rich semantic priors in CLIP to obtain reliable prototypes and achieve accurate
few-shot classification. Extensive experiments on five commonly used benchmarks
demonstrate the effectiveness of our proposed method, and CLIP-FSAR
significantly outperforms existing state-of-the-art methods under various
settings. The source code and models will be publicly available at
https://github.com/alibaba-mmai-research/CLIP-FSAR.
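To make the two components above concrete, the following is a minimal PyTorch sketch, assuming per-frame features from a frozen CLIP image encoder and class-name embeddings from the CLIP text encoder are already extracted; the module names, dimensions, and token-level fusion are illustrative assumptions rather than the released implementation.
```python
# Illustrative sketch of a CLIP-FSAR-style pipeline (not the authors' code).
# Assumes frame features from a frozen CLIP image encoder and class-name
# features from the CLIP text encoder are given; shapes and names are made up.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PrototypeModulator(nn.Module):
    """Fuses CLIP text features with per-frame visual features via a
    temporal Transformer, then pools the frames into a class prototype."""

    def __init__(self, dim: int = 512, heads: int = 8, layers: int = 1):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, frame_feats: torch.Tensor, text_feats: torch.Tensor):
        # frame_feats: (n_class, n_frames, dim); text_feats: (n_class, dim)
        tokens = torch.cat([text_feats.unsqueeze(1), frame_feats], dim=1)
        fused = self.temporal(tokens)              # (n_class, 1 + n_frames, dim)
        return fused[:, 1:].mean(dim=1)            # one prototype per class


def video_text_contrastive_loss(video_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE between pooled video features and class text features."""
    v = F.normalize(video_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    logits = v @ t.t() / temperature               # (n_videos, n_classes)
    labels = torch.arange(len(v))
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))


if __name__ == "__main__":
    n_way, n_frames, dim = 5, 8, 512
    support_frames = torch.randn(n_way, n_frames, dim)   # one shot per class
    class_text = torch.randn(n_way, dim)
    query_frames = torch.randn(3, n_frames, dim)

    modulator = PrototypeModulator(dim)
    prototypes = modulator(support_frames, class_text)    # (n_way, dim)

    # Contrastive loss assumes one video per class in this toy episode.
    loss = video_text_contrastive_loss(support_frames.mean(1), class_text)
    query = F.normalize(query_frames.mean(1), dim=-1)
    scores = query @ F.normalize(prototypes, dim=-1).t()  # cosine similarity
    print(loss.item(), scores.argmax(dim=-1))
```
Prepending the text embedding as an extra token before the temporal Transformer is one simple way to let the textual concept condition the visual prototype; the paper's exact fusion scheme may differ.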
Related papers
- FLEX-CLIP: Feature-Level GEneration Network Enhanced CLIP for X-shot Cross-modal Retrieval [10.26297663751352]
Few-shot cross-modal retrieval (CMR) retrieves semantically similar instances in another modality within the target domain.
Vision-language pretraining methods like CLIP have shown strong few-shot and zero-shot learning performance.
To tackle the challenges of X-shot cross-modal retrieval, we propose FLEX-CLIP, a novel Feature-level Generation Network Enhanced CLIP.
arXiv Detail & Related papers (2024-11-26T14:12:14Z)
- CLIP-MoE: Towards Building Mixture of Experts for CLIP with Diversified Multiplet Upcycling [21.734200158914476]
Contrastive Language-Image Pre-training (CLIP) has become a cornerstone in multimodal intelligence.
Diversified Multiplet Upcycling (DMU) efficiently fine-tunes a series of CLIP models that capture different feature spaces.
Experiments demonstrate the strong performance of CLIP-MoE across various zero-shot retrieval and zero-shot image classification tasks.
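As a rough illustration of what a mixture of experts over CLIP-style encoders could look like, here is a generic top-k gating sketch; the stand-in experts, the gate, and the routing below are assumptions for demonstration and do not reproduce the Diversified Multiplet Upcycling procedure.
```python
# Generic sketch of routing a feature through several CLIP-style experts
# with a learned top-k gate; illustrative only, not the CLIP-MoE recipe.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CLIPMixtureOfExperts(nn.Module):
    def __init__(self, experts, dim=512, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(experts)   # stand-ins for fine-tuned CLIP towers
        self.gate = nn.Linear(dim, len(experts))
        self.top_k = top_k

    def forward(self, x):
        # x: (batch, dim) pre-extracted features standing in for raw images
        weights = F.softmax(self.gate(x), dim=-1)            # (batch, n_experts)
        topw, topi = weights.topk(self.top_k, dim=-1)
        topw = topw / topw.sum(dim=-1, keepdim=True)         # renormalise the kept weights
        out = torch.zeros_like(x)
        for rank in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = topi[:, rank] == e
                if mask.any():
                    out[mask] += topw[mask, rank].unsqueeze(-1) * expert(x[mask])
        return out


if __name__ == "__main__":
    dim = 512
    experts = [nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
               for _ in range(4)]                            # toy stand-ins for CLIP encoders
    moe = CLIPMixtureOfExperts(experts, dim=dim, top_k=2)
    print(moe(torch.randn(8, dim)).shape)                    # torch.Size([8, 512])
```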
arXiv Detail & Related papers (2024-09-28T09:28:51Z)
- C2P-CLIP: Injecting Category Common Prompt in CLIP to Enhance Generalization in Deepfake Detection [98.34703790782254]
We introduce Category Common Prompt CLIP (C2P-CLIP), which integrates a category common prompt into the text encoder to inject category-related concepts into the image encoder.
Our method achieves a 12.41% improvement in detection accuracy compared to the original CLIP, without introducing additional parameters during testing.
arXiv Detail & Related papers (2024-08-19T02:14:25Z)
- AMU-Tuning: Effective Logit Bias for CLIP-based Few-shot Learning [50.78033979438031]
We first introduce a unified formulation to analyze CLIP-based few-shot learning methods from the perspective of logit bias.
Based on an analysis of its key components, this paper proposes a novel AMU-Tuning method to learn an effective logit bias for CLIP-based few-shot classification.
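In the logit-bias view, a CLIP-based few-shot classifier can be written as the frozen zero-shot logits plus a learned bias term; the sketch below uses a hypothetical linear bias head over auxiliary features purely to illustrate that decomposition, not the AMU-Tuning architecture itself.
```python
# Sketch of the "logit bias" view of CLIP-based few-shot learning:
# final logits = frozen zero-shot CLIP logits + a learned bias term.
# The bias head below is an illustrative stand-in, not AMU-Tuning itself.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LogitBiasClassifier(nn.Module):
    def __init__(self, text_feats: torch.Tensor, aux_dim: int, alpha: float = 1.0):
        super().__init__()
        # text_feats: (n_classes, dim) CLIP text embeddings of the class names
        self.register_buffer("text_feats", F.normalize(text_feats, dim=-1))
        self.bias_head = nn.Linear(aux_dim, text_feats.size(0))  # learned bias term
        self.alpha = alpha                                        # bias scaling factor

    def forward(self, clip_img_feats, aux_feats):
        # Zero-shot logits from frozen CLIP features, plus the learned bias.
        zero_shot = F.normalize(clip_img_feats, dim=-1) @ self.text_feats.t()
        return zero_shot + self.alpha * self.bias_head(aux_feats)


if __name__ == "__main__":
    n_classes, clip_dim, aux_dim = 10, 512, 384
    model = LogitBiasClassifier(torch.randn(n_classes, clip_dim), aux_dim)
    logits = model(torch.randn(4, clip_dim), torch.randn(4, aux_dim))
    print(logits.shape)   # torch.Size([4, 10])
```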
arXiv Detail & Related papers (2024-04-13T10:46:11Z)
- VadCLIP: Adapting Vision-Language Models for Weakly Supervised Video Anomaly Detection [58.47940430618352]
We propose VadCLIP, a new paradigm for weakly supervised video anomaly detection (WSVAD).
VadCLIP makes full use of fine-grained associations between vision and language on the strength of CLIP.
We conduct extensive experiments on two commonly-used benchmarks, demonstrating that VadCLIP achieves the best performance on both coarse-grained and fine-grained WSVAD.
arXiv Detail & Related papers (2023-08-22T14:58:36Z)
- MA-FSAR: Multimodal Adaptation of CLIP for Few-Shot Action Recognition [41.78245303513613]
We introduce MA-FSAR, a framework that employs Parameter-Efficient Fine-Tuning (PEFT) techniques to enhance the CLIP visual encoder in terms of action-related temporal and semantic representations.
In addition to these token-level designs, we propose a prototype-level text-guided construction module to further enrich the temporal and semantic characteristics of video prototypes.
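As a reminder of what a PEFT insertion into a frozen CLIP block typically looks like, here is a generic zero-initialized bottleneck adapter; it is a hedged illustration of this technique class, not MA-FSAR's actual token-level or prototype-level design.
```python
# Generic bottleneck adapter of the kind PEFT methods attach to a frozen
# CLIP block; illustrative only, not MA-FSAR's actual token-level design.
import torch
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    def __init__(self, dim: int = 768, reduction: int = 4):
        super().__init__()
        hidden = dim // reduction
        self.down = nn.Linear(dim, hidden)   # down-projection
        self.up = nn.Linear(hidden, dim)     # up-projection
        self.act = nn.GELU()
        nn.init.zeros_(self.up.weight)       # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, n_tokens, dim) output of a frozen transformer block
        return tokens + self.up(self.act(self.down(tokens)))


if __name__ == "__main__":
    adapter = BottleneckAdapter(dim=768)
    x = torch.randn(2, 197, 768)             # e.g. ViT patch tokens + [CLS]
    print(adapter(x).shape)                  # torch.Size([2, 197, 768])
```
Only the small adapter parameters are trained, which is what makes this kind of fine-tuning parameter-efficient.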
arXiv Detail & Related papers (2023-08-03T04:17:25Z)
- Proto-CLIP: Vision-Language Prototypical Network for Few-Shot Learning [16.613744920566436]
Proto-CLIP is a framework for few-shot learning based on large-scale vision-language models such as CLIP.
Proto-CLIP adapts the image and text encoder embeddings from CLIP in a joint fashion using few-shot examples.
Proto-CLIP has both training-free and fine-tuned variants.
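One way to picture the training-free variant is a prototypical classifier that blends few-shot image prototypes with the class text embeddings; the blending weight and helper functions below are illustrative assumptions, not the paper's exact formulation.
```python
# Sketch of a training-free prototypical classifier that blends few-shot
# image prototypes with CLIP text embeddings; the mixing weight `beta`
# is an illustrative hyperparameter, not Proto-CLIP's exact formulation.
import torch
import torch.nn.functional as F


def blended_prototypes(support_feats, support_labels, text_feats, n_classes, beta=0.5):
    # support_feats: (n_support, dim) CLIP image embeddings of the labelled shots
    # text_feats:    (n_classes, dim) CLIP text embeddings of the class names
    protos = torch.zeros(n_classes, support_feats.size(1))
    for c in range(n_classes):
        protos[c] = support_feats[support_labels == c].mean(dim=0)
    protos = F.normalize(protos, dim=-1)
    text = F.normalize(text_feats, dim=-1)
    return F.normalize(beta * protos + (1 - beta) * text, dim=-1)


def classify(query_feats, prototypes):
    # Nearest prototype by cosine similarity.
    return (F.normalize(query_feats, dim=-1) @ prototypes.t()).argmax(dim=-1)


if __name__ == "__main__":
    n_way, k_shot, dim = 5, 4, 512
    support = torch.randn(n_way * k_shot, dim)
    labels = torch.arange(n_way).repeat_interleave(k_shot)
    protos = blended_prototypes(support, labels, torch.randn(n_way, dim), n_way)
    print(classify(torch.randn(3, dim), protos))
```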
arXiv Detail & Related papers (2023-07-06T15:41:53Z)
- Turning a CLIP Model into a Scene Text Detector [56.86413150091367]
Recently, pretraining approaches based on vision-language models have made effective progress in the field of text detection.
This paper proposes a new method, termed TCM, focusing on Turning the CLIP Model directly into a text detector without a pretraining process.
arXiv Detail & Related papers (2023-02-28T06:06:12Z)
- CLIPood: Generalizing CLIP to Out-of-Distributions [73.86353105017076]
Contrastive language-image pre-training (CLIP) models have shown impressive zero-shot ability, but further adaptation of CLIP on downstream tasks undesirably degrades out-of-distribution (OOD) performance.
We propose CLIPood, a fine-tuning method that can adapt CLIP models to OOD situations where both domain shifts and open classes may occur on unseen test data.
Experiments on diverse datasets with different OOD scenarios show that CLIPood consistently outperforms existing generalization techniques.
arXiv Detail & Related papers (2023-02-02T04:27:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.