CLIP-guided Prototype Modulating for Few-shot Action Recognition
- URL: http://arxiv.org/abs/2303.02982v1
- Date: Mon, 6 Mar 2023 09:17:47 GMT
- Title: CLIP-guided Prototype Modulating for Few-shot Action Recognition
- Authors: Xiang Wang, Shiwei Zhang, Jun Cen, Changxin Gao, Yingya Zhang, Deli
Zhao, Nong Sang
- Abstract summary: This work aims to transfer the powerful multimodal knowledge of CLIP to alleviate the inaccurate prototype estimation issue.
We present a CLIP-guided prototype modulating framework called CLIP-FSAR, which consists of a video-text contrastive objective and a prototype modulation.
- Score: 49.11385095278407
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning from large-scale contrastive language-image pre-training like CLIP
has shown remarkable success in a wide range of downstream tasks recently, but
it is still under-explored on the challenging few-shot action recognition
(FSAR) task. In this work, we aim to transfer the powerful multimodal knowledge
of CLIP to alleviate the inaccurate prototype estimation issue due to data
scarcity, which is a critical problem in low-shot regimes. To this end, we
present a CLIP-guided prototype modulating framework called CLIP-FSAR, which
consists of two key components: a video-text contrastive objective and a
prototype modulation. Specifically, the former bridges the task discrepancy
between CLIP and the few-shot video task by contrasting videos and
corresponding class text descriptions. The latter leverages the transferable
textual concepts from CLIP to adaptively refine visual prototypes with a
temporal Transformer. By this means, CLIP-FSAR can take full advantage of the
rich semantic priors in CLIP to obtain reliable prototypes and achieve accurate
few-shot classification. Extensive experiments on five commonly used benchmarks
demonstrate the effectiveness of our proposed method, and CLIP-FSAR
significantly outperforms existing state-of-the-art methods under various
settings. The source code and models will be publicly available at
https://github.com/alibaba-mmai-research/CLIP-FSAR.
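The prototype-modulation idea in the abstract (refining visual support prototypes with CLIP text features via a temporal Transformer) can be sketched as follows. This is a minimal illustration, not the authors' implementation; the class name, feature dimensions, and the choice of prepending the text embedding as an extra token are assumptions.

```python
# Hypothetical sketch of CLIP-FSAR-style prototype modulation.
# Shapes and design details are assumptions, not the paper's code.
import torch
import torch.nn as nn

class PrototypeModulator(nn.Module):
    """Refine frame-level visual prototypes with CLIP class-text
    embeddings using a small temporal Transformer encoder."""
    def __init__(self, dim=512, heads=8, layers=1):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)

    def forward(self, visual_proto, text_feat):
        # visual_proto: (n_way, n_frames, dim) frame-level support prototypes
        # text_feat:    (n_way, dim) CLIP text embeddings of class names
        tokens = torch.cat([text_feat.unsqueeze(1), visual_proto], dim=1)
        refined = self.encoder(tokens)
        # Drop the text token; keep the text-modulated frame prototypes.
        return refined[:, 1:, :]

n_way, n_frames, dim = 5, 8, 512
mod = PrototypeModulator(dim=dim)
protos = mod(torch.randn(n_way, n_frames, dim), torch.randn(n_way, dim))
print(tuple(protos.shape))
```

Under this sketch, query videos would then be matched against the refined prototypes with a temporal alignment metric, while the video-text contrastive objective would be a separate InfoNCE-style loss between video features and class-text embeddings.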
Related papers
- AMU-Tuning: Effective Logit Bias for CLIP-based Few-shot Learning [50.78033979438031]
We first introduce a unified formulation to analyze CLIP-based few-shot learning methods from a perspective of logit bias.
Based on analysis of key components, this paper proposes a novel AMU-Tuning method to learn effective logit bias for CLIP-based few-shot classification.
arXiv Detail & Related papers (2024-04-13T10:46:11Z)
- Semantic Residual Prompts for Continual Learning [21.986800282078498]
We show that our method significantly outperforms both state-of-the-art CL approaches and the zero-shot CLIP test.
Our findings hold true even for datasets with a substantial domain gap w.r.t. the pre-training knowledge of the backbone model.
arXiv Detail & Related papers (2024-03-11T16:23:38Z)
- Enhancing Few-shot CLIP with Semantic-Aware Fine-Tuning [61.902254546858465]
Methods based on Contrastive Language-Image Pre-training have exhibited promising performance in few-shot adaptation tasks.
We propose fine-tuning the parameters of the attention pooling layer during the training process to encourage the model to focus on task-specific semantics.
arXiv Detail & Related papers (2023-11-08T05:18:57Z)
- Prototypical Contrastive Learning-based CLIP Fine-tuning for Object Re-identification [13.090873217313732]
This work aims to adapt large-scale pre-trained vision-language models, such as contrastive language-image pretraining (CLIP), to enhance the performance of object re-identification (Re-ID).
We first analyze the role of prompt learning in CLIP-ReID and identify its limitations.
Our approach directly fine-tunes the image encoder of CLIP using a prototypical contrastive learning (PCL) loss, eliminating the need for prompt learning.
arXiv Detail & Related papers (2023-10-26T08:12:53Z)
- VadCLIP: Adapting Vision-Language Models for Weakly Supervised Video Anomaly Detection [58.47940430618352]
We propose VadCLIP, a new paradigm for weakly supervised video anomaly detection (WSVAD).
VadCLIP makes full use of fine-grained associations between vision and language on the strength of CLIP.
We conduct extensive experiments on two commonly-used benchmarks, demonstrating that VadCLIP achieves the best performance on both coarse-grained and fine-grained WSVAD.
arXiv Detail & Related papers (2023-08-22T14:58:36Z)
- Multimodal Adaptation of CLIP for Few-Shot Action Recognition [42.88862774719768]
This paper proposes a novel method called Multimodal Adaptation of CLIP (MA-CLIP) to address these issues.
The adapters we design can combine information from video-text sources for task-oriented temporal modeling.
Our MA-CLIP is plug-and-play and can be used with any few-shot action recognition temporal alignment metric.
arXiv Detail & Related papers (2023-08-03T04:17:25Z)
- Proto-CLIP: Vision-Language Prototypical Network for Few-Shot Learning [16.613744920566436]
Proto-CLIP is a framework for few-shot learning based on large-scale vision-language models such as CLIP.
Proto-CLIP adapts the image and text encoder embeddings from CLIP in a joint fashion using few-shot examples.
Proto-CLIP has both training-free and fine-tuned variants.
arXiv Detail & Related papers (2023-07-06T15:41:53Z)
- Turning a CLIP Model into a Scene Text Detector [56.86413150091367]
Recently, pretraining approaches based on vision-language models have made effective progress in the field of text detection.
This paper proposes a new method, termed TCM, focused on Turning the CLIP Model directly into a text detector without a pretraining process.
arXiv Detail & Related papers (2023-02-28T06:06:12Z)
- CLIPood: Generalizing CLIP to Out-of-Distributions [73.86353105017076]
Contrastive language-image pre-training (CLIP) models have shown impressive zero-shot ability, but the further adaptation of CLIP on downstream tasks undesirably degrades OOD performances.
We propose CLIPood, a fine-tuning method that can adapt CLIP models to OOD situations where both domain shifts and open classes may occur on unseen test data.
Experiments on diverse datasets with different OOD scenarios show that CLIPood consistently outperforms existing generalization techniques.
arXiv Detail & Related papers (2023-02-02T04:27:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.