Multimodal Adaptation of CLIP for Few-Shot Action Recognition
- URL: http://arxiv.org/abs/2308.01532v1
- Date: Thu, 3 Aug 2023 04:17:25 GMT
- Title: Multimodal Adaptation of CLIP for Few-Shot Action Recognition
- Authors: Jiazheng Xing, Mengmeng Wang, Xiaojun Hou, Guang Dai, Jingdong Wang,
Yong Liu
- Abstract summary: This paper proposes a novel method called Multimodal Adaptation of CLIP (MA-CLIP) to address the drawbacks of adapting large pre-trained image models to few-shot action recognition.
The adapters we design combine information from video-text multimodal sources for task-oriented spatiotemporal modeling.
Our MA-CLIP is plug-and-play and can be used with any few-shot action recognition temporal alignment metric.
- Score: 42.88862774719768
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Applying large-scale pre-trained visual models like CLIP to few-shot action
recognition tasks can benefit performance and efficiency. Utilizing the
"pre-training, fine-tuning" paradigm makes it possible to avoid training a
network from scratch, which can be time-consuming and resource-intensive.
However, this method has two drawbacks. First, the limited labeled samples
available for few-shot action recognition make it necessary to minimize the
number of tunable parameters to mitigate over-fitting, while fine-tuning more
of the model instead increases resource consumption and may disrupt its
generalized representations. Second, the extra temporal dimension of videos
makes effective temporal modeling difficult in the few-shot setting, since
pre-trained visual models are usually image models. This paper proposes a novel
method called Multimodal Adaptation of CLIP (MA-CLIP) to address these issues.
It adapts CLIP for few-shot action recognition by adding lightweight adapters,
which can minimize the number of learnable parameters and enable the model to
transfer across different tasks quickly. The adapters we design can combine
information from video-text multimodal sources for task-oriented spatiotemporal
modeling, which is fast, efficient, and has low training costs. Additionally,
based on the attention mechanism, we design a text-guided prototype
construction module that can fully utilize video-text information to enhance
the representation of video prototypes. Our MA-CLIP is plug-and-play and can
be used with any few-shot action recognition temporal alignment metric.
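To make the adapter idea concrete, below is a minimal PyTorch sketch of a
generic residual bottleneck adapter attached to a frozen backbone. It
illustrates the general technique only: the class name, reduction factor, and
freezing helper are assumptions of this sketch, and MA-CLIP's actual adapters
additionally mix video and text information.

import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    # Residual bottleneck adapter: down-project, non-linearity, up-project.
    # Hypothetical illustration of the generic technique, not MA-CLIP's module.
    def __init__(self, dim: int, reduction: int = 4):
        super().__init__()
        self.down = nn.Linear(dim, dim // reduction)
        self.act = nn.GELU()
        self.up = nn.Linear(dim // reduction, dim)
        # Zero-init the up-projection so the adapter starts as an identity
        # mapping and the frozen pre-trained representation is preserved.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))

def freeze_all_but_adapters(model: nn.Module) -> None:
    # Train only parameters whose name marks them as adapter weights.
    for name, param in model.named_parameters():
        param.requires_grad = "adapter" in name

Because only the small down- and up-projections are trainable, the tunable
parameter count stays a tiny fraction of the backbone's, which is the property
the abstract credits for mitigating over-fitting on few-shot data.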
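The text-guided prototype construction can likewise be sketched as
cross-attention in which each class's CLIP text embedding queries the frame
features of its support videos. The tensor shapes and module layout below are
assumptions of this sketch, not the paper's exact design.

import torch
import torch.nn as nn

class TextGuidedPrototype(nn.Module):
    # Build class prototypes by letting text embeddings attend over
    # support-video frame features (illustrative, assumed shapes).
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, text_emb: torch.Tensor, frame_feats: torch.Tensor) -> torch.Tensor:
        # text_emb:    (classes, dim)          one CLIP text embedding per class
        # frame_feats: (classes, frames, dim)  support-video frame features
        query = text_emb.unsqueeze(1)                    # (classes, 1, dim)
        proto, _ = self.attn(query, frame_feats, frame_feats)
        return proto.squeeze(1)                          # (classes, dim)

Query videos can then be scored against these prototypes with whichever
temporal alignment metric the practitioner prefers, consistent with the
plug-and-play claim above.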
Related papers
- Exploring the Transferability of Visual Prompting for Multimodal Large Language Models [47.162575147632396]
Transferable Visual Prompting (TVP) is a simple and effective approach to generate visual prompts that can transfer to different models and improve their performance on downstream tasks after being trained on only one model.
We introduce two strategies to address the issue of cross-model feature corruption of existing visual prompting methods and enhance the transferability of the learned prompts.
arXiv Detail & Related papers (2024-04-17T09:39:07Z) - Meta-Adapter: An Online Few-shot Learner for Vision-Language Model [64.21017759533474]
Contrastive vision-language pre-training, known as CLIP, demonstrates remarkable potential in perceiving open-world visual concepts.
Few-shot learning methods based on CLIP typically require offline fine-tuning of the parameters on few-shot samples.
We propose the Meta-Adapter, a lightweight residual-style adapter, to refine the CLIP features guided by the few-shot samples in an online manner.
arXiv Detail & Related papers (2023-11-07T07:27:16Z) - Unsupervised Prototype Adapter for Vision-Language Models [29.516767588241724]
We design an unsupervised fine-tuning approach for vision-language models called Unsupervised Prototype Adapter (UP-Adapter).
Specifically, for the unannotated target datasets, we leverage the text-image aligning capability of CLIP to automatically select the most confident samples for each class.
After fine-tuning, the prototype model prediction is combined with the original CLIP's prediction by a residual connection to perform downstream recognition tasks.
arXiv Detail & Related papers (2023-08-22T15:28:49Z) - Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring [82.84513669453744]
Image-text pretrained models, e.g., CLIP, have shown impressive general multi-modal knowledge learned from large-scale image-text data pairs.
We revisit temporal modeling in the context of image-to-video knowledge transferring.
We present a simple and effective temporal modeling mechanism that extends the CLIP model to diverse video tasks.
arXiv Detail & Related papers (2023-01-26T14:12:02Z) - Multi-Modal Few-Shot Temporal Action Detection [157.96194484236483]
Few-shot (FS) and zero-shot (ZS) learning are two different approaches for scaling temporal action detection to new classes.
We introduce a new multi-modality few-shot (MMFS) TAD problem, which can be considered a marriage of FS-TAD and ZS-TAD.
arXiv Detail & Related papers (2022-11-27T18:13:05Z) - Parameter-Efficient Image-to-Video Transfer Learning [66.82811235484607]
Large pre-trained models for various downstream tasks of interest have recently emerged with promising performance.
Due to the ever-growing model size, the standard full fine-tuning based task adaptation strategy becomes costly in terms of model training and storage.
We propose a new Spatio-Temporal Adapter (ST-Adapter) for parameter-efficient fine-tuning per video task.
arXiv Detail & Related papers (2022-06-27T18:02:29Z) - Depth Guided Adaptive Meta-Fusion Network for Few-shot Video Recognition [86.31412529187243]
Few-shot video recognition aims at learning new actions with only very few labeled samples.
We propose a depth-guided Adaptive Meta-Fusion Network for few-shot video recognition, termed AMeFu-Net.
arXiv Detail & Related papers (2020-10-20T03:06:20Z)