Multimodal Adaptation of CLIP for Few-Shot Action Recognition
- URL: http://arxiv.org/abs/2308.01532v1
- Date: Thu, 3 Aug 2023 04:17:25 GMT
- Title: Multimodal Adaptation of CLIP for Few-Shot Action Recognition
- Authors: Jiazheng Xing, Mengmeng Wang, Xiaojun Hou, Guang Dai, Jingdong Wang,
Yong Liu
- Abstract summary: This paper proposes a novel method called Multimodal Adaptation of CLIP (MA-CLIP) for few-shot action recognition.
The adapters we design can combine information from video-text sources for task-oriented spatiotemporal modeling.
Our MA-CLIP is plug-and-play and can be used with any few-shot action recognition temporal alignment metric.
- Score: 42.88862774719768
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Applying large-scale pre-trained visual models like CLIP to few-shot action
recognition tasks can benefit performance and efficiency. Utilizing the
"pre-training, fine-tuning" paradigm makes it possible to avoid training a
network from scratch, which can be time-consuming and resource-intensive.
However, this method has two drawbacks. First, limited labeled samples for
few-shot action recognition necessitate minimizing the number of tunable
parameters to mitigate over-fitting, also leading to inadequate fine-tuning
that increases resource consumption and may disrupt the generalized
representation of models. Second, the video's extra temporal dimension
challenges effective temporal modeling for few-shot recognition, while
pre-trained visual models are usually image models. This paper proposes a novel
method called Multimodal Adaptation of CLIP (MA-CLIP) to address these issues.
It adapts CLIP for few-shot action recognition by adding lightweight adapters,
which can minimize the number of learnable parameters and enable the model to
transfer across different tasks quickly. The adapters we design can combine
information from video-text multimodal sources for task-oriented spatiotemporal
modeling, which is fast, efficient, and has low training costs. Additionally,
based on the attention mechanism, we design a text-guided prototype
construction module that can fully utilize video-text information to enhance
the representation of video prototypes. Our MA-CLIP is plug-and-play and can
be used with any few-shot action recognition temporal alignment metric.
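To make the adapter idea concrete, below is a minimal PyTorch sketch of a lightweight bottleneck adapter wrapped around a frozen CLIP transformer block, so only a small number of parameters are trained. The module names, bottleneck width, and residual placement are illustrative assumptions, not MA-CLIP's exact design.

```python
# Sketch of adapter-based tuning of a frozen backbone block.
# Assumption: `clip_block` is any nn.Module from a ViT-style CLIP encoder.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Lightweight bottleneck adapter: down-project, nonlinearity, up-project."""
    def __init__(self, dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, dim)
        self.act = nn.GELU()
        nn.init.zeros_(self.up.weight)  # start near-identity so the frozen
        nn.init.zeros_(self.up.bias)    # CLIP representation is preserved

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps pre-trained features intact at initialization.
        return x + self.up(self.act(self.down(x)))


class AdaptedBlock(nn.Module):
    """Wraps a frozen CLIP transformer block; only the adapter is trainable."""
    def __init__(self, clip_block: nn.Module, dim: int):
        super().__init__()
        self.block = clip_block
        for p in self.block.parameters():
            p.requires_grad = False  # freeze the pre-trained weights
        self.adapter = Adapter(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.adapter(self.block(x))
```

Because the adapter starts as a near-identity mapping and the backbone stays frozen, the number of learnable parameters remains small, which matches the paper's goal of mitigating over-fitting on few labeled samples.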
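The text-guided prototype construction can likewise be sketched as cross-attention from a class's CLIP text embedding (query) to its support-video features (keys/values). The shapes, the fusion with the mean visual prototype, and the module name below are assumptions for illustration, not the paper's exact module.

```python
# Sketch of attention-based, text-guided prototype construction.
import torch
import torch.nn as nn

class TextGuidedPrototype(nn.Module):
    """Builds class prototypes by attending from text embeddings to support-video features."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, text_emb: torch.Tensor, support_feats: torch.Tensor) -> torch.Tensor:
        # text_emb:      (num_classes, dim)              one CLIP text embedding per class
        # support_feats: (num_classes, num_tokens, dim)  frame/shot features of support videos
        query = text_emb.unsqueeze(1)                    # (C, 1, dim)
        attended, _ = self.attn(query, support_feats, support_feats)
        # Combine text-attended evidence with the plain mean visual prototype.
        return attended.squeeze(1) + support_feats.mean(dim=1)   # (C, dim)


# Example usage with dummy tensors (5-way, 1-shot, 8 frames, CLIP dim 512):
proto_builder = TextGuidedPrototype(dim=512)
prototypes = proto_builder(torch.randn(5, 512), torch.randn(5, 8, 512))  # (5, 512)
```

Query videos can then be matched against these prototypes with whichever temporal alignment metric the chosen few-shot method uses, which is where the plug-and-play claim applies.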
Related papers
- VP Lab: a PEFT-Enabled Visual Prompting Laboratory for Semantic Segmentation [18.680875997611025]
VP Lab is a comprehensive iterative framework that enhances visual prompting for robust segmentation model development.
E-PEFT is a novel ensemble of parameter-efficient fine-tuning techniques designed to adapt our visual prompting pipeline to specific domains.
By integrating E-PEFT with visual prompting, we demonstrate a remarkable 50% increase in semantic segmentation mIoU performance across various technical datasets.
arXiv Detail & Related papers (2025-05-21T14:46:57Z) - Semi-supervised Semantic Segmentation with Multi-Constraint Consistency Learning [81.02648336552421]
We propose a Multi-Constraint Consistency Learning approach to facilitate the staged enhancement of the encoder and decoder.
Self-adaptive feature masking and noise injection are designed in an instance-specific manner to perturb the features for robust learning of the decoder.
Experimental results on Pascal VOC2012 and Cityscapes datasets demonstrate that our proposed MCCL achieves new state-of-the-art performance.
arXiv Detail & Related papers (2025-03-23T03:21:33Z) - Efficient Transfer Learning for Video-language Foundation Models [13.166348605993292]
We propose a parameter-efficient Multi-modal Spatio-Temporal Adapter (MSTA) to enhance alignment between textual and visual representations.
We evaluate the effectiveness of our approach across four tasks: zero-shot transfer, few-shot learning, base-to-novel generalization, and fully-supervised learning.
arXiv Detail & Related papers (2024-11-18T01:25:58Z) - EMMA: Efficient Visual Alignment in Multi-Modal LLMs [56.03417732498859]
EMMA is a lightweight cross-modality module designed to efficiently fuse visual and textual encodings.
EMMA boosts performance across multiple tasks by up to 9.3% while significantly improving robustness against hallucinations.
arXiv Detail & Related papers (2024-10-02T23:00:31Z) - Bringing Masked Autoencoders Explicit Contrastive Properties for Point Cloud Self-Supervised Learning [116.75939193785143]
Contrastive learning (CL) for Vision Transformers (ViTs) in image domains has achieved performance comparable to CL for traditional convolutional backbones.
In 3D point cloud pretraining with ViTs, masked autoencoder (MAE) modeling remains dominant.
arXiv Detail & Related papers (2024-07-08T12:28:56Z) - FineCLIPER: Multi-modal Fine-grained CLIP for Dynamic Facial Expression Recognition with AdaptERs [5.35588281968644]
We propose a novel framework, named Multi-modal Fine-grained CLIP for Dynamic Facial Expression Recognition with AdaptERs (FineCLIPER).
Our FineCLIPER achieves SOTA performance on the DFEW, FERV39k, and MAFW datasets with few tunable parameters.
arXiv Detail & Related papers (2024-07-02T10:55:43Z) - M2-CLIP: A Multimodal, Multi-task Adapting Framework for Video Action
Recognition [39.92547393649842]
We introduce a novel Multimodal, Multi-task CLIP adapting framework named M2-CLIP to address these challenges.
We demonstrate exceptional performance in supervised learning while maintaining strong generalization in zero-shot scenarios.
arXiv Detail & Related papers (2024-01-22T02:03:31Z) - Concept-Guided Prompt Learning for Generalization in Vision-Language
Models [33.361744437967126]
We propose Concept-Guided Prompt Learning for vision-language models.
We leverage the well-learned knowledge of Contrastive Language-Image Pretraining to create a visual concept cache.
In order to refine the text features, we develop a projector that transforms multi-level visual features into text features.
arXiv Detail & Related papers (2024-01-15T04:04:47Z) - Exploiting Modality-Specific Features For Multi-Modal Manipulation
Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z) - Learning Visual Representation from Modality-Shared Contrastive
Language-Image Pre-training [88.80694147730883]
We investigate a variety of Modality-Shared Contrastive Language-Image Pre-training (MS-CLIP) frameworks.
In studied conditions, we observe that a mostly unified encoder for vision and language signals outperforms all other variations that separate more parameters.
Our approach outperforms vanilla CLIP by 1.6 points in linear probing on a collection of 24 downstream vision tasks.
arXiv Detail & Related papers (2022-07-26T05:19:16Z) - Transformer-based Context Condensation for Boosting Feature Pyramids in
Object Detection [77.50110439560152]
Current object detectors typically have a feature pyramid (FP) module for multi-level feature fusion (MFF).
We propose a novel and efficient context modeling mechanism that can help existing FPs deliver better MFF results.
In particular, we introduce a novel insight that comprehensive contexts can be decomposed and condensed into two types of representations for higher efficiency.
arXiv Detail & Related papers (2022-07-14T01:45:03Z)