M2-CLIP: A Multimodal, Multi-task Adapting Framework for Video Action
Recognition
- URL: http://arxiv.org/abs/2401.11649v1
- Date: Mon, 22 Jan 2024 02:03:31 GMT
- Title: M2-CLIP: A Multimodal, Multi-task Adapting Framework for Video Action
Recognition
- Authors: Mengmeng Wang, Jiazheng Xing, Boyuan Jiang, Jun Chen, Jianbiao Mei,
Xingxing Zuo, Guang Dai, Jingdong Wang, Yong Liu
- Abstract summary: We introduce a novel Multimodal, Multi-task CLIP adapting framework named M2-CLIP to address these challenges.
We demonstrate exceptional performance in supervised learning while maintaining strong generalization in zero-shot scenarios.
- Score: 39.92547393649842
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, the rise of large-scale vision-language pretrained models like
CLIP, coupled with Parameter-Efficient Fine-Tuning (PEFT) techniques, has
attracted substantial attention in video action recognition. Nevertheless,
prevailing approaches tend to prioritize strong supervised performance at the
expense of the models' generalization capabilities during transfer.
In this paper, we introduce a novel Multimodal, Multi-task CLIP
adapting framework named M2-CLIP to address these challenges, preserving both
high supervised performance and robust transferability. Firstly, to enhance the
individual modality architectures, we introduce multimodal adapters to both the
visual and text branches. Specifically, we design a novel visual TED-Adapter
that performs global Temporal Enhancement and local temporal Difference
modeling to improve the temporal representation capabilities of the visual
encoder. Moreover, we adopt text encoder adapters to strengthen the learning of
semantic label information. Secondly, we design a multi-task decoder with a
rich set of supervisory signals to adeptly satisfy the need for strong
supervised performance and generalization within a multimodal framework.
Experimental results validate the efficacy of our approach, demonstrating
exceptional performance in supervised learning while maintaining strong
generalization in zero-shot scenarios.
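To make the TED-Adapter idea more concrete, below is a minimal, hypothetical PyTorch sketch of a bottleneck adapter that combines a global temporal-enhancement path (a clip-level context shared across frames) with a local temporal-difference path (frame-to-frame feature changes). The class name, tensor shapes, and bottleneck design are illustrative assumptions based only on the abstract, not the authors' released implementation.

```python
# Hypothetical sketch of a TED-style adapter (global Temporal Enhancement +
# local temporal Difference) for per-frame features of a frozen CLIP visual
# encoder. All names and shapes are assumptions made for illustration.
import torch
import torch.nn as nn


class TEDAdapter(nn.Module):
    """Bottleneck adapter adding temporal modeling to frame-level features.

    Expects input of shape (B, T, N, D): batch, frames, tokens, channels.
    """

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()
        # Depth-wise 1D convolution over the time axis for local
        # temporal-difference features.
        self.diff_conv = nn.Conv1d(bottleneck, bottleneck, kernel_size=3,
                                   padding=1, groups=bottleneck)
        self.scale = nn.Parameter(torch.zeros(1))  # start as identity mapping

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, N, D = x.shape
        h = self.act(self.down(x))                      # (B, T, N, b)

        # Global temporal enhancement: a clip-level context shared by all frames.
        global_ctx = h.mean(dim=1, keepdim=True)        # (B, 1, N, b)

        # Local temporal difference: frame-to-frame feature changes.
        diff = h[:, 1:] - h[:, :-1]                     # (B, T-1, N, b)
        diff = torch.cat([torch.zeros_like(h[:, :1]), diff], dim=1)
        diff = diff.permute(0, 2, 3, 1).reshape(B * N, -1, T)
        diff = self.diff_conv(diff).reshape(B, N, -1, T).permute(0, 3, 1, 2)

        out = self.up(h + global_ctx + diff)            # fuse and project back
        return x + self.scale * out                     # residual around frozen path


# Usage example with assumed ViT-B/16 token layout: 2 clips, 8 frames,
# 197 tokens per frame, channel width 768.
frames = torch.randn(2, 8, 197, 768)
adapter = TEDAdapter(dim=768)
print(adapter(frames).shape)  # torch.Size([2, 8, 197, 768])
```

In a PEFT setting, such a module would typically be inserted alongside each block of the frozen CLIP visual encoder, with only the adapter parameters (and, per the abstract, the text-encoder adapters and the multi-task decoder) being trained.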
Related papers
- Multi-layer Learnable Attention Mask for Multimodal Tasks [2.378535917357144]
The Learnable Attention Mask (LAM) is strategically designed to globally regulate attention maps and prioritize critical tokens.
LAM adeptly captures associations between tokens in BERT-like transformer networks.
Comprehensive experimental validation is conducted on various datasets, such as MADv2, QVHighlights, ImageNet 1K, and MSRVTT.
arXiv Detail & Related papers (2024-06-04T20:28:02Z)
- Exploring the Transferability of Visual Prompting for Multimodal Large Language Models [47.162575147632396]
Transferable Visual Prompting (TVP) is a simple and effective approach to generate visual prompts that can transfer to different models and improve their performance on downstream tasks after being trained on only one model.
We introduce two strategies to address the issue of cross-model feature corruption of existing visual prompting methods and enhance the transferability of the learned prompts.
arXiv Detail & Related papers (2024-04-17T09:39:07Z)
- u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model [17.3535277338312]
u-LLaVA is an innovative unifying multi-task framework that integrates pixel, regional, and global features to refine the perceptual faculties of MLLMs.
This work contributes a novel mask-based multi-task dataset comprising 277K samples, crafted to challenge and assess the fine-grained perception capabilities of MLLMs.
arXiv Detail & Related papers (2023-11-09T13:18:27Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
- MA-FSAR: Multimodal Adaptation of CLIP for Few-Shot Action Recognition [41.78245303513613]
We introduce MA-FSAR, a framework that employs the Parameter-Efficient Fine-Tuning (PEFT) technique to enhance the CLIP visual encoder in terms of action-related temporal and semantic representations.
In addition to these token-level designs, we propose a prototype-level text-guided construction module to further enrich the temporal and semantic characteristics of video prototypes.
arXiv Detail & Related papers (2023-08-03T04:17:25Z)
- MuDPT: Multi-modal Deep-symphysis Prompt Tuning for Large Pre-trained Vision-Language Models [12.397136690734865]
We propose a novel approach called Multi-modal Deep-symphysis Prompt Tuning, dubbed MuDPT.
MuDPT extends independent multi-modal prompt tuning by learning a model-agnostic transformative network to allow deep hierarchical bi-directional prompt fusion.
Compared with the state-of-the-art methods, MuDPT achieves better recognition and generalization ability by a clear margin.
arXiv Detail & Related papers (2023-06-20T09:15:52Z)
- Effective Adaptation in Multi-Task Co-Training for Unified Autonomous Driving [103.745551954983]
In this paper, we investigate the transfer performance of various types of self-supervised methods, including MoCo and SimCLR, on three downstream tasks.
We find that their performances are sub-optimal or even lag far behind the single-task baseline.
We propose a simple yet effective pretrain-adapt-finetune paradigm for general multi-task training.
arXiv Detail & Related papers (2022-09-19T12:15:31Z)
- MulT: An End-to-End Multitask Learning Transformer [66.52419626048115]
We propose an end-to-end Multitask Learning Transformer framework, named MulT, to simultaneously learn multiple high-level vision tasks.
Our framework encodes the input image into a shared representation and makes predictions for each vision task using task-specific transformer-based decoder heads.
arXiv Detail & Related papers (2022-05-17T13:03:18Z)
- i-Code: An Integrative and Composable Multimodal Learning Framework [99.56065789066027]
i-Code is a self-supervised pretraining framework where users may flexibly combine the modalities of vision, speech, and language into unified and general-purpose vector representations.
The entire system is pretrained end-to-end with new objectives including masked modality unit modeling and cross-modality contrastive learning.
Experimental results demonstrate how i-Code can outperform state-of-the-art techniques on five video understanding tasks and the GLUE NLP benchmark, improving by as much as 11%.
arXiv Detail & Related papers (2022-05-03T23:38:50Z)