Seeing in Flowing: Adapting CLIP for Action Recognition with Motion
Prompts Learning
- URL: http://arxiv.org/abs/2308.04828v1
- Date: Wed, 9 Aug 2023 09:33:45 GMT
- Title: Seeing in Flowing: Adapting CLIP for Action Recognition with Motion
Prompts Learning
- Authors: Qiang Wang, Junlong Du, Ke Yan, Shouhong Ding
- Abstract summary: Contrastive Language-Image Pre-training (CLIP) has recently shown remarkable generalization on "zero-shot" training.
We explore the adaptation of CLIP to achieve a more efficient and generalized action recognition method.
Our method outperforms most existing state-of-the-art methods by a significant margin on "few-shot" and "zero-shot" training.
- Score: 14.292812802621707
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contrastive Language-Image Pre-training (CLIP) has recently shown
remarkable generalization on "zero-shot" training and has been applied to many
downstream tasks. We explore the adaptation of CLIP to achieve a more efficient
and generalized action recognition method. We propose that the key lies in
explicitly modeling the motion cues flowing in video frames. To that end, we
design a two-stream motion modeling block to capture motion and spatial
information at the same time. The obtained motion cues are then utilized
to drive a dynamic prompts learner to generate motion-aware prompts, which
contain rich semantic information concerning human actions. In addition, we
propose a multimodal communication block to achieve collaborative learning
and further improve performance. We conduct extensive experiments on
HMDB-51, UCF-101, and Kinetics-400 datasets. Our method outperforms most
existing state-of-the-art methods by a significant margin on "few-shot" and
"zero-shot" training. We also achieve competitive performance on "closed-set"
training with very few trainable parameters and little additional computational
cost.
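As a rough illustration of the first two ideas described above (a two-stream motion modeling block and a motion-driven dynamic prompts learner), here is a minimal PyTorch sketch. The module names, the frame-difference motion cue, and the pooling choices are assumptions made for illustration; the paper's actual architecture is not reproduced here.

```python
import torch
import torch.nn as nn

class TwoStreamMotionBlock(nn.Module):
    """Hypothetical sketch: extract spatial and motion cues from per-frame
    CLIP visual features of shape (batch, time, dim)."""
    def __init__(self, dim):
        super().__init__()
        self.spatial_proj = nn.Linear(dim, dim)
        self.motion_proj = nn.Linear(dim, dim)

    def forward(self, frame_feats):                       # (B, T, D)
        spatial = self.spatial_proj(frame_feats)
        # Cheap stand-in for a motion cue: temporal differences between frames.
        diffs = frame_feats[:, 1:] - frame_feats[:, :-1]
        diffs = torch.cat([diffs, diffs[:, -1:]], dim=1)  # pad back to T steps
        motion = self.motion_proj(diffs)
        return spatial + motion, motion

class DynamicPromptLearner(nn.Module):
    """Hypothetical sketch: motion cues condition a bank of learnable prompt
    vectors, yielding per-video "motion-aware" prompts for the text branch."""
    def __init__(self, dim, n_prompts=8):
        super().__init__()
        self.base_prompts = nn.Parameter(torch.randn(n_prompts, dim) * 0.02)
        self.condition = nn.Linear(dim, dim)

    def forward(self, motion):                            # (B, T, D)
        ctx = self.condition(motion.mean(dim=1))          # (B, D) pooled motion context
        return self.base_prompts.unsqueeze(0) + ctx.unsqueeze(1)  # (B, P, D)

# Toy usage with random features standing in for frozen CLIP frame embeddings.
frames = torch.randn(2, 8, 512)                           # (batch=2, T=8, D=512)
fused, motion = TwoStreamMotionBlock(512)(frames)
prompts = DynamicPromptLearner(512)(motion)               # (2, 8 prompts, 512)
```

In the full method, such prompts would be combined with class-name tokens in CLIP's text encoder, and a multimodal communication block would further couple the two branches; both steps are omitted from this sketch.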
Related papers
- Text-Enhanced Zero-Shot Action Recognition: A training-free approach [13.074211474150914]
We propose Text-Enhanced Action Recognition (TEAR) for zero-shot video action recognition.
TEAR is training-free and does not require the availability of training data or extensive computational resources.
arXiv Detail & Related papers (2024-08-29T10:20:05Z) - The impact of Compositionality in Zero-shot Multi-label action recognition for Object-based tasks [4.971065912401385]
We propose Dual-VCLIP, a unified approach for zero-shot multi-label action recognition.
Dual-VCLIP enhances VCLIP, a zero-shot action recognition method, with the DualCoOp method for multi-label image classification.
We validate our method on the Charades dataset, which includes a majority of object-based actions.
arXiv Detail & Related papers (2024-05-14T15:28:48Z) - PRISE: LLM-Style Sequence Compression for Learning Temporal Action Abstractions in Control [55.81022882408587]
Temporal action abstractions, along with belief state representations, are powerful knowledge-sharing mechanisms for sequential decision making.
We propose a novel view that treats inducing temporal action abstractions as a sequence compression problem.
We introduce an approach that combines continuous action quantization with byte pair encoding to learn powerful action abstractions (a toy byte-pair merge step is sketched after this list).
arXiv Detail & Related papers (2024-02-16T04:55:09Z) - MoQuad: Motion-focused Quadruple Construction for Video Contrastive
Learning [10.41936704731324]
This paper presents a simple yet effective sample construction strategy to boost the learning of motion features in video contrastive learning.
The proposed method, dubbed Motion-focused Quadruple Construction (MoQuad), augments the instance discrimination by meticulously disturbing the appearance and motion of both the positive and negative samples.
Extensive experiments show that simply applying MoQuad to SimCLR achieves superior performance on downstream tasks compared to state-of-the-art methods.
arXiv Detail & Related papers (2022-12-21T09:26:40Z) - Temporal Contrastive Learning with Curriculum [19.442685015494316]
ConCur is a contrastive video representation learning method that uses curriculum learning to impose a dynamic sampling strategy.
We conduct experiments on two popular action recognition datasets, UCF101 and HMDB51, on which our proposed method achieves state-of-the-art performance.
arXiv Detail & Related papers (2022-09-02T00:12:05Z) - Motion Sensitive Contrastive Learning for Self-supervised Video
Representation [34.854431881562576]
Motion Sensitive Contrastive Learning (MSCL) injects the motion information captured by optical flows into RGB frames to strengthen feature learning.
It further develops Local Motion Contrastive Learning (LMCL), which applies frame-level contrastive objectives across the two modalities (a minimal cross-modal frame-level loss is sketched after this list).
It also introduces Flow Rotation Augmentation (FRA) to generate extra motion-shuffled negative samples and Motion Differential Sampling (MDS) to accurately screen training samples.
arXiv Detail & Related papers (2022-08-12T04:06:56Z) - Motion-Focused Contrastive Learning of Video Representations [94.93666741396444]
Motion, as the most distinct phenomenon in a video that involves changes over time, has been unique and critical to the development of video representation learning.
We present a Motion-focused Contrastive Learning (MCL) method that takes motion as its foundation, exploiting it for both data augmentation and feature learning.
arXiv Detail & Related papers (2022-01-11T16:15:45Z) - EAN: Event Adaptive Network for Enhanced Action Recognition [66.81780707955852]
We propose a unified action recognition framework to investigate the dynamic nature of video content.
First, when extracting local cues, we generate spatial-temporal kernels of dynamic scales to adaptively fit diverse events.
Second, to accurately aggregate these cues into a global video representation, we propose to mine the interactions only among a few selected foreground objects by a Transformer.
arXiv Detail & Related papers (2021-07-22T15:57:18Z) - Learning Comprehensive Motion Representation for Action Recognition [124.65403098534266]
2D CNN-based methods are efficient but may yield redundant features due to applying the same 2D convolution kernel to each frame.
Recent efforts attempt to capture motion information by establishing inter-frame connections, while still suffering from a limited temporal receptive field or high latency.
We propose a Channel-wise Motion Enhancement (CME) module to adaptively emphasize the channels related to dynamic information with a channel-wise gate vector (a minimal gating sketch appears after this list).
We also propose a Spatial-wise Motion Enhancement (SME) module to focus on the regions with the critical target in motion, according to the point-to-point similarity between adjacent feature maps.
arXiv Detail & Related papers (2021-03-23T03:06:26Z) - Relational Graph Learning on Visual and Kinematics Embeddings for
Accurate Gesture Recognition in Robotic Surgery [84.73764603474413]
We propose a novel online multi-modal graph network approach (MRG-Net) to dynamically integrate visual and kinematics information.
The effectiveness of our method is demonstrated with state-of-the-art results on the public JIGSAWS dataset.
arXiv Detail & Related papers (2020-11-03T11:00:10Z) - Hierarchical Contrastive Motion Learning for Video Action Recognition [100.9807616796383]
We present hierarchical contrastive motion learning, a new self-supervised learning framework to extract effective motion representations from raw video frames.
Our approach progressively learns a hierarchy of motion features that correspond to different abstraction levels in a network.
Our motion learning module is lightweight and flexible, and can be embedded into various backbone networks.
arXiv Detail & Related papers (2020-07-20T17:59:22Z)
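For the PRISE entry above, which combines continuous action quantization with byte pair encoding, the core mechanism can be illustrated in a few lines: discretize actions with a codebook, then repeatedly merge the most frequent adjacent token pair into a composite token. This is a toy sketch of a generic BPE merge step under assumed inputs, not PRISE's actual training procedure.

```python
from collections import Counter

def quantize(actions, codebook):
    """Assumed step: map each continuous action vector to its nearest codebook index."""
    def nearest(act):
        return min(range(len(codebook)),
                   key=lambda k: sum((a - c) ** 2 for a, c in zip(act, codebook[k])))
    return [nearest(act) for act in actions]

def bpe_merge_step(token_seqs):
    """One byte-pair-encoding merge: fuse the most frequent adjacent token pair
    into a single composite token, i.e. a longer temporal action abstraction."""
    pair_counts = Counter(p for seq in token_seqs for p in zip(seq, seq[1:]))
    if not pair_counts:
        return token_seqs, None
    best = pair_counts.most_common(1)[0][0]
    merged = []
    for seq in token_seqs:
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                out.append(best)        # the pair tuple itself acts as the new token
                i += 2
            else:
                out.append(seq[i])
                i += 1
        merged.append(out)
    return merged, best
```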
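For the MSCL entry, the frame-level contrastive objective across the RGB and flow modalities can be written as a standard symmetric InfoNCE loss over matching frames. A minimal sketch assuming paired per-frame features from the two modalities; the actual LMCL formulation may differ.

```python
import torch
import torch.nn.functional as F

def cross_modal_frame_loss(rgb_feats, flow_feats, temperature=0.07):
    """Symmetric InfoNCE over frames: row i of each tensor comes from the same
    frame, so matching indices are positives and all other frames are negatives."""
    rgb = F.normalize(rgb_feats, dim=-1)                  # (N, D)
    flow = F.normalize(flow_feats, dim=-1)                # (N, D)
    logits = rgb @ flow.t() / temperature                 # (N, N) cross-modal similarities
    targets = torch.arange(rgb.size(0), device=rgb.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```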
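Finally, for the channel-wise gate vector mentioned in the CME entry, a minimal sketch: estimate per-channel motion energy from adjacent-frame differences and use it to re-weight channels. The squeeze-and-excite-style gate is an assumption; the paper's exact module differs.

```python
import torch
import torch.nn as nn

class ChannelMotionGate(nn.Module):
    """Hypothetical CME-style gate: emphasize channels whose activations change
    most between adjacent frames."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                                 # x: (B, T, C, H, W)
        diff = x[:, 1:] - x[:, :-1]                       # adjacent-frame differences
        diff = torch.cat([diff, diff[:, -1:]], dim=1)     # pad back to T steps
        motion_energy = diff.abs().mean(dim=(3, 4))       # (B, T, C) per-channel motion
        gate = self.mlp(motion_energy)                    # (B, T, C) channel-wise gate vector
        return x * gate.unsqueeze(-1).unsqueeze(-1)       # re-weight channels by motion
```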
This list is automatically generated from the titles and abstracts of the papers on this site.