ActionCLIP: A New Paradigm for Video Action Recognition
- URL: http://arxiv.org/abs/2109.08472v1
- Date: Fri, 17 Sep 2021 11:21:34 GMT
- Title: ActionCLIP: A New Paradigm for Video Action Recognition
- Authors: Mengmeng Wang, Jiazheng Xing and Yong Liu
- Abstract summary: We provide a new perspective on action recognition by attaching importance to the semantic information of label texts.
We propose a new paradigm based on this multimodal learning framework for action recognition, which we dub "pre-train, prompt and fine-tune".
- Score: 14.961103794667341
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The canonical approach to video action recognition frames the problem as a
standard 1-of-N classification task: models are trained to predict a fixed set of
predefined categories, which limits their ability to transfer to new datasets with
unseen concepts. In this paper, we provide a new perspective on action recognition by
attaching importance to the semantic information of label texts rather than simply
mapping them to numbers. Specifically, we model the task as a video-text matching
problem within a multimodal learning framework, which strengthens the video
representation with additional semantic language supervision and enables our model to
perform zero-shot action recognition without any further labeled data or additional
parameters. Moreover, to handle the scarcity of label texts and to exploit the vast
amount of web data, we propose a new paradigm based on this multimodal learning
framework for action recognition, which we dub "pre-train, prompt and fine-tune". This
paradigm first learns powerful representations by pre-training on a large amount of
web image-text or video-text data. It then reformulates the action recognition task to
resemble the pre-training problem via prompt engineering. Finally, it fine-tunes
end-to-end on target datasets to obtain strong performance. We give an instantiation of
this paradigm, ActionCLIP, which not only has superior and flexible zero-shot/few-shot
transfer ability but also achieves top performance on general action recognition,
reaching 83.8% top-1 accuracy on Kinetics-400 with a ViT-B/16 backbone. Code is
available at https://github.com/sallymmx/ActionCLIP.git
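To make the video-text matching formulation concrete, the following is a minimal sketch of CLIP-style zero-shot action recognition: each label text is wrapped in a natural-language prompt, video frames are encoded and mean-pooled into a single video feature, and the prediction is the label whose text embedding is most similar. It assumes the OpenAI `clip` package and PyTorch; the prompt template, temporal pooling, and helper names are illustrative choices rather than ActionCLIP's exact components.

```python
# A minimal, hypothetical sketch of CLIP-style zero-shot action recognition via
# video-text matching. The prompt template, temporal mean pooling, and function
# names below are illustrative assumptions, not the exact ActionCLIP modules.
import torch
import clip  # OpenAI CLIP package: https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

action_labels = ["riding a bike", "playing guitar", "swimming"]
# Prompt engineering: wrap each label text in a natural-language template so the
# classification task looks like the image-text matching used in pre-training.
prompts = [f"a video of a person {label}" for label in action_labels]
text_tokens = clip.tokenize(prompts).to(device)

def classify_video(frames):
    """frames: a list of PIL images sampled from one video (hypothetical input)."""
    images = torch.stack([preprocess(f) for f in frames]).to(device)
    with torch.no_grad():
        frame_feats = model.encode_image(images)            # (T, D) per-frame features
        video_feat = frame_feats.mean(dim=0, keepdim=True)  # (1, D) temporal mean pooling
        text_feats = model.encode_text(text_tokens)         # (N, D) label embeddings
        video_feat = video_feat / video_feat.norm(dim=-1, keepdim=True)
        text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
        sims = (100.0 * video_feat @ text_feats.T).softmax(dim=-1)  # video-text similarity
    return action_labels[sims.argmax().item()]
```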
Related papers
- Spatio-Temporal Context Prompting for Zero-Shot Action Detection [13.22912547389941]
We propose a method which can effectively leverage the rich knowledge of visual-language models to perform Person-Context Interaction.
To address the challenge of recognizing distinct actions by multiple people at the same timestamp, we design the Interest Token Spotting mechanism.
Our method achieves superior results compared to previous approaches and can be further extended to multi-action videos.
arXiv Detail & Related papers (2024-08-28T17:59:05Z)
- Multi-View Video-Based Learning: Leveraging Weak Labels for Frame-Level Perception [1.5741307755393597]
We propose a novel learning framework to train a video-based action recognition model with weak labels for frame-level perception.
For training the model using the weak labels, we propose a novel latent loss function.
We also propose a model that uses the view-specific latent embeddings for downstream frame-level action recognition and detection tasks.
arXiv Detail & Related papers (2024-03-18T09:47:41Z)
- Any-point Trajectory Modeling for Policy Learning [64.23861308947852]
We introduce Any-point Trajectory Modeling (ATM) to predict future trajectories of arbitrary points within a video frame.
ATM outperforms strong video pre-training baselines by 80% on average.
We show effective transfer learning of manipulation skills from human videos and videos from a different robot morphology.
arXiv Detail & Related papers (2023-12-28T23:34:43Z)
- Early Action Recognition with Action Prototypes [62.826125870298306]
We propose a novel model that learns a prototypical representation of the full action for each class.
We decompose the video into short clips, where a visual encoder extracts features from each clip independently.
A decoder then aggregates the features from all clips in an online fashion for the final class prediction.
arXiv Detail & Related papers (2023-12-11T18:31:13Z)
- Helping Hands: An Object-Aware Ego-Centric Video Recognition Model [60.350851196619296]
We introduce an object-aware decoder for improving the performance of ego-centric representations on ego-centric videos.
We show that the model can act as a drop-in replacement for an ego-centric video model, improving performance through visual-text grounding.
arXiv Detail & Related papers (2023-08-15T17:58:11Z)
- Multi-Modal Few-Shot Temporal Action Detection [157.96194484236483]
Few-shot (FS) and zero-shot (ZS) learning are two different approaches for scaling temporal action detection to new classes.
We introduce a new multi-modality few-shot (MMFS) TAD problem, which can be considered a marriage of FS-TAD and ZS-TAD.
arXiv Detail & Related papers (2022-11-27T18:13:05Z)
- Generative Action Description Prompts for Skeleton-based Action Recognition [15.38417530693649]
We propose a Generative Action-description Prompts (GAP) approach for skeleton-based action recognition.
We employ a pre-trained large-scale language model as the knowledge engine to automatically generate text descriptions for the body-part movements of actions.
Our proposed GAP method achieves noticeable improvements over various baseline models without extra cost at inference.
arXiv Detail & Related papers (2022-08-10T12:55:56Z)
- Frozen CLIP Models are Efficient Video Learners [86.73871814176795]
Video recognition has been dominated by the end-to-end learning paradigm.
Recent advances in Contrastive Vision-Language Pre-training pave the way for a new route for visual recognition tasks.
We present Efficient Video Learning -- an efficient framework for directly training high-quality video recognition models.
arXiv Detail & Related papers (2022-08-06T17:38:25Z)
- Revisiting Classifier: Transferring Vision-Language Models for Video Recognition [102.93524173258487]
Transferring knowledge from task-agnostic pre-trained deep models for downstream tasks is an important topic in computer vision research.
In this study, we focus on transferring knowledge for video classification tasks.
We utilize a well-pretrained language model to generate good semantic targets for efficient transfer learning.
arXiv Detail & Related papers (2022-07-04T10:00:47Z)
- Prompting Visual-Language Models for Efficient Video Understanding [28.754997650215486]
This paper presents a simple method to efficiently adapt one pre-trained visual-language model to novel tasks with minimal training.
To bridge the gap between static images and videos, temporal information is encoded with lightweight Transformers stacked on top of frame-wise visual features (a minimal sketch of this idea follows the list below).
arXiv Detail & Related papers (2021-12-08T18:58:16Z)
- Learning Spatiotemporal Features via Video and Text Pair Discrimination [30.64670449131973]
The cross-modal pair (CPD) framework captures the correlation between a video and its associated text.
We train our CPD models on both a standard video dataset (Kinetics-210k) and an uncurated web video dataset (-300k) to demonstrate its effectiveness.
arXiv Detail & Related papers (2020-01-16T08:28:57Z)
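As referenced above, here is a rough sketch of the lightweight temporal Transformer idea described in "Prompting Visual-Language Models for Efficient Video Understanding": a small Transformer encoder is stacked on top of frame-wise visual features to model temporal information. The class name, feature dimension, layer counts, and mean pooling are assumptions for illustration, not the paper's exact implementation.

```python
# Illustrative sketch (not the paper's exact code): a lightweight Transformer
# encoder stacked on top of frame-wise visual features to inject temporal
# information. All dimensions and names here are assumptions.
import torch
import torch.nn as nn

class TemporalTransformer(nn.Module):
    def __init__(self, feat_dim: int = 512, num_layers: int = 2, num_heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, feat_dim) from a frozen image encoder.
        temporal_feats = self.encoder(frame_feats)   # contextualize frames over time
        return temporal_feats.mean(dim=1)            # pooled video-level feature

# Example: 8 frames of 512-d image features for a batch of 2 videos.
video_feat = TemporalTransformer()(torch.randn(2, 8, 512))
print(video_feat.shape)  # torch.Size([2, 512])
```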