ActionHub: A Large-scale Action Video Description Dataset for Zero-shot
Action Recognition
- URL: http://arxiv.org/abs/2401.11654v1
- Date: Mon, 22 Jan 2024 02:21:26 GMT
- Title: ActionHub: A Large-scale Action Video Description Dataset for Zero-shot
Action Recognition
- Authors: Jiaming Zhou, Junwei Liang, Kun-Yu Lin, Jinrui Yang, Wei-Shi Zheng
- Abstract summary: Zero-shot action recognition (ZSAR) aims to learn an alignment model between videos and class descriptions of seen actions that is transferable to unseen actions.
We propose a novel Cross-modality and Cross-action Modeling (CoCo) framework for ZSAR, which consists of a Dual Cross-modality Alignment module and a Cross-action Invariance Mining module.
- Score: 35.08592533014102
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Zero-shot action recognition (ZSAR) aims to learn an alignment model between
videos and class descriptions of seen actions that is transferable to unseen
actions. The text queries (class descriptions) used in existing ZSAR works,
however, are often short action names that fail to capture the rich semantics
in the videos, leading to misalignment. With the intuition that video content
descriptions (e.g., video captions) can provide rich contextual information of
visual concepts in videos, we propose to utilize human annotated video
descriptions to enrich the semantics of the class descriptions of each action.
However, all existing action video description datasets are limited in terms of
the number of actions, the semantics of video descriptions, etc. To this end,
we collect a large-scale action video descriptions dataset named ActionHub,
which covers a total of 1,211 common actions and provides 3.6 million action
video descriptions. With the proposed ActionHub dataset, we further propose a
novel Cross-modality and Cross-action Modeling (CoCo) framework for ZSAR, which
consists of a Dual Cross-modality Alignment module and a Cross-action
Invariance Mining module. Specifically, the Dual Cross-modality Alignment
module utilizes both action labels and video descriptions from ActionHub to
obtain rich class semantic features for feature alignment. The Cross-action
Invariance Mining module exploits a cycle-reconstruction process between the
class semantic feature spaces of seen actions and unseen actions, aiming to
guide the model to learn cross-action invariant representations. Extensive
experimental results demonstrate that our CoCo framework significantly
outperforms the state-of-the-art on three popular ZSAR benchmarks (i.e.,
Kinetics-ZSAR, UCF101 and HMDB51) under two different learning protocols in
ZSAR. We will release our code, models, and the proposed ActionHub dataset.
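The abstract names the two CoCo modules but gives no implementation detail, so the following is only a minimal PyTorch sketch of how they might be wired together: a dual alignment loss that matches video features against both action-name and ActionHub-description semantics, and a cycle-reconstruction term between the seen- and unseen-class semantic spaces. All shapes, projection heads, temperatures, and the exact reconstruction loss are assumptions for illustration, not the authors' released code.

```python
# Hedged sketch of the two CoCo losses described above (not the authors' code).
import torch
import torch.nn.functional as F
from torch import nn


class CoCoSketch(nn.Module):
    def __init__(self, dim_video=512, dim_text=512, dim_joint=256):
        super().__init__()
        self.video_proj = nn.Linear(dim_video, dim_joint)
        self.name_proj = nn.Linear(dim_text, dim_joint)  # action-name branch
        self.desc_proj = nn.Linear(dim_text, dim_joint)  # video-description branch

    def dual_alignment_loss(self, v, name_feat, desc_feat, labels, tau=0.07):
        """Align each video with BOTH the name and the description semantics of its class."""
        v = F.normalize(self.video_proj(v), dim=-1)
        n = F.normalize(self.name_proj(name_feat), dim=-1)
        d = F.normalize(self.desc_proj(desc_feat), dim=-1)
        return F.cross_entropy(v @ n.t() / tau, labels) + F.cross_entropy(v @ d.t() / tau, labels)

    def cycle_reconstruction_loss(self, v, seen_sem, unseen_sem, tau=0.07):
        """Express a video in the seen-class semantic space, re-express that through the
        unseen-class space, and ask the round trip to land back near the original feature."""
        v = F.normalize(self.video_proj(v), dim=-1)
        seen = F.normalize(self.name_proj(seen_sem), dim=-1)
        unseen = F.normalize(self.name_proj(unseen_sem), dim=-1)
        recon_seen = (v @ seen.t() / tau).softmax(-1) @ seen                  # video -> seen space -> feature
        recon_unseen = (recon_seen @ unseen.t() / tau).softmax(-1) @ unseen   # -> unseen space -> feature
        return F.mse_loss(recon_unseen, v)


if __name__ == "__main__":
    model = CoCoSketch()
    videos = torch.randn(8, 512)    # video features from a frozen backbone (assumed)
    names = torch.randn(400, 512)   # seen-class action-name embeddings
    descs = torch.randn(400, 512)   # per-class ActionHub description embeddings (aggregated)
    unseen = torch.randn(160, 512)  # unseen-class name embeddings
    labels = torch.randint(0, 400, (8,))
    loss = model.dual_alignment_loss(videos, names, descs, labels) \
         + model.cycle_reconstruction_loss(videos, names, unseen)
    print(float(loss))
```

At test time, zero-shot classification would still amount to matching a video feature against the class semantics of the unseen actions; in this sketch the cycle term only shapes the shared embedding space during training.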
Related papers
- FCA-RAC: First Cycle Annotated Repetitive Action Counting [30.253568218869237]
We propose a framework called First Cycle Annotated Repetitive Action Counting (FCA-RAC).
FCA-RAC comprises four parts; the first is a labeling technique that annotates each training video with the start and end of the first action cycle, along with the total action count.
This technique enables the model to capture the correlation between the initial action cycle and subsequent actions.
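As a rough illustration of what such a first-cycle annotation might carry, here is a minimal Python sketch; the field names and units below are assumptions for illustration, not the dataset's actual schema.

```python
# Hypothetical first-cycle annotation record (field names are assumptions).
from dataclasses import dataclass


@dataclass
class FirstCycleAnnotation:
    video_id: str
    first_cycle_start: float  # seconds where the first action cycle begins
    first_cycle_end: float    # seconds where the first action cycle ends
    total_count: int          # total number of repetitions in the whole video

    def first_cycle_length(self) -> float:
        """Duration of the annotated first cycle, usable as a prior for later cycles."""
        return self.first_cycle_end - self.first_cycle_start


ann = FirstCycleAnnotation(video_id="vid_0001", first_cycle_start=1.2,
                           first_cycle_end=2.9, total_count=14)
print(ann.first_cycle_length())  # ~1.7
```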
arXiv Detail & Related papers (2024-06-18T01:12:43Z)
- Early Action Recognition with Action Prototypes [62.826125870298306]
We propose a novel model that learns a prototypical representation of the full action for each class.
We decompose the video into short clips, where a visual encoder extracts features from each clip independently.
A decoder then aggregates the clip features in an online fashion for the final class prediction.
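A minimal sketch of the clip-wise encode-then-aggregate idea described above, compared against learnable full-action prototypes; the GRU aggregator, layer sizes, and stand-in clip encoder are assumptions, not the paper's architecture.

```python
# Hedged sketch: encode clips independently, aggregate online, match to class prototypes.
import torch
import torch.nn.functional as F
from torch import nn


class OnlinePrototypeClassifier(nn.Module):
    def __init__(self, num_classes=51, feat_dim=256, clip_dim=512):
        super().__init__()
        self.clip_encoder = nn.Linear(clip_dim, feat_dim)  # stand-in visual encoder
        self.aggregator = nn.GRU(feat_dim, feat_dim, batch_first=True)
        self.prototypes = nn.Parameter(torch.randn(num_classes, feat_dim))  # one per class

    def forward(self, clips):
        # clips: (batch, clips_observed_so_far, clip_dim)
        feats = self.clip_encoder(clips)           # each clip encoded independently
        agg, _ = self.aggregator(feats)            # causal, online aggregation
        partial = F.normalize(agg[:, -1], dim=-1)  # state after the latest clip
        protos = F.normalize(self.prototypes, dim=-1)
        return partial @ protos.t()                # similarity to full-action prototypes


model = OnlinePrototypeClassifier()
logits = model(torch.randn(2, 4, 512))  # a prediction after observing only 4 clips
print(logits.shape)                     # torch.Size([2, 51])
```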
arXiv Detail & Related papers (2023-12-11T18:31:13Z)
- Paxion: Patching Action Knowledge in Video-Language Foundation Models [112.92853632161604]
Action knowledge involves the understanding of textual, visual, and temporal aspects of actions.
Despite their impressive performance on various benchmark tasks, recent video-language models exhibit a surprising deficiency (near-random performance) in action knowledge.
We propose a novel framework, Paxion, along with a new Discriminative Video Dynamics Modeling (DVDM) objective.
arXiv Detail & Related papers (2023-05-18T03:53:59Z)
- Video-Specific Query-Key Attention Modeling for Weakly-Supervised Temporal Action Localization [14.43055117008746]
Weakly-supervised temporal action localization aims to identify and localize action instances in untrimmed videos using only video-level action labels.
We propose a network named VQK-Net with a video-specific query-key attention modeling that learns a unique query for each action category of each input video.
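A minimal sketch of "a unique query for each action category of each input video": shared class queries are modulated by a video-level feature before query-key attention over snippet features. The conditioning scheme and layer choices are assumptions, not the VQK-Net implementation.

```python
# Hedged sketch of video-specific class queries attending over snippet features.
import torch
from torch import nn


class VideoSpecificQueryAttention(nn.Module):
    def __init__(self, num_classes=20, dim=256):
        super().__init__()
        self.class_queries = nn.Embedding(num_classes, dim)  # shared base queries
        self.video_conditioner = nn.Linear(dim, dim)         # makes them video-specific
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, snippets):
        # snippets: (batch, time, dim) snippet features of an untrimmed video
        video_feat = snippets.mean(dim=1, keepdim=True)       # (B, 1, D)
        queries = self.class_queries.weight.unsqueeze(0) \
                  + self.video_conditioner(video_feat)        # (B, C, D)
        out, attn = self.attn(queries, snippets, snippets)    # query-key attention
        return out, attn  # attn (B, C, T) indicates where each class responds in time


model = VideoSpecificQueryAttention()
out, attn = model(torch.randn(2, 100, 256))
print(out.shape, attn.shape)  # torch.Size([2, 20, 256]) torch.Size([2, 20, 100])
```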
arXiv Detail & Related papers (2023-05-07T04:18:22Z)
- Learning to Refactor Action and Co-occurrence Features for Temporal Action Localization [74.74339878286935]
In video snippets, action features are entangled with co-occurrence features, and the latter often dominate the actual action content.
We develop a novel auxiliary task by decoupling these two types of features within a video snippet.
We term our method RefactorNet, which first explicitly factorizes the action content and regularizes its co-occurrence features.
arXiv Detail & Related papers (2022-06-23T06:30:08Z)
- Bridge-Prompt: Towards Ordinal Action Understanding in Instructional Videos [92.18898962396042]
We propose a prompt-based framework, Bridge-Prompt, to model the semantics across adjacent actions.
We reformulate the individual action labels as integrated text prompts for supervision, which bridge the gap between individual action semantics.
Bridge-Prompt (Br-Prompt) achieves state-of-the-art results on multiple benchmarks.
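As a rough illustration of reformulating a sequence of step labels into integrated ordinal text prompts, here is a small sketch; the prompt templates below are invented for illustration and are not the ones used by Bridge-Prompt.

```python
# Hypothetical ordinal prompt construction (templates are assumptions).
from typing import List


def _ordinal(i: int) -> str:
    words = ["first", "second", "third", "fourth", "fifth"]
    return words[i] if i < len(words) else f"{i + 1}th"


def ordinal_prompts(step_labels: List[str]) -> List[str]:
    """One prompt per step plus an integrated prompt covering the whole sequence."""
    prompts = [f"This is the {_ordinal(i)} step: the person is {label}."
               for i, label in enumerate(step_labels)]
    prompts.append("The person is " + ", then ".join(step_labels) + ".")
    return prompts


print(ordinal_prompts(["cracking the egg", "whisking the mixture", "pouring it into the pan"]))
```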
arXiv Detail & Related papers (2022-03-26T15:52:27Z)
- Rich Action-semantic Consistent Knowledge for Early Action Prediction [20.866206453146898]
Early action prediction (EAP) aims to recognize human actions from a part of action execution in ongoing videos.
We partition the original partial or full videos to form a new series of partial videos at arbitrary progress levels.
A novel Rich Action-semantic Consistent Knowledge network (RACK) under the teacher-student framework is proposed for EAP.
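A minimal sketch of forming a series of partial videos at arbitrary progress levels from a full frame sequence; the progress levels and truncation rule are assumptions for illustration.

```python
# Hedged sketch: truncate a frame sequence at several progress levels.
import torch


def partial_videos(frames: torch.Tensor, progress_levels=(0.1, 0.3, 0.5, 0.7, 1.0)):
    """frames: (T, C, H, W). Returns one truncated copy per progress level."""
    total = frames.shape[0]
    clips = []
    for p in progress_levels:
        keep = max(1, int(round(p * total)))  # keep only the first p-fraction of frames
        clips.append(frames[:keep])
    return clips


clips = partial_videos(torch.randn(32, 3, 112, 112))
print([c.shape[0] for c in clips])  # [3, 10, 16, 22, 32]
```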
arXiv Detail & Related papers (2022-01-23T03:39:31Z)
- COMPOSER: Compositional Learning of Group Activity in Videos [33.526331969279106]
Group Activity Recognition (GAR) detects the activity performed by a group of actors in a short video clip.
We propose COMPOSER, a Multiscale Transformer based architecture that performs attention-based reasoning over tokens at each scale.
COMPOSER achieves a new state-of-the-art 94.5% accuracy with the keypoint-only modality.
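A rough sketch of attention-based reasoning over tokens at two scales (joints within a person, then persons within the group); the scale hierarchy, pooling, and layer sizes are assumptions, not the COMPOSER architecture.

```python
# Hedged sketch of multiscale token reasoning for group activity recognition.
import torch
from torch import nn


class MultiscaleReasoner(nn.Module):
    def __init__(self, dim=128, num_activities=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.joint_scale = nn.TransformerEncoder(layer, num_layers=1)   # reasoning over joints
        self.person_scale = nn.TransformerEncoder(layer, num_layers=1)  # reasoning over persons
        self.head = nn.Linear(dim, num_activities)

    def forward(self, keypoint_tokens):
        # keypoint_tokens: (batch, persons, joints, dim), e.g. embedded 2D keypoints
        b, p, j, d = keypoint_tokens.shape
        joints = self.joint_scale(keypoint_tokens.reshape(b * p, j, d))
        persons = joints.mean(dim=1).reshape(b, p, d)   # pool joints into person tokens
        group = self.person_scale(persons).mean(dim=1)  # pool persons into a group token
        return self.head(group)                         # group activity logits


model = MultiscaleReasoner()
print(model(torch.randn(2, 12, 17, 128)).shape)  # torch.Size([2, 8])
```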
arXiv Detail & Related papers (2021-12-11T01:25:46Z)
- Elaborative Rehearsal for Zero-shot Action Recognition [36.84404523161848]
ZSAR aims to recognize target (unseen) actions without training examples.
It remains challenging to semantically represent action classes and transfer knowledge from seen data.
We propose an ER-enhanced ZSAR model inspired by an effective human memory technique Elaborative Rehearsal.
arXiv Detail & Related papers (2021-08-05T20:02:46Z)
- Compositional Video Synthesis with Action Graphs [112.94651460161992]
Videos of actions are complex signals containing rich compositional structure in space and time.
We propose to represent the actions in a graph structure called Action Graph and present the new "Action Graph To Video" synthesis task.
Our generative model for this task (AG2Vid) disentangles motion and appearance features and, by incorporating a scheduling mechanism for actions, facilitates timely and coordinated video generation.
arXiv Detail & Related papers (2020-06-27T09:39:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.