Intra- and Inter-Action Understanding via Temporal Action Parsing
- URL: http://arxiv.org/abs/2005.10229v1
- Date: Wed, 20 May 2020 17:45:18 GMT
- Title: Intra- and Inter-Action Understanding via Temporal Action Parsing
- Authors: Dian Shao, Yue Zhao, Bo Dai and Dahua Lin
- Abstract summary: We construct a new dataset of sport videos with manual annotations of sub-actions, and conduct a study of temporal action parsing on top of it.
Our study shows that a sport activity usually consists of multiple sub-actions and that awareness of such temporal structures is beneficial to action recognition.
We also investigate a number of temporal parsing methods and, building on them, devise an improved method capable of mining sub-actions from training data without knowing their labels.
- Score: 118.32912239230272
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current methods for action recognition primarily rely on deep convolutional
networks to derive feature embeddings of visual and motion features. While
these methods have demonstrated remarkable performance on standard benchmarks,
we are still in need of a better understanding as to how the videos, in
particular their internal structures, relate to high-level semantics, which may
lead to benefits in multiple aspects, e.g. interpretable predictions and even
new methods that can take recognition performance to the next level. Towards
this goal, we construct TAPOS, a new dataset developed on sport videos with
manual annotations of sub-actions, and conduct a study on temporal action
parsing on top. Our study shows that a sport activity usually consists of
multiple sub-actions and that the awareness of such temporal structures is
beneficial to action recognition. We also investigate a number of temporal
parsing methods and, building on them, devise an improved method capable of
mining sub-actions from training data without knowing their labels. On
the constructed TAPOS, the proposed method is shown to reveal intra-action
information, i.e. how action instances are composed of sub-actions, and
inter-action information, i.e. that one specific sub-action may commonly appear
in various actions.
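To make the task concrete: temporal action parsing means splitting an untrimmed action instance into contiguous sub-action segments. Below is a minimal, hypothetical sketch (not the paper's method) that places a sub-action boundary wherever consecutive per-frame features change sharply; the function name, threshold, and toy features are all illustrative assumptions.

```python
import numpy as np

def parse_sub_actions(features: np.ndarray, threshold: float = 1.0):
    """Split a sequence of per-frame features into contiguous sub-action
    segments, placing a boundary wherever consecutive frames differ by
    more than `threshold` (Euclidean distance). Returns (start, end) pairs."""
    diffs = np.linalg.norm(np.diff(features, axis=0), axis=1)
    boundaries = np.flatnonzero(diffs > threshold) + 1
    edges = [0, *boundaries.tolist(), len(features)]
    return list(zip(edges[:-1], edges[1:]))

# Toy example: three "sub-actions" with distinct feature levels.
feats = np.concatenate([
    np.zeros((5, 2)),        # sub-action A: frames 0-4
    np.full((4, 2), 3.0),    # sub-action B: frames 5-8
    np.full((6, 2), 6.0),    # sub-action C: frames 9-14
])
print(parse_sub_actions(feats))  # → [(0, 5), (5, 9), (9, 15)]
```

Real parsers operate on learned embeddings rather than raw thresholds, but the output format (a list of sub-action intervals) is the same kind of intra-action structure the paper studies.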
Related papers
- Cross-Video Contextual Knowledge Exploration and Exploitation for Ambiguity Reduction in Weakly Supervised Temporal Action Localization [23.94629999419033]
Weakly supervised temporal action localization (WSTAL) aims to localize actions in untrimmed videos using video-level labels.
Our work addresses this from a novel perspective, by exploring and exploiting the cross-video contextual knowledge within the dataset.
Our method outperforms the state-of-the-art methods, and can be easily plugged into other WSTAL methods.
arXiv Detail & Related papers (2023-08-24T07:19:59Z)
- Learning to Refactor Action and Co-occurrence Features for Temporal Action Localization [74.74339878286935]
Action features and co-occurrence features often dominate the actual action content in videos.
We develop a novel auxiliary task by decoupling these two types of features within a video snippet.
We term our method RefactorNet, which first explicitly factorizes the action content and regularizes its co-occurrence features.
arXiv Detail & Related papers (2022-06-23T06:30:08Z)
- FineDiving: A Fine-grained Dataset for Procedure-aware Action Quality Assessment [93.09267863425492]
We argue that understanding both high-level semantics and internal temporal structures of actions in competitive sports videos is the key to making predictions accurate and interpretable.
We construct a new fine-grained dataset, called FineDiving, developed on diverse diving events with detailed annotations on action procedures.
arXiv Detail & Related papers (2022-04-07T17:59:32Z)
- Efficient Modelling Across Time of Human Actions and Interactions [92.39082696657874]
We argue that the current fixed-sized temporal kernels in 3D convolutional neural networks (CNNs) can be improved to better deal with temporal variations in the input.
We also study how to better handle variations between classes of actions by enhancing their feature differences over different layers of the architecture.
The proposed approaches are evaluated on several benchmark action recognition datasets and show competitive results.
arXiv Detail & Related papers (2021-10-05T15:39:11Z)
- Weakly Supervised Temporal Action Localization Through Learning Explicit Subspaces for Action and Context [151.23835595907596]
WS-TAL methods learn to localize the temporal starts and ends of action instances in a video under only video-level supervision.
We introduce a framework that learns two feature subspaces respectively for actions and their context.
The proposed approach outperforms state-of-the-art WS-TAL methods on three benchmarks.
arXiv Detail & Related papers (2021-03-30T08:26:53Z)
- FineGym: A Hierarchical Video Dataset for Fine-grained Action Understanding [118.32912239230272]
FineGym is a new action recognition dataset built on top of gymnastic videos.
It provides temporal annotations at both action and sub-action levels with a three-level semantic hierarchy.
This new level of granularity presents significant challenges for action recognition.
arXiv Detail & Related papers (2020-04-14T17:55:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.