Weakly-Supervised Temporal Action Detection for Fine-Grained Videos with
Hierarchical Atomic Actions
- URL: http://arxiv.org/abs/2207.11805v1
- Date: Sun, 24 Jul 2022 20:32:24 GMT
- Title: Weakly-Supervised Temporal Action Detection for Fine-Grained Videos with
Hierarchical Atomic Actions
- Authors: Zhi Li, Lu He, Huijuan Xu
- Abstract summary: We tackle the problem of weakly-supervised fine-grained temporal action detection in videos for the first time.
We propose to model actions as the combinations of reusable atomic actions which are automatically discovered from data.
Our approach constructs a visual representation hierarchy of four levels: clip level, atomic action level, fine action class level and coarse action class level, with supervision at each level.
- Score: 13.665489987620724
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Action understanding has evolved into the era of fine granularity, as most
human behaviors in real life have only minor differences. To detect these
fine-grained actions accurately in a label-efficient way, we tackle the problem
of weakly-supervised fine-grained temporal action detection in videos for the
first time. Without the careful design to capture subtle differences between
fine-grained actions, previous weakly-supervised models for general action
detection cannot perform well in the fine-grained setting. We propose to model
actions as the combinations of reusable atomic actions which are automatically
discovered from data through self-supervised clustering, in order to capture
the commonality and individuality of fine-grained actions. The learnt atomic
actions, represented by visual concepts, are further mapped to fine and coarse
action labels leveraging the semantic label hierarchy. Our approach constructs
a visual representation hierarchy of four levels: clip level, atomic action
level, fine action class level and coarse action class level, with supervision
at each level. Extensive experiments on two large-scale fine-grained video
datasets, FineAction and FineGym, show the benefit of our proposed
weakly-supervised model for fine-grained action detection, and it achieves
state-of-the-art results.
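To make the described architecture concrete, below is a minimal, illustrative PyTorch-style sketch of how such a four-level hierarchy could be wired: clip features are softly assigned to learnable atomic-action prototypes (standing in for the paper's self-supervised clustering), atomic evidence is mapped to fine action classes, and fine scores are aggregated to coarse classes through the label hierarchy. All names, dimensions, and the top-k pooling choice are assumptions for illustration, not the authors' released implementation.
```python
# Illustrative sketch (assumed names/dimensions), not the authors' code:
# clip level -> atomic action level -> fine action class level -> coarse action class level.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalAtomicActionHead(nn.Module):
    def __init__(self, fine_to_coarse, feat_dim=2048, num_atomic=64, num_fine=106):
        super().__init__()
        # Learnable prototypes standing in for the atomic actions that the paper
        # discovers via self-supervised clustering.
        self.prototypes = nn.Parameter(torch.randn(num_atomic, feat_dim))
        # Maps atomic-action evidence to fine action classes.
        self.fine_classifier = nn.Linear(num_atomic, num_fine)
        # Fixed binary matrix (num_fine x num_coarse) from the semantic label hierarchy.
        self.register_buffer("fine_to_coarse", fine_to_coarse)

    def forward(self, clip_feats):  # clip_feats: (batch, time, feat_dim)
        feats = F.normalize(clip_feats, dim=-1)
        protos = F.normalize(self.prototypes, dim=-1)
        # Clip level -> atomic action level: soft assignment of each clip to prototypes.
        atomic_assign = (feats @ protos.t()).softmax(dim=-1)        # (batch, time, num_atomic)
        # Atomic action level -> fine action class level, per clip.
        fine_logits = self.fine_classifier(atomic_assign)           # (batch, time, num_fine)
        # Weak (video-level) supervision: top-k temporal pooling over clip scores.
        k = max(1, fine_logits.shape[1] // 8)
        video_fine = fine_logits.topk(k, dim=1).values.mean(dim=1)  # (batch, num_fine)
        # Fine action class level -> coarse action class level via the label hierarchy.
        video_coarse = video_fine.softmax(dim=-1) @ self.fine_to_coarse  # (batch, num_coarse)
        return atomic_assign, fine_logits, video_fine, video_coarse

# Example: an assumed 106-fine / 10-coarse hierarchy and random clip features.
fine_to_coarse = torch.zeros(106, 10)
fine_to_coarse[torch.arange(106), torch.arange(106) % 10] = 1.0
head = HierarchicalAtomicActionHead(fine_to_coarse)
atomic, clip_fine, video_fine, video_coarse = head(torch.randn(2, 128, 2048))
```
In this sketch, supervision could be applied at each level as the abstract describes: a clustering-style objective on the atomic assignments and classification losses on the video-level fine and coarse scores.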
Related papers
- FinePseudo: Improving Pseudo-Labelling through Temporal-Alignablity for Semi-Supervised Fine-Grained Action Recognition [57.17966905865054]
Real-life applications of action recognition often require a fine-grained understanding of subtle movements.
Existing semi-supervised approaches have mainly focused on coarse-grained action recognition.
We propose an Alignability-Verification-based Metric learning technique to effectively discriminate between fine-grained action pairs.
arXiv Detail & Related papers (2024-09-02T20:08:06Z)
- Weakly-supervised Action Localization via Hierarchical Mining [76.00021423700497]
Weakly-supervised action localization aims to temporally localize and classify action instances in given videos with only video-level categorical labels.
We propose a hierarchical mining strategy at both the video level and the snippet level, i.e., hierarchical supervision and hierarchical consistency mining.
We show that HiM-Net outperforms existing methods on the THUMOS14 and ActivityNet1.3 datasets by large margins by hierarchically mining the supervision and consistency.
arXiv Detail & Related papers (2022-06-22T12:19:09Z)
- Few-Shot Fine-Grained Action Recognition via Bidirectional Attention and Contrastive Meta-Learning [51.03781020616402]
Fine-grained action recognition is attracting increasing attention due to the emerging demand for specific action understanding in real-world applications.
We propose the few-shot fine-grained action recognition problem, aiming to recognize novel fine-grained actions with only a few samples given for each class.
Although progress has been made on coarse-grained actions, existing few-shot recognition methods encounter two issues when handling fine-grained actions.
arXiv Detail & Related papers (2021-08-15T02:21:01Z)
- Learning Action Completeness from Points for Weakly-supervised Temporal Action Localization [15.603643098270409]
We tackle the problem of localizing temporal intervals of actions with only a single frame label per action instance available for training.
In this paper, we propose a novel framework, where dense pseudo-labels are generated to provide completeness guidance for the model.
arXiv Detail & Related papers (2021-08-11T04:54:39Z)
- Unsupervised Action Segmentation with Self-supervised Feature Learning and Co-occurrence Parsing [32.66011849112014]
Temporal action segmentation is the task of classifying each frame of a video with an action label.
In this work we explore a self-supervised method that operates on a corpus of unlabeled videos and predicts a likely set of temporal segments across the videos.
We develop CAP, a novel co-occurrence action parsing algorithm that can not only capture the correlation among sub-actions underlying the structure of activities, but also estimate the temporal trajectory of the sub-actions in an accurate and general way.
arXiv Detail & Related papers (2021-05-29T00:29:40Z)
- Semi-Supervised Few-Shot Atomic Action Recognition [59.587738451616495]
We propose a novel model for semi-supervised few-shot atomic action recognition.
Our model features unsupervised and contrastive video embedding, loose action alignment, multi-head feature comparison, and attention-based aggregation.
Experiments show that our model attains high accuracy on representative atomic action datasets, outperforming the respective state-of-the-art classification accuracies obtained under full supervision.
arXiv Detail & Related papers (2020-11-17T03:59:05Z)
- FineGym: A Hierarchical Video Dataset for Fine-grained Action Understanding [118.32912239230272]
FineGym is a new action recognition dataset built on top of gymnastic videos.
It provides temporal annotations at both action and sub-action levels with a three-level semantic hierarchy.
This new level of granularity presents significant challenges for action recognition.
arXiv Detail & Related papers (2020-04-14T17:55:21Z)
- Weakly-Supervised Action Localization by Generative Attention Modeling [65.03548422403061]
Weakly-supervised temporal action localization is a problem of learning an action localization model with only video-level action labeling available.
We propose to model the class-agnostic frame-wise probability conditioned on the frame attention using a conditional Variational Auto-Encoder (VAE).
By maximizing the conditional probability with respect to the attention, the action and non-action frames are well separated (a minimal illustrative sketch of this idea appears after the list).
arXiv Detail & Related papers (2020-03-27T14:02:56Z)
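The conditional-VAE formulation in the last entry can be illustrated with a small sketch: a VAE that reconstructs frame features conditioned on a per-frame attention value, so that minimizing the negative conditional likelihood with respect to the attention separates action from non-action frames. Every name, dimension, and loss choice below is an assumption for illustration rather than the paper's implementation.
```python
# Illustrative sketch (assumed architecture) of a conditional VAE over frame
# features, conditioned on a per-frame attention value lambda in [0, 1].
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionConditionedVAE(nn.Module):
    def __init__(self, feat_dim=2048, latent_dim=128, hidden=512):
        super().__init__()
        # Encoder q(z | x, lambda): frame feature concatenated with its attention value.
        self.encoder = nn.Sequential(nn.Linear(feat_dim + 1, hidden), nn.ReLU())
        self.mu_head = nn.Linear(hidden, latent_dim)
        self.logvar_head = nn.Linear(hidden, latent_dim)
        # Decoder p(x | z, lambda).
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + 1, hidden), nn.ReLU(), nn.Linear(hidden, feat_dim))

    def forward(self, frame_feats, attention):
        # frame_feats: (num_frames, feat_dim); attention: (num_frames,) values in [0, 1].
        cond = attention.unsqueeze(-1)
        h = self.encoder(torch.cat([frame_feats, cond], dim=-1))
        mu, logvar = self.mu_head(h), self.logvar_head(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterisation trick
        recon = self.decoder(torch.cat([z, cond], dim=-1))
        # Per-frame negative ELBO: reconstruction error + KL to a standard normal prior.
        recon_loss = F.mse_loss(recon, frame_feats, reduction="none").sum(dim=-1)
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1)
        return recon_loss + kl

# Minimising this loss with respect to both the VAE and the attention values pushes the
# attention towards a setting under which action and non-action frames are modelled well.
loss = AttentionConditionedVAE()(torch.randn(300, 2048), torch.rand(300)).mean()
```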