Joint Learning On The Hierarchy Representation for Fine-Grained Human
Action Recognition
- URL: http://arxiv.org/abs/2110.05853v1
- Date: Tue, 12 Oct 2021 09:37:51 GMT
- Title: Joint Learning On The Hierarchy Representation for Fine-Grained Human
Action Recognition
- Authors: Mei Chee Leong, Hui Li Tan, Haosong Zhang, Liyuan Li, Feng Lin, Joo
Hwee Lim
- Abstract summary: Fine-grained human action recognition is a core research topic in computer vision.
We propose a novel multi-task network which exploits the FineGym hierarchy representation to achieve effective joint learning and prediction.
Our results on the FineGym dataset achieve a new state-of-the-art performance, with 91.80% Top-1 accuracy and 88.46% mean accuracy for element actions.
- Score: 13.088129408377918
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Fine-grained human action recognition is a core research topic in computer
vision. Inspired by the recently proposed hierarchy representation of
fine-grained actions in FineGym and SlowFast network for action recognition, we
propose a novel multi-task network which exploits the FineGym hierarchy
representation to achieve effective joint learning and prediction for
fine-grained human action recognition. The multi-task network consists of three
pathways of SlowOnly networks with gradually increased frame rates for events,
sets and elements of fine-grained actions, followed by our proposed integration
layers for joint learning and prediction. It is a two-stage approach: it first
learns deep feature representations at each hierarchical level, followed by
feature encoding and fusion for multi-task learning. Our empirical
results on the FineGym dataset achieve a new state-of-the-art performance, with
91.80% Top-1 accuracy and 88.46% mean accuracy for element actions, which are
3.40% and 7.26% higher than the previous best results.
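The three-pathway design described in the abstract can be sketched in a few lines. Below is a minimal, hypothetical NumPy mock-up, not the paper's actual implementation: the backbone (here a mean-pool stand-in for a SlowOnly network), the sampling strides, the concatenation-based integration layer, and the head sizes are all illustrative placeholders. Frames are subsampled at three increasing rates for the event, set, and element levels, pooled per pathway, fused, and fed to per-level classification heads.

```python
import numpy as np

def sample_frames(clip, stride):
    """Uniformly subsample a (T, D) clip of frame features at the given stride."""
    return clip[::stride]

def pathway_features(clip):
    """Stand-in for a SlowOnly backbone: mean-pool frame features into one vector."""
    return clip.mean(axis=0)

def joint_predict(clip, strides=(8, 4, 2), heads=None):
    """Three pathways at increasing frame rates (coarse events -> fine elements).
    Pooled pathway features are concatenated (a sketch of the integration layer)
    and passed to per-level linear heads for joint prediction."""
    feats = [pathway_features(sample_frames(clip, s)) for s in strides]
    fused = np.concatenate(feats)
    if heads is None:
        # Hypothetical linear heads; class counts are placeholders.
        rng = np.random.default_rng(0)
        heads = {level: rng.standard_normal((fused.size, n))
                 for level, n in [("event", 4), ("set", 15), ("element", 99)]}
    return {level: fused @ W for level, W in heads.items()}

# Toy input: 64 frames with 16-dim per-frame features.
clip = np.random.default_rng(1).standard_normal((64, 16))
preds = joint_predict(clip)
```

Each pathway sees the same clip at a different temporal resolution, so the element-level head receives the densest sampling, mirroring the intuition that fine-grained sub-actions require finer temporal detail.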
Related papers
- A Multi-Task Deep Learning Approach for Sensor-based Human Activity
Recognition and Segmentation [4.987833356397567]
We propose a new deep neural network to solve the two tasks simultaneously.
The proposed network adopts selective convolution and features multiscale windows to segment activities of long or short time durations.
Our proposed method outperforms the state-of-the-art methods both for activity recognition and segmentation.
arXiv Detail & Related papers (2023-03-20T13:34:28Z)
- Hierarchical Modeling for Task Recognition and Action Segmentation in Weakly-Labeled Instructional Videos [6.187780920448871]
This paper focuses on task recognition and action segmentation in weakly-labeled instructional videos.
We propose a two-stream framework, which exploits semantic and temporal hierarchies to recognize top-level tasks in instructional videos.
We present a novel top-down weakly-supervised action segmentation approach, where the predicted task is used to constrain the inference of fine-grained action sequences.
arXiv Detail & Related papers (2021-10-12T02:32:15Z)
- Learning Multi-Granular Spatio-Temporal Graph Network for Skeleton-based Action Recognition [49.163326827954656]
We propose a novel multi-granular spatio-temporal graph network for skeleton-based action classification.
We develop a dual-head graph network consisting of two interleaved branches, which enables us to extract features at two spatio-temporal resolutions.
We conduct extensive experiments on three large-scale datasets.
arXiv Detail & Related papers (2021-08-10T09:25:07Z)
- First and Second Order Dynamics in a Hierarchical SOM system for Action Recognition [0.0]
We present a novel action recognition system that employs a hierarchy of Self-Organizing Maps together with a custom supervised neural network that learns to categorize actions.
The system preprocesses the input from a Kinect-like 3D camera to exploit information not only about joint positions, but also their first- and second-order dynamics.
arXiv Detail & Related papers (2021-04-13T09:46:40Z)
- SIMPLE: SIngle-network with Mimicking and Point Learning for Bottom-up Human Pose Estimation [81.03485688525133]
We propose a novel multi-person pose estimation framework, SIngle-network with Mimicking and Point Learning for Bottom-up Human Pose Estimation (SIMPLE).
Specifically, in the training process, we enable SIMPLE to mimic the pose knowledge from the high-performance top-down pipeline.
Besides, SIMPLE formulates human detection and pose estimation as a unified point learning framework so that the two tasks complement each other in a single network.
arXiv Detail & Related papers (2021-04-06T13:12:51Z)
- Distribution Alignment: A Unified Framework for Long-tail Visual Recognition [52.36728157779307]
We propose a unified distribution alignment strategy for long-tail visual recognition.
We then introduce a generalized re-weight method in the two-stage learning to balance the class prior.
Our approach achieves the state-of-the-art results across all four recognition tasks with a simple and unified framework.
arXiv Detail & Related papers (2021-03-30T14:09:53Z)
- Searching Multi-Rate and Multi-Modal Temporal Enhanced Networks for Gesture Recognition [89.0152015268929]
We propose the first neural architecture search (NAS)-based method for RGB-D gesture recognition.
The proposed method includes two key components: 1) enhanced temporal representation via the 3D Central Difference Convolution (3D-CDC) family, and 2) optimized backbones for multi-rate and multi-modal branches and lateral connections.
The resultant multi-rate network provides a new perspective to understand the relationship between RGB and depth modalities and their temporal dynamics.
arXiv Detail & Related papers (2020-08-21T10:45:09Z)
- FineGym: A Hierarchical Video Dataset for Fine-grained Action Understanding [118.32912239230272]
FineGym is a new action recognition dataset built on top of gymnastic videos.
It provides temporal annotations at both action and sub-action levels with a three-level semantic hierarchy.
This new level of granularity presents significant challenges for action recognition.
arXiv Detail & Related papers (2020-04-14T17:55:21Z)
- Knowledge Integration Networks for Action Recognition [58.548331848942865]
We design a three-branch architecture consisting of a main branch for action recognition, and two auxiliary branches for human parsing and scene recognition.
We propose a two-level knowledge encoding mechanism which contains a Cross Branch Integration (CBI) module for encoding the auxiliary knowledge into medium-level convolutional features, and an Action Knowledge Graph (AKG) for effectively fusing high-level context information.
The proposed KINet achieves the state-of-the-art performance on a large-scale action recognition benchmark Kinetics-400, with a top-1 accuracy of 77.8%.
arXiv Detail & Related papers (2020-02-18T10:20:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.