Generative Hierarchical Temporal Transformer for Hand Action Recognition
and Motion Prediction
- URL: http://arxiv.org/abs/2311.17366v2
- Date: Mon, 25 Dec 2023 03:54:53 GMT
- Title: Generative Hierarchical Temporal Transformer for Hand Action Recognition
and Motion Prediction
- Authors: Yilin Wen, Hao Pan, Takehiko Ohkawa, Lei Yang, Jia Pan, Yoichi Sato,
Taku Komura, Wenping Wang
- Abstract summary: We present a novel framework that concurrently tackles hand action recognition and 3D future hand motion prediction.
Our framework is trained across multiple datasets, where pose and action blocks are trained separately to fully utilize pose-action annotations.
- Score: 70.86769090545076
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a novel framework that concurrently tackles hand action
recognition and 3D future hand motion prediction. While previous works focus on
either recognition or prediction, we propose a generative Transformer VAE
architecture to jointly capture both aspects, facilitating realistic motion
prediction by leveraging the short-term hand motion and long-term action
consistency observed across timestamps. To ensure faithful representation of
the semantic dependency and different temporal granularity of hand pose and
action, our framework is decomposed into two cascaded VAE blocks. The lower
pose block models short-span poses, while the upper action block models
long-span action. These are connected by a mid-level feature that represents
sub-second series of hand poses. Our framework is trained across multiple
datasets, where pose and action blocks are trained separately to fully utilize
pose-action annotations of different qualities. Evaluations show that on
multiple datasets, the joint modeling of recognition and prediction improves
over separate solutions, and the semantic and temporal hierarchy enables
long-term pose and action modeling.
Related papers
- DiffPose: SpatioTemporal Diffusion Model for Video-Based Human Pose
Estimation [16.32910684198013]
We present DiffPose, a novel diffusion architecture that formulates video-based human pose estimation as a conditional heatmap generation problem.
We show two unique characteristics from DiffPose on pose estimation task: (i) the ability to combine multiple sets of pose estimates to improve prediction accuracy, particularly for challenging joints, and (ii) the ability to adjust the number of iterative steps for feature refinement without retraining the model.
arXiv Detail & Related papers (2023-07-31T14:00:23Z) - Mutual Information-Based Temporal Difference Learning for Human Pose
Estimation in Video [16.32910684198013]
We present a novel multi-frame human pose estimation framework, which employs temporal differences across frames to model dynamic contexts.
To be specific, we design a multi-stage entangled learning sequences conditioned on multi-stage differences to derive informative motion representation sequences.
These place us to rank No.1 in the Crowd Pose Estimation in Complex Events Challenge on benchmark HiEve.
arXiv Detail & Related papers (2023-03-15T09:29:03Z) - Hierarchical Temporal Transformer for 3D Hand Pose Estimation and Action
Recognition from Egocentric RGB Videos [50.74218823358754]
We develop a transformer-based framework to exploit temporal information for robust estimation.
We build a network hierarchy with two cascaded transformer encoders, where the first one exploits the short-term temporal cue for hand pose estimation.
Our approach achieves competitive results on two first-person hand action benchmarks, namely FPHA and H2O.
arXiv Detail & Related papers (2022-09-20T05:52:54Z) - Weakly-supervised Action Transition Learning for Stochastic Human Motion
Prediction [81.94175022575966]
We introduce the task of action-driven human motion prediction.
It aims to predict multiple plausible future motions given a sequence of action labels and a short motion history.
arXiv Detail & Related papers (2022-05-31T08:38:07Z) - Investigating Pose Representations and Motion Contexts Modeling for 3D
Motion Prediction [63.62263239934777]
We conduct an indepth study on various pose representations with a focus on their effects on the motion prediction task.
We propose a novel RNN architecture termed AHMR (Attentive Hierarchical Motion Recurrent network) for motion prediction.
Our approach outperforms the state-of-the-art methods in short-term prediction and achieves much enhanced long-term prediction proficiency.
arXiv Detail & Related papers (2021-12-30T10:45:22Z) - Generating Smooth Pose Sequences for Diverse Human Motion Prediction [90.45823619796674]
We introduce a unified deep generative network for both diverse and controllable motion prediction.
Our experiments on two standard benchmark datasets, Human3.6M and HumanEva-I, demonstrate that our approach outperforms the state-of-the-art baselines in terms of both sample diversity and accuracy.
arXiv Detail & Related papers (2021-08-19T00:58:00Z) - Conditional Temporal Variational AutoEncoder for Action Video Prediction [66.63038712306606]
ACT-VAE predicts pose sequences for an action clips from a single input image.
When connected with a plug-and-play Pose-to-Image (P2I) network, ACT-VAE can synthesize image sequences.
arXiv Detail & Related papers (2021-08-12T10:59:23Z) - RAIN: Reinforced Hybrid Attention Inference Network for Motion
Forecasting [34.54878390622877]
We propose a generic motion forecasting framework with dynamic key information selection and ranking based on a hybrid attention mechanism.
The framework is instantiated to handle multi-agent trajectory prediction and human motion forecasting tasks.
We validate the framework on both synthetic simulations and motion forecasting benchmarks in different domains.
arXiv Detail & Related papers (2021-08-03T06:30:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.