Related papers: Generative Hierarchical Temporal Transformer for Hand Action Recognition and Motion Prediction

URL: http://arxiv.org/abs/2311.17366v2
Date: Mon, 25 Dec 2023 03:54:53 GMT
Title: Generative Hierarchical Temporal Transformer for Hand Action Recognition and Motion Prediction
Authors: Yilin Wen, Hao Pan, Takehiko Ohkawa, Lei Yang, Jia Pan, Yoichi Sato, Taku Komura, Wenping Wang
Abstract summary: We present a novel framework that concurrently tackles hand action recognition and 3D future hand motion prediction. Our framework is trained across multiple datasets, where pose and action blocks are trained separately to fully utilize pose-action annotations.
Score: 70.86769090545076
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We present a novel framework that concurrently tackles hand action recognition and 3D future hand motion prediction. While previous works focus on either recognition or prediction, we propose a generative Transformer VAE architecture to jointly capture both aspects, facilitating realistic motion prediction by leveraging the short-term hand motion and long-term action consistency observed across timestamps. To ensure faithful representation of the semantic dependency and different temporal granularity of hand pose and action, our framework is decomposed into two cascaded VAE blocks. The lower pose block models short-span poses, while the upper action block models long-span action. These are connected by a mid-level feature that represents sub-second series of hand poses. Our framework is trained across multiple datasets, where pose and action blocks are trained separately to fully utilize pose-action annotations of different qualities. Evaluations show that on multiple datasets, the joint modeling of recognition and prediction improves over separate solutions, and the semantic and temporal hierarchy enables long-term pose and action modeling.

Related papers

Towards Consistent Long-Term Pose Generation [0.0]
We propose a novel one-stage architecture that directly generates poses in continuous coordinate space from minimal context. Our key innovation is eliminating the need for intermediate representations or token-based generation. Our approach significantly outperforms existing quantization-based and autoregressive methods, especially in long-term generation scenarios.
arXiv Detail & Related papers (2025-07-24T12:57:22Z)
Learning Time-Aware Causal Representation for Model Generalization in Evolving Domains [50.66049136093248]
We develop a time-aware structural causal model (SCM) that incorporates dynamic causal factors and the causal mechanism drifts. We show that our method can yield the optimal causal predictor for each time domain. Results on both synthetic and real-world datasets exhibit that SYNC can achieve superior temporal generalization performance.
arXiv Detail & Related papers (2025-06-21T14:05:37Z)
Aligning Foundation Model Priors and Diffusion-Based Hand Interactions for Occlusion-Resistant Two-Hand Reconstruction [50.952228546326516]
Two-hand reconstruction from monocular images faces persistent challenges due to complex and dynamic hand postures and occlusions. Existing approaches struggle with such alignment issues, often resulting in misalignment and penetration artifacts. We propose a novel framework that attempts to precisely align hand poses and interactions by integrating foundation model-driven 2D priors with diffusion-based interaction refinement.
arXiv Detail & Related papers (2025-03-22T14:42:27Z)
SpatioTemporal Learning for Human Pose Estimation in Sparsely-Labeled Videos [18.37601213802529]
STDPose is a novel framework that enhances human pose estimation by learning in sparsely-labeled videos. STDPose establishes a new benchmark for both video pose propagation (i.e., propagating pose from labeled frames to unlabeled frames) and pose estimation tasks.
arXiv Detail & Related papers (2025-01-25T04:43:12Z)
Multi-agent Long-term 3D Human Pose Forecasting via Interaction-aware Trajectory Conditioning [41.09061877498741]
We propose an interaction-aware trajectory-conditioned long-term multi-agent human pose forecasting model. Our model effectively handles the multi-modality of human motion and the complexity of long-term multi-agent interactions.
arXiv Detail & Related papers (2024-04-08T06:15:13Z)
Disentangled Neural Relational Inference for Interpretable Motion Prediction [38.40799770648501]
We develop a variational auto-encoder framework that integrates graph-based representations and time-sequence models. Our model infers dynamic interaction graphs augmented with interpretable edge features that characterize the interactions. We validate our approach through extensive experiments on both simulated and real-world datasets.
arXiv Detail & Related papers (2024-01-07T22:49:24Z)
A Decoupled Spatio-Temporal Framework for Skeleton-based Action Segmentation [89.86345494602642]
Existing methods are limited by weak temporal modeling capability. We propose a Decoupled Spatio-Temporal Framework (DeST) to address these issues. DeST significantly outperforms current state-of-the-art methods with less computational complexity.
arXiv Detail & Related papers (2023-12-10T09:11:39Z)
TimeTuner: Diagnosing Time Representations for Time-Series Forecasting with Counterfactual Explanations [3.8357850372472915]
This paper contributes a novel visual analytics framework, namely TimeTuner, to help analysts understand how model behaviors are associated with the localization, stationarity, and correlations of time-series representations. We show that TimeTuner can help characterize time-series representations and guide the feature engineering processes.
arXiv Detail & Related papers (2023-07-19T11:40:15Z)
Hierarchical Temporal Transformer for 3D Hand Pose Estimation and Action Recognition from Egocentric RGB Videos [50.74218823358754]
We develop a transformer-based framework to exploit temporal information for robust estimation. We build a network hierarchy with two cascaded transformer encoders, where the first one exploits the short-term temporal cue for hand pose estimation. Our approach achieves competitive results on two first-person hand action benchmarks, namely FPHA and H2O.
arXiv Detail & Related papers (2022-09-20T05:52:54Z)
Temporal Relevance Analysis for Video Action Models [70.39411261685963]
We first propose a new approach to quantify the temporal relationships between frames captured by CNN-based action models. We then conduct comprehensive experiments and in-depth analysis to provide a better understanding of how temporal modeling is affected.
arXiv Detail & Related papers (2022-04-25T19:06:48Z)
Real-time Pose and Shape Reconstruction of Two Interacting Hands With a Single Depth Camera [79.41374930171469]
We present a novel method for real-time pose and shape reconstruction of two strongly interacting hands. Our approach combines an extensive list of favorable properties, including being marker-less. We show state-of-the-art results in scenes that exceed the complexity level demonstrated by previous work.
arXiv Detail & Related papers (2021-06-15T11:39:49Z)
Unsupervised Video Decomposition using Spatio-temporal Iterative Inference [31.97227651679233]
Multi-object scene decomposition is a fast-emerging problem in learning. We show that our model has a high accuracy even without color information. We demonstrate the decomposition, segmentation, and prediction capabilities of our model and show that it outperforms the state-of-the-art on several benchmark datasets.
arXiv Detail & Related papers (2020-06-25T22:57:17Z)
Consistency Guided Scene Flow Estimation [159.24395181068218]
CGSF is a self-supervised framework for the joint reconstruction of 3D scene structure and motion from stereo video. We show that the proposed model can reliably predict disparity and scene flow in challenging imagery. It achieves better generalization than the state-of-the-art, and adapts quickly and robustly to unseen domains.
arXiv Detail & Related papers (2020-06-19T17:28:07Z)

This list is automatically generated from the titles and abstracts of the papers on this site.