Video-Based Human Pose Regression via Decoupled Space-Time Aggregation
- URL: http://arxiv.org/abs/2403.19926v2
- Date: Mon, 1 Apr 2024 08:52:20 GMT
- Title: Video-Based Human Pose Regression via Decoupled Space-Time Aggregation
- Authors: Jijie He, Wenwu Yang
- Abstract summary: We develop an efficient and effective video-based human pose regression method, which bypasses intermediate representations such as heatmaps and instead directly maps the input to the output joint coordinates.
Our method is capable of efficiently and flexibly utilizing the spatial dependency of adjacent joints and the temporal dependency of each joint itself.
Our approach either surpasses or is on par with the state-of-the-art heatmap-based multi-frame human pose estimation methods.
- Score: 0.5524804393257919
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: By leveraging temporal dependency in video sequences, multi-frame human pose estimation algorithms have demonstrated remarkable results in complicated situations, such as occlusion, motion blur, and video defocus. These algorithms are predominantly based on heatmaps, resulting in high computation and storage requirements per frame, which limits their flexibility and real-time application in video scenarios, particularly on edge devices. In this paper, we develop an efficient and effective video-based human pose regression method, which bypasses intermediate representations such as heatmaps and instead directly maps the input to the output joint coordinates. Despite the inherent spatial correlation among adjacent joints of the human pose, the temporal trajectory of each individual joint exhibits relative independence. In light of this, we propose a novel Decoupled Space-Time Aggregation network (DSTA) to separately capture the spatial contexts between adjacent joints and the temporal cues of each individual joint, thereby avoiding the conflation of spatiotemporal dimensions. Concretely, DSTA learns a dedicated feature token for each joint to facilitate the modeling of their spatiotemporal dependencies. With the proposed joint-wise local-awareness attention mechanism, our method is capable of efficiently and flexibly utilizing the spatial dependency of adjacent joints and the temporal dependency of each joint itself. Extensive experiments demonstrate the superiority of our method. Compared to previous regression-based single-frame human pose estimation methods, DSTA significantly enhances performance, achieving an 8.9 mAP improvement on PoseTrack2017. Furthermore, our approach either surpasses or is on par with the state-of-the-art heatmap-based multi-frame human pose estimation methods. Project page: https://github.com/zgspose/DSTA.
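The decoupled aggregation the abstract describes — spatial attention across the joints of each frame, then temporal attention along each joint's own trajectory, never flattening the two dimensions into one joint attention — can be sketched as follows. This is a minimal NumPy illustration of the idea, not the authors' implementation: the learned query/key/value projections, the joint-wise local-awareness masking, and all training details are omitted, and the function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention over the last two axes: (..., N, D).
    d = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ v

def decoupled_space_time_aggregation(tokens):
    """tokens: (F, J, D) — one feature token per joint per frame.

    Spatial attention mixes the J joint tokens within each frame;
    temporal attention mixes the F frames of each individual joint.
    The two dimensions are handled separately, which is the point of
    the decoupling: no F*J joint space-time attention is ever formed.
    """
    # Spatial step: for each of the F frames, attend across joints.
    spatial = attention(tokens, tokens, tokens)      # (F, J, D)
    # Temporal step: for each joint, attend across its own frames.
    per_joint = np.swapaxes(spatial, 0, 1)           # (J, F, D)
    temporal = attention(per_joint, per_joint, per_joint)
    return np.swapaxes(temporal, 0, 1)               # (F, J, D)

F, J, D = 5, 17, 32  # frames, joints (COCO-style count), token width
out = decoupled_space_time_aggregation(np.random.randn(F, J, D))
print(out.shape)  # (5, 17, 32)
```

Splitting the attention this way keeps the cost at O(F·J²) + O(J·F²) rather than O((F·J)²) for full space-time attention, which is what makes the per-joint-token design attractive for real-time and edge use.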
Related papers
- A Decoupled Spatio-Temporal Framework for Skeleton-based Action Segmentation [89.86345494602642]
Existing methods are limited by weak temporal modeling capability.
We propose a Decoupled Spatio-Temporal Framework (DeST) to address these issues.
DeST significantly outperforms current state-of-the-art methods with less computational complexity.
arXiv Detail & Related papers (2023-12-10T09:11:39Z)
- Shuffled Autoregression For Motion Interpolation [53.61556200049156]
This work aims to provide a deep-learning solution for the motion interpolation task.
We propose a novel framework, referred to as Shuffled AutoRegression, which expands autoregression to generate in an arbitrary (shuffled) order.
We also propose an approach to constructing a particular kind of dependency graph, with three stages assembled into an end-to-end spatial-temporal motion Transformer.
arXiv Detail & Related papers (2023-06-10T07:14:59Z) - Kinematics Modeling Network for Video-based Human Pose Estimation [9.506011491028891]
Estimating human poses from videos is critical in human-computer interaction.
Joints cooperate rather than move independently during human movement.
We propose a plug-and-play kinematics modeling module (KMM) to explicitly model temporal correlations between joints.
arXiv Detail & Related papers (2022-07-22T09:37:48Z) - MixSTE: Seq2seq Mixed Spatio-Temporal Encoder for 3D Human Pose
Estimation in Video [75.23812405203778]
Recent solutions estimate 3D human pose from a 2D keypoint sequence by considering body joints across all frames globally to learn spatio-temporal correlation.
We propose MixSTE, which has a temporal transformer block to separately model the temporal motion of each joint and a spatial transformer block to model inter-joint spatial correlation.
In addition, the network output is extended from the central frame to the entire frames of the input video, improving the coherence between the input and output sequences.
arXiv Detail & Related papers (2022-03-02T04:20:59Z) - Motion Prediction via Joint Dependency Modeling in Phase Space [40.54430409142653]
We introduce a novel convolutional neural model to leverage explicit prior knowledge of motion anatomy.
We then propose a global optimization module that learns the implicit relationships between individual joint features.
Our method is evaluated on large-scale 3D human motion benchmark datasets.
arXiv Detail & Related papers (2022-01-07T08:30:01Z) - Spatio-Temporal Joint Graph Convolutional Networks for Traffic
Forecasting [75.10017445699532]
Recent works have shifted their focus towards formulating traffic forecasting as a spatio-temporal graph modeling problem.
We propose a novel approach for accurate traffic forecasting on road networks over multiple future time steps.
arXiv Detail & Related papers (2021-11-25T08:45:14Z) - Relation-Based Associative Joint Location for Human Pose Estimation in
Videos [5.237054164442403]
We design a lightweight and plug-and-play joint relation extractor (JRE) to model the associative relationship between joints explicitly and automatically.
The JRE flexibly learns the relationship between any two joints, allowing it to learn the rich spatial configuration of human poses.
Then, combined with temporal semantic continuity modeling, we propose a Relation-based Pose Semantics Transfer Network (RPSTN) for video-based human pose estimation.
arXiv Detail & Related papers (2021-07-08T04:05:23Z)
- Spatial-Temporal Correlation and Topology Learning for Person Re-Identification in Videos [78.45050529204701]
We propose a novel framework to pursue discriminative and robust representation by modeling cross-scale spatial-temporal correlation.
CTL utilizes a CNN backbone and a key-points estimator to extract semantic local features from the human body.
It explores a context-reinforced topology to construct multi-scale graphs by considering both global contextual information and physical connections of human body.
arXiv Detail & Related papers (2021-04-15T14:32:12Z)
- Self-Attentive 3D Human Pose and Shape Estimation from Videos [82.63503361008607]
We present a video-based learning algorithm for 3D human pose and shape estimation.
We exploit temporal information in videos and propose a self-attention module.
We evaluate our method on the 3DPW, MPI-INF-3DHP, and Human3.6M datasets.
arXiv Detail & Related papers (2021-03-26T00:02:19Z)
- A Graph Attention Spatio-temporal Convolutional Network for 3D Human Pose Estimation in Video [7.647599484103065]
We improve the learning of constraints on the human skeleton by modeling local and global spatial information via attention mechanisms.
Our approach effectively mitigates depth ambiguity and self-occlusion, generalizes to half upper body estimation, and achieves competitive performance on 2D-to-3D video pose estimation.
arXiv Detail & Related papers (2020-03-11T14:54:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information (including all content) and is not responsible for any consequences.