A Graph Attention Spatio-temporal Convolutional Network for 3D Human
Pose Estimation in Video
- URL: http://arxiv.org/abs/2003.14179v4
- Date: Tue, 20 Oct 2020 01:19:08 GMT
- Title: A Graph Attention Spatio-temporal Convolutional Network for 3D Human
Pose Estimation in Video
- Authors: Junfa Liu, Juan Rojas, Zhijun Liang, Yihui Li, and Yisheng Guan
- Abstract summary: We improve the learning of kinematic constraints in the human skeleton by modeling local and global spatial information via attention mechanisms.
Our approach effectively mitigates depth ambiguity and self-occlusion, generalizes to half upper body estimation, and achieves competitive performance on 2D-to-3D video pose estimation.
- Score: 7.647599484103065
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Spatio-temporal information is key to resolve occlusion and depth ambiguity
in 3D pose estimation. Previous methods have focused on either temporal
contexts or local-to-global architectures that embed fixed-length
spatio-temporal information. To date, there have been no effective proposals
that simultaneously and flexibly capture varying spatio-temporal sequences and
achieve real-time 3D pose estimation. In this work, we improve the
learning of kinematic constraints in the human skeleton: posture, local
kinematic connections, and symmetry by modeling local and global spatial
information via attention mechanisms. To adapt to single- and multi-frame
estimation, a dilated temporal model is employed to process varying skeleton
sequences. Also, importantly, we carefully design the interleaving of spatial
semantics with temporal dependencies to achieve a synergistic effect. To this
end, we propose a simple yet effective graph attention spatio-temporal
convolutional network (GAST-Net) that comprises interleaved temporal
convolutional and graph attention blocks. Experiments on two challenging
benchmark datasets (Human3.6M and HumanEva-I) and YouTube videos demonstrate
that our approach effectively mitigates depth ambiguity and self-occlusion,
generalizes to half upper body estimation, and achieves competitive performance
on 2D-to-3D video pose estimation. Code, video, and supplementary information
are available at:
\href{http://www.juanrojas.net/gast/}{http://www.juanrojas.net/gast/}
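The abstract's core architectural idea, interleaving dilated temporal convolutions with graph attention over skeleton joints, can be illustrated with a minimal NumPy sketch. All function names, dimensions, and the toy chain skeleton below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def graph_attention(X, adj):
    """Attention over joints, masked by skeleton adjacency.
    X: (J, C) joint features; adj: (J, J) adjacency with self-loops."""
    scores = X @ X.T / np.sqrt(X.shape[1])        # (J, J) similarity
    scores = np.where(adj > 0, scores, -1e9)      # attend only to connected joints
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ X                            # aggregated joint features

def dilated_temporal_conv(X, kernel, dilation):
    """Dilated convolution over the time axis.
    X: (T, J, C); kernel: (K,) weights shared across joints and channels."""
    T, K = X.shape[0], len(kernel)
    span = (K - 1) * dilation
    out = np.zeros((T - span,) + X.shape[1:])
    for t in range(T - span):
        for k in range(K):
            out[t] += kernel[k] * X[t + k * dilation]
    return out

# Toy forward pass: 9 frames, 17 joints, 8 channels, chain skeleton.
T, J, C = 9, 17, 8
rng = np.random.default_rng(0)
X = rng.standard_normal((T, J, C))
adj = np.eye(J)
for j in range(J - 1):
    adj[j, j + 1] = adj[j + 1, j] = 1

kernel = np.array([0.25, 0.5, 0.25])
X = dilated_temporal_conv(X, kernel, dilation=1)      # temporal block -> (7, 17, 8)
X = np.stack([graph_attention(f, adj) for f in X])    # spatial (graph attention) block
X = dilated_temporal_conv(X, kernel, dilation=3)      # wider temporal context -> (1, 17, 8)
print(X.shape)  # (1, 17, 8)
```

Increasing the dilation in successive temporal blocks is what lets the receptive field cover varying-length sequences without changing the kernel size.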
Related papers
- STGFormer: Spatio-Temporal GraphFormer for 3D Human Pose Estimation in Video [7.345621536750547]
This paper presents a graph-based framework for 3D human pose estimation in video.
Specifically, we develop a graph-based attention mechanism, integrating graph information directly into the respective attention layers.
We demonstrate that our method achieves significant state-of-the-art performance in 3D human pose estimation.
arXiv Detail & Related papers (2024-07-14T06:45:27Z) - Graph and Skipped Transformer: Exploiting Spatial and Temporal Modeling Capacities for Efficient 3D Human Pose Estimation [36.93661496405653]
We take a global approach to exploit spatio-temporal information with a concise Graph and Skipped Transformer architecture.
Specifically, in 3D pose stage, coarse-grained body parts are deployed to construct a fully data-driven adaptive model.
Experiments are conducted on Human3.6M, MPI-INF-3DHP and Human-Eva benchmarks.
arXiv Detail & Related papers (2024-07-03T10:42:09Z) - Hierarchical Temporal Context Learning for Camera-based Semantic Scene Completion [57.232688209606515]
We present HTCL, a novel Hierarchical Temporal Context Learning paradigm for improving camera-based semantic scene completion.
Our method ranks 1st on the SemanticKITTI benchmark and even surpasses LiDAR-based methods in terms of mIoU.
arXiv Detail & Related papers (2024-07-02T09:11:17Z) - UPose3D: Uncertainty-Aware 3D Human Pose Estimation with Cross-View and Temporal Cues [55.69339788566899]
UPose3D is a novel approach for multi-view 3D human pose estimation.
It improves robustness and flexibility without requiring direct 3D annotations.
arXiv Detail & Related papers (2024-04-23T00:18:00Z) - Spatio-temporal Tendency Reasoning for Human Body Pose and Shape
Estimation from Videos [10.50306784245168]
We present a spatio-temporal tendency reasoning (STR) network for recovering human body pose and shape from videos.
Our STR aims to learn accurate spatio-temporal motion sequences in an unconstrained environment.
Our STR remains competitive with the state-of-the-art on three datasets.
arXiv Detail & Related papers (2022-10-07T16:09:07Z) - Improving Robustness and Accuracy via Relative Information Encoding in
3D Human Pose Estimation [59.94032196768748]
We propose a relative information encoding method that yields positional and temporal enhanced representations.
Our method outperforms state-of-the-art methods on two public datasets.
arXiv Detail & Related papers (2021-07-29T14:12:19Z) - Spatial-Temporal Correlation and Topology Learning for Person
Re-Identification in Videos [78.45050529204701]
We propose a novel Correlation and Topology Learning (CTL) framework to pursue discriminative and robust representation by modeling cross-scale spatial-temporal correlation.
CTL utilizes a CNN backbone and a key-points estimator to extract semantic local features from the human body.
It explores a context-reinforced topology to construct multi-scale graphs by considering both global contextual information and physical connections of the human body.
arXiv Detail & Related papers (2021-04-15T14:32:12Z) - Self-Attentive 3D Human Pose and Shape Estimation from Videos [82.63503361008607]
We present a video-based learning algorithm for 3D human pose and shape estimation.
We exploit temporal information in videos and propose a self-attention module.
We evaluate our method on the 3DPW, MPI-INF-3DHP, and Human3.6M datasets.
arXiv Detail & Related papers (2021-03-26T00:02:19Z) - Joint Spatial-Temporal Optimization for Stereo 3D Object Tracking [34.40019455462043]
We propose a joint spatial-temporal optimization-based stereo 3D object tracking method.
From the network, we detect corresponding 2D bounding boxes on adjacent images and regress an initial 3D bounding box.
Dense object cues associated with the object centroid are then predicted using a region-based network.
arXiv Detail & Related papers (2020-04-20T13:59:46Z) - Disentangling and Unifying Graph Convolutions for Skeleton-Based Action
Recognition [79.33539539956186]
We propose a simple method to disentangle multi-scale graph convolutions and a unified spatial-temporal graph convolutional operator named G3D.
By coupling these proposals, we develop a powerful feature extractor named MS-G3D based on which our model outperforms previous state-of-the-art methods on three large-scale datasets.
arXiv Detail & Related papers (2020-03-31T11:28:25Z)
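The disentangling idea summarized above can be sketched as follows: instead of stacking powers of the adjacency matrix, which biases aggregation toward nearby joints, each scale uses a k-adjacency matrix that keeps only joints at shortest-path distance exactly k. This NumPy helper is a hedged illustration of that construction, not the authors' code:

```python
import numpy as np

def k_adjacency(adj, k):
    """Disentangled k-hop adjacency: entry (i, j) is 1 iff the shortest
    path between joints i and j has length exactly k."""
    J = adj.shape[0]
    A = adj + np.eye(J)                                    # add self-loops
    reach_k = (np.linalg.matrix_power(A, k) > 0).astype(float)
    reach_prev = (np.linalg.matrix_power(A, k - 1) > 0).astype(float)
    return reach_k - reach_prev                            # joints first reached at hop k

# Toy chain skeleton with 5 joints: 0-1-2-3-4
adj = np.zeros((5, 5))
for j in range(4):
    adj[j, j + 1] = adj[j + 1, j] = 1

A2 = k_adjacency(adj, 2)
print(A2[0])  # from joint 0, only joint 2 lies at distance exactly 2
```

A multi-scale graph convolution then applies a separate learned transform per k and sums the results, so each scale contributes distinct, non-overlapping neighborhoods.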
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.