Temporal-Aware Refinement for Video-based Human Pose and Shape Recovery
- URL: http://arxiv.org/abs/2311.09543v1
- Date: Thu, 16 Nov 2023 03:35:17 GMT
- Title: Temporal-Aware Refinement for Video-based Human Pose and Shape Recovery
- Authors: Ming Chen, Yan Zhou, Weihua Jian, Pengfei Wan, Zhongyuan Wang
- Abstract summary: We propose a temporal-aware refining network (TAR) to explore temporal-aware global and local image features for accurate pose and shape recovery.
Our TAR obtains more accurate results than previous state-of-the-art methods on popular benchmarks, i.e., 3DPW, MPI-INF-3DHP, and Human3.6M.
- Score: 20.566505924677013
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Though significant progress in human pose and shape recovery from monocular
RGB images has been made in recent years, obtaining 3D human motion with high
accuracy and temporal consistency from videos remains challenging. Existing
video-based methods tend to reconstruct human motion from global image
features, which lack fine-grained detail and thus limit reconstruction
accuracy. In this paper, we propose a Temporal-Aware Refining Network (TAR)
that jointly exploits temporal-aware global and local image
features for accurate pose and shape recovery. First, a global transformer
encoder is introduced to obtain temporal global features from static feature
sequences. Second, a bidirectional ConvGRU network takes the sequence of
high-resolution feature maps as input, and outputs temporal local feature maps
that maintain high resolution and capture the local motion of the human body.
Finally, a recurrent refinement module iteratively updates estimated SMPL
parameters by leveraging both global and local temporal information to achieve
accurate and smooth results. Extensive experiments demonstrate that our TAR
obtains more accurate results than previous state-of-the-art methods on popular
benchmarks, i.e., 3DPW, MPI-INF-3DHP, and Human3.6M.
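The abstract describes a three-stage architecture: a transformer encoder over per-frame static features, a bidirectional ConvGRU over high-resolution feature maps, and a recurrent module that iteratively refines SMPL parameters. Below is a minimal PyTorch sketch of that pipeline shape, written from the abstract alone; every size (2048-dim static features, an 85-dim SMPL vector of 72 pose + 10 shape + 3 camera parameters, three refinement iterations) and the average pooling of the local maps are assumptions, not the authors' implementation.

```python
# Sketch of a TAR-style pipeline, reconstructed from the abstract only.
# Dimensions, pooling, and the update loop are illustrative assumptions.
import torch
import torch.nn as nn


class ConvGRUCell(nn.Module):
    """Single ConvGRU cell: GRU gating computed with 2D convolutions."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        p = k // 2
        self.gates = nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, k, padding=p)
        self.cand = nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=p)

    def forward(self, x, h):
        zr = torch.sigmoid(self.gates(torch.cat([x, h], dim=1)))
        z, r = zr.chunk(2, dim=1)
        n = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * n


class TARSketch(nn.Module):
    def __init__(self, feat_dim=2048, map_ch=256, hid_ch=128, n_params=85):
        super().__init__()
        # (1) Global branch: transformer encoder over per-frame static features.
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8,
                                           batch_first=True)
        self.global_enc = nn.TransformerEncoder(layer, num_layers=2)
        # (2) Local branch: bidirectional ConvGRU over high-res feature maps.
        self.fwd = ConvGRUCell(map_ch, hid_ch)
        self.bwd = ConvGRUCell(map_ch, hid_ch)
        # (3) Recurrent refiner: regress an additive update to SMPL parameters.
        self.refine = nn.Sequential(
            nn.Linear(feat_dim + 2 * hid_ch + n_params, 1024), nn.ReLU(),
            nn.Linear(1024, n_params))
        self.init_params = nn.Parameter(torch.zeros(n_params))

    def forward(self, static_feats, feat_maps, n_iters=3):
        # static_feats: (B, T, feat_dim); feat_maps: (B, T, C, H, W)
        B, T, C, H, W = feat_maps.shape
        g = self.global_enc(static_feats)              # (B, T, feat_dim)

        h_f = feat_maps.new_zeros(B, self.fwd.cand.out_channels, H, W)
        h_b = h_f.clone()
        fwd_states, bwd_states = [], [None] * T
        for t in range(T):                             # forward-in-time pass
            h_f = self.fwd(feat_maps[:, t], h_f)
            fwd_states.append(h_f)
        for t in reversed(range(T)):                   # backward-in-time pass
            h_b = self.bwd(feat_maps[:, t], h_b)
            bwd_states[t] = h_b
        local = torch.stack(
            [torch.cat([f, b], dim=1) for f, b in zip(fwd_states, bwd_states)],
            dim=1)                                     # (B, T, 2*hid, H, W)
        local = local.mean(dim=(-2, -1))               # pool to (B, T, 2*hid)

        # Iteratively update per-frame SMPL parameters (HMR-style loop).
        params = self.init_params.expand(B, T, -1)
        for _ in range(n_iters):
            params = params + self.refine(
                torch.cat([g, local, params], dim=-1))
        return params                                  # (B, T, n_params)
```

The mean pooling of the ConvGRU maps is a placeholder for whatever spatial aggregation TAR actually performs, and the additive update loop is a generic stand-in for the paper's recurrent refinement module; only the overall global/local/refine structure is taken from the abstract.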
Related papers
- Disambiguating Monocular Reconstruction of 3D Clothed Human with Spatial-Temporal Transformer [22.940662039794603]
We propose a Spatial-Temporal Transformer network for 3D clothed human reconstruction.
A spatial transformer is employed to extract global information for normal map prediction.
Incorporating temporal features enhances the accuracy of the input features to the implicit networks.
arXiv Detail & Related papers (2024-10-21T02:40:27Z)
- WiLoR: End-to-end 3D Hand Localization and Reconstruction in-the-wild [53.288327629960364]
We present a data-driven pipeline for efficient multi-hand reconstruction in the wild.
The proposed pipeline is composed of two components: a real-time fully convolutional hand localization network and a high-fidelity transformer-based 3D hand reconstruction model.
Our approach outperforms previous methods in both efficiency and accuracy on popular 2D and 3D benchmarks.
arXiv Detail & Related papers (2024-09-18T18:46:51Z)
- Graph and Skipped Transformer: Exploiting Spatial and Temporal Modeling Capacities for Efficient 3D Human Pose Estimation [36.93661496405653]
We take a global approach to exploiting spatio-temporal information with a concise Graph and Skipped Transformer architecture.
Specifically, in the 3D pose stage, coarse-grained body parts are deployed to construct a fully data-driven adaptive model.
Experiments are conducted on the Human3.6M, MPI-INF-3DHP, and HumanEva benchmarks.
arXiv Detail & Related papers (2024-07-03T10:42:09Z)
- Enhanced Spatio-Temporal Context for Temporally Consistent Robust 3D Human Motion Recovery from Monocular Videos [5.258814754543826]
We propose a novel method for temporally consistent motion estimation from a monocular video.
Instead of using generic ResNet-like features, our method uses a body-aware feature representation and an independent per-frame pose initialization.
Our method attains significantly lower acceleration error and outperforms the existing state-of-the-art methods.
arXiv Detail & Related papers (2023-11-20T10:53:59Z)
- SceNeRFlow: Time-Consistent Reconstruction of General Dynamic Scenes [75.9110646062442]
We propose SceNeRFlow to reconstruct a general, non-rigid scene in a time-consistent manner.
Our method takes multi-view RGB videos and background images from static cameras with known camera parameters as input.
We show experimentally that, unlike prior work that only handles small motion, our method enables the reconstruction of studio-scale motions.
arXiv Detail & Related papers (2023-08-16T09:50:35Z)
- Global-to-Local Modeling for Video-based 3D Human Pose and Shape Estimation [53.04781510348416]
Video-based 3D human pose and shape estimation is evaluated by intra-frame accuracy and inter-frame smoothness.
We propose to structurally decouple the modeling of long-term and short-term correlations in an end-to-end framework, the Global-to-Local Transformer (GLoT).
Our GLoT surpasses previous state-of-the-art methods with the fewest model parameters on popular benchmarks, i.e., 3DPW, MPI-INF-3DHP, and Human3.6M.
arXiv Detail & Related papers (2023-03-26T14:57:49Z)
- Shakes on a Plane: Unsupervised Depth Estimation from Unstabilized Photography [54.36608424943729]
We show that in a "long-burst", forty-two 12-megapixel RAW frames captured in a two-second sequence, there is enough parallax information from natural hand tremor alone to recover high-quality scene depth.
We devise a test-time optimization approach that fits a neural RGB-D representation to long-burst data and simultaneously estimates scene depth and camera motion.
arXiv Detail & Related papers (2022-12-22T18:54:34Z)
- Deep Two-Stream Video Inference for Human Body Pose and Shape Estimation [18.14237514372724]
We propose a new framework to generate 3D human pose and mesh from RGB videos.
We train a transformer-based two-stream temporal network to predict SMPL parameters.
The proposed algorithm is extensively evaluated on the Human3.6M and 3DPW datasets.
arXiv Detail & Related papers (2021-10-22T10:01:13Z) - Improving Robustness and Accuracy via Relative Information Encoding in
3D Human Pose Estimation [59.94032196768748]
We propose a relative information encoding method that yields positional and temporal enhanced representations.
Our method outperforms state-of-the-art methods on two public datasets.
arXiv Detail & Related papers (2021-07-29T14:12:19Z) - THUNDR: Transformer-based 3D HUmaN Reconstruction with Markers [67.8628917474705]
THUNDR is a transformer-based deep neural network methodology to reconstruct the 3D pose and shape of people.
We show state-of-the-art results on Human3.6M and 3DPW, for both the fully-supervised and the self-supervised models.
We observe very solid 3D reconstruction performance for difficult human poses collected in the wild.
arXiv Detail & Related papers (2021-06-17T09:09:24Z) - Temporal Consistency Loss for High Resolution Textured and Clothed
3DHuman Reconstruction from Monocular Video [35.42021156572568]
We present a novel method to learn temporally consistent 3D reconstruction of clothed people from a monocular video.
The proposed advances improve the temporal consistency and accuracy of both the 3D reconstruction and texture prediction from a monocular video.
arXiv Detail & Related papers (2021-04-19T13:04:29Z)
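The last entry above names a temporal consistency loss, and an earlier one reports lower acceleration error; both revolve around penalizing frame-to-frame jitter in per-frame predictions. Below is a minimal sketch of one generic form of such a term, first and second differences of predicted mesh vertices; the weights, the use of vertices rather than joints, and the squared penalty are illustrative assumptions, not the exact loss from any paper listed here.

```python
# Generic temporal consistency terms for per-frame mesh predictions.
# Illustrative only: the velocity/acceleration split and weights are assumed.
import torch


def temporal_consistency_loss(verts, w_vel=1.0, w_acc=1.0):
    """verts: (T, V, 3) predicted mesh vertices for T consecutive frames."""
    vel = verts[1:] - verts[:-1]     # first differences: per-frame velocity
    acc = vel[1:] - vel[:-1]         # second differences: acceleration
    # Penalize jitter: large frame-to-frame velocity and acceleration.
    return w_vel * vel.pow(2).mean() + w_acc * acc.pow(2).mean()


# Usage: added to the per-frame reconstruction loss during training.
verts = torch.randn(8, 6890, 3, requires_grad=True)  # 6890 = SMPL vertex count
loss = temporal_consistency_loss(verts)
loss.backward()
```

The acceleration term is also the training-time counterpart of the acceleration-error metric that several of the video-based methods above report at evaluation time.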