Capturing the motion of every joint: 3D human pose and shape estimation
with independent tokens
- URL: http://arxiv.org/abs/2303.00298v1
- Date: Wed, 1 Mar 2023 07:48:01 GMT
- Title: Capturing the motion of every joint: 3D human pose and shape estimation
with independent tokens
- Authors: Sen Yang and Wen Heng and Gang Liu and Guozhong Luo and Wankou Yang
and Gang Yu
- Abstract summary: We present a novel method to estimate 3D human pose and shape from monocular videos.
The proposed method attains superior performance on the 3DPW and Human3.6M datasets.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present a novel method to estimate 3D human pose and shape
from monocular videos. This task requires directly recovering pixel-aligned
3D human pose and body shape from monocular images or videos, which is
challenging due to its inherent ambiguity. To improve precision, existing
methods rely heavily on an initialized mean pose and shape as prior estimates
and on parameter regression in an iterative error-feedback manner. In addition,
video-based approaches model overall changes in image-level features to
temporally enhance single-frame features, but they fail to capture rotational
motion at the joint level and cannot guarantee local temporal consistency. To
address these issues, we propose a novel Transformer-based model with a design
of independent tokens. First, we introduce three types of tokens independent
of the image feature: joint rotation tokens, a shape token, and a camera
token. By progressively interacting with image features
through Transformer layers, these tokens learn to encode the prior knowledge of
human 3D joint rotations, body shape, and position information from large-scale
data, and are updated to estimate SMPL parameters conditioned on a given image.
Second, benefiting from the proposed token-based representation, we further use
a temporal model to focus on capturing the rotational temporal information of
each joint, which is empirically conducive to preventing large jitters in local
parts. Despite being conceptually simple, the proposed method attains superior
performance on the 3DPW and Human3.6M datasets. Using ResNet-50 and
Transformer architectures, it obtains a 42.0 mm PA-MPJPE error on the
challenging 3DPW dataset, outperforming state-of-the-art counterparts by a
large margin. Code will be publicly available at
https://github.com/yangsenius/INT_HMR_Model
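
To make the independent-token design concrete, below is a minimal PyTorch sketch of the idea as described in the abstract: learnable joint-rotation, shape, and camera tokens cross-attend to image features through Transformer layers, and linear heads read SMPL parameters off the updated tokens. This is an illustration, not the authors' released code; the class name `IndependentTokenHead`, the layer counts, the feature dimension, and the 6D rotation parameterization are all assumptions.

```python
# A minimal sketch (assumed design, not the released INT_HMR code) of
# SMPL-parameter regression with tokens that are independent of the image.
import torch
import torch.nn as nn

class IndependentTokenHead(nn.Module):
    def __init__(self, num_joints=24, dim=256, depth=3, heads=8):
        super().__init__()
        # Learnable tokens, initialized independently of any image feature.
        self.joint_tokens = nn.Parameter(torch.randn(num_joints, dim))  # one rotation token per SMPL joint
        self.shape_token = nn.Parameter(torch.randn(1, dim))            # body-shape token
        self.camera_token = nn.Parameter(torch.randn(1, dim))           # camera token
        # Transformer decoder layers: tokens (queries) progressively
        # cross-attend to image features (keys/values) and are updated.
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=depth)
        # Linear heads map the updated tokens to SMPL parameters.
        self.rot_head = nn.Linear(dim, 6)     # 6D rotation per joint (assumed parameterization)
        self.shape_head = nn.Linear(dim, 10)  # SMPL shape coefficients (betas)
        self.cam_head = nn.Linear(dim, 3)     # weak-perspective camera: scale + 2D translation

    def forward(self, img_feats):
        # img_feats: (B, N, dim) flattened feature map, e.g. a 7x7
        # ResNet-50 grid projected to `dim` channels, so N = 49.
        b = img_feats.size(0)
        tokens = torch.cat([self.joint_tokens, self.shape_token, self.camera_token], dim=0)
        tokens = tokens.unsqueeze(0).expand(b, -1, -1)
        out = self.decoder(tokens, img_feats)  # tokens conditioned on the given image
        j = self.joint_tokens.size(0)
        rot6d = self.rot_head(out[:, :j])      # (B, num_joints, 6)
        betas = self.shape_head(out[:, j])     # (B, 10)
        cam = self.cam_head(out[:, j + 1])     # (B, 3)
        return rot6d, betas, cam
```

For example, `IndependentTokenHead()(torch.randn(2, 49, 256))` returns tensors of shapes (2, 24, 6), (2, 10), and (2, 3) for rotations, shape, and camera respectively.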
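
The abstract's second idea, capturing rotational temporal information per joint, admits a similarly short sketch: run temporal self-attention separately over each joint's rotation tokens across frames, which is one plausible way to suppress local jitter. Again, `PerJointTemporalEncoder` and all hyperparameters here are hypothetical.

```python
# A minimal sketch (assumed design) of a per-joint temporal model:
# self-attention runs only along the time axis, one sequence per joint.
import torch
import torch.nn as nn

class PerJointTemporalEncoder(nn.Module):
    def __init__(self, dim=256, depth=2, heads=8, max_len=64):
        super().__init__()
        self.time_embed = nn.Parameter(torch.randn(max_len, dim))  # learned temporal positions
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, joint_tokens):
        # joint_tokens: (B, T, J, dim) per-frame, per-joint rotation tokens
        b, t, j, d = joint_tokens.shape
        x = joint_tokens.permute(0, 2, 1, 3).reshape(b * j, t, d)  # one temporal sequence per joint
        x = x + self.time_embed[:t]
        x = self.encoder(x)  # attention only across frames, never across joints
        return x.reshape(b, j, t, d).permute(0, 2, 1, 3)  # back to (B, T, J, dim)
```

Restricting attention to the time axis keeps each joint's motion modeled independently, matching the paper's stated goal of joint-level rotational temporal consistency.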