Multiple View Geometry Transformers for 3D Human Pose Estimation
- URL: http://arxiv.org/abs/2311.10983v1
- Date: Sat, 18 Nov 2023 06:32:40 GMT
- Title: Multiple View Geometry Transformers for 3D Human Pose Estimation
- Authors: Ziwei Liao, Jialiang Zhu, Chunyu Wang, Han Hu, Steven L. Waslander
- Abstract summary: We aim to improve the 3D reasoning ability of Transformers in multi-view 3D human pose estimation.
We propose a novel hybrid model, MVGFormer, which has a series of geometric and appearance modules organized in an iterative manner.
- Score: 35.26756920323391
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we aim to improve the 3D reasoning ability of Transformers in
multi-view 3D human pose estimation. Recent works have focused on end-to-end
learning-based transformer designs, which struggle to resolve geometric
information accurately, particularly during occlusion. Instead, we propose a
novel hybrid model, MVGFormer, which has a series of geometric and appearance
modules organized in an iterative manner. The geometry modules are
learning-free and handle all viewpoint-dependent 3D tasks geometrically, which
notably improves the model's generalization ability. The appearance modules are
learnable and are dedicated to estimating 2D poses from image signals
end-to-end, which enables them to achieve accurate estimates even when occlusion
occurs, leading to a model that is both accurate and generalizable to new
cameras and geometries. We evaluate our approach for both in-domain and
out-of-domain settings, where our model consistently outperforms
state-of-the-art methods, and especially does so by a significant margin in the
out-of-domain setting. We will release the code and models:
https://github.com/XunshanMan/MVGFormer.
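The abstract describes the geometry modules only as learning-free and viewpoint-dependent. One standard operation of that kind is linear (DLT) triangulation of a joint from its 2D estimates in several calibrated views, followed by reprojection into each view. The sketch below illustrates that general idea only; it is not taken from the MVGFormer code, and the camera matrices, the point, and the function names `project` and `triangulate_dlt` are hypothetical.

```python
import numpy as np


def project(P, X):
    """Project a 3D point X (3,) to pixel coordinates (2,) with a 3x4 camera matrix P."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]


def triangulate_dlt(Ps, xs):
    """Linear (DLT) triangulation of one joint from two or more views.

    Ps: list of 3x4 camera matrices; xs: list of (u, v) pixel observations.
    Each view contributes two linear constraints on the homogeneous point.
    """
    rows = []
    for P, (u, v) in zip(Ps, xs):
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    # The solution is the right singular vector with the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    Xh = Vt[-1]
    return Xh[:3] / Xh[3]


# Hypothetical two-camera rig (assumed values, for illustration only).
K = np.array([[1000.0, 0.0, 320.0],
              [0.0, 1000.0, 240.0],
              [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])

X_true = np.array([0.2, -0.1, 4.0])               # a made-up 3D joint position
xs = [project(P1, X_true), project(P2, X_true)]   # noise-free 2D observations
X_est = triangulate_dlt([P1, P2], xs)             # recovered 3D joint
```

Because nothing in this step is learned, it works unchanged for any camera layout whose projection matrices are known, which is consistent with the generalization argument made in the abstract.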
Related papers
- Unsupervised Learning of Category-Level 3D Pose from Object-Centric Videos [15.532504015622159]
Category-level 3D pose estimation is a fundamentally important problem in computer vision and robotics.
We tackle the problem of learning to estimate the category-level 3D pose only from casually taken object-centric videos.
arXiv Detail & Related papers (2024-07-05T09:43:05Z)
- UPose3D: Uncertainty-Aware 3D Human Pose Estimation with Cross-View and Temporal Cues [55.69339788566899]
UPose3D is a novel approach for multi-view 3D human pose estimation.
It improves robustness and flexibility without requiring direct 3D annotations.
arXiv Detail & Related papers (2024-04-23T00:18:00Z)
- SkelFormer: Markerless 3D Pose and Shape Estimation using Skeletal Transformers [57.46911575980854]
We introduce SkelFormer, a novel markerless motion capture pipeline for multi-view human pose and shape estimation.
Our method first uses off-the-shelf 2D keypoint estimators, pre-trained on large-scale in-the-wild data, to obtain 3D joint positions.
Next, we design a regression-based inverse-kinematic skeletal transformer that maps the joint positions to pose and shape representations from heavily noisy observations.
arXiv Detail & Related papers (2024-04-19T04:51:18Z)
- FrozenRecon: Pose-free 3D Scene Reconstruction with Frozen Depth Models [67.96827539201071]
We propose a novel test-time optimization approach for 3D scene reconstruction.
Our method achieves state-of-the-art cross-dataset reconstruction on five zero-shot testing datasets.
arXiv Detail & Related papers (2023-08-10T17:55:02Z)
- Pixel2Mesh++: 3D Mesh Generation and Refinement from Multi-View Images [82.32776379815712]
We study the problem of shape generation in 3D mesh representation from a small number of color images with or without camera poses.
We further improve the shape quality by leveraging cross-view information with a graph convolution network.
Our model is robust to the quality of the initial mesh and the error of camera pose, and can be combined with a differentiable function for test-time optimization.
arXiv Detail & Related papers (2022-04-21T03:42:31Z)
- Disentangled3D: Learning a 3D Generative Model with Disentangled Geometry and Appearance from Monocular Images [94.49117671450531]
State-of-the-art 3D generative models are GANs which use neural 3D volumetric representations for synthesis.
In this paper, we design a 3D GAN which can learn a disentangled model of objects, just from monocular observations.
arXiv Detail & Related papers (2022-03-29T22:03:18Z)
- Geometry-Free View Synthesis: Transformers and no 3D Priors [16.86600007830682]
We show that a transformer-based model can synthesize entirely novel views without any hand-engineered 3D biases.
This is achieved by (i) a global attention mechanism for implicitly learning long-range 3D correspondences between source and target views.
arXiv Detail & Related papers (2021-04-15T17:58:05Z)
- 3D Human Pose Estimation with Spatial and Temporal Transformers [59.433208652418976]
We present PoseFormer, a purely transformer-based approach for 3D human pose estimation in videos.
Inspired by recent developments in vision transformers, we design a spatial-temporal transformer structure.
We quantitatively and qualitatively evaluate our method on two popular and standard benchmark datasets.
arXiv Detail & Related papers (2021-03-18T18:14:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.