Geometry-Biased Transformer for Robust Multi-View 3D Human Pose
Reconstruction
- URL: http://arxiv.org/abs/2312.17106v1
- Date: Thu, 28 Dec 2023 16:30:05 GMT
- Authors: Olivier Moliner, Sangxia Huang and Kalle Åström
- Abstract summary: We propose a novel encoder-decoder Transformer architecture to estimate 3D poses from multi-view 2D pose sequences.
We conduct experiments on three benchmark public datasets, Human3.6M, CMU Panoptic and Occlusion-Persons.
- Score: 3.069335774032178
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We address the challenges in estimating 3D human poses from multiple views
under occlusion and with limited overlapping views. We approach multi-view,
single-person 3D human pose reconstruction as a regression problem and propose
a novel encoder-decoder Transformer architecture to estimate 3D poses from
multi-view 2D pose sequences. The encoder refines 2D skeleton joints detected
across different views and times, fusing multi-view and temporal information
through global self-attention. We enhance the encoder by incorporating a
geometry-biased attention mechanism, effectively leveraging geometric
relationships between views. Additionally, we use detection scores provided by
the 2D pose detector to further guide the encoder's attention based on the
reliability of the 2D detections. The decoder subsequently regresses the 3D
pose sequence from these refined tokens, using pre-defined queries for each
joint. To enhance the generalization of our method to unseen scenes and improve
resilience to missing joints, we implement strategies including scene
centering, synthetic views, and token dropout. We conduct extensive experiments
on three benchmark public datasets, Human3.6M, CMU Panoptic and
Occlusion-Persons. Our results demonstrate the efficacy of our approach,
particularly in occluded scenes and when few views are available, which are
traditionally challenging scenarios for triangulation-based methods.
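The abstract does not give the exact form of the geometry-biased or score-guided attention. As a rough illustration only, the sketch below shows one common way to bias attention: an additive term on the attention logits before the softmax. Here the bias is the log of each key token's 2D detection score, so low-confidence detections receive less attention. The function names `biased_attention` and `confidence_bias`, the log-score bias, and the toy data are assumptions made for this example, not the authors' formulation.

```python
import numpy as np

def biased_attention(q, k, v, bias):
    """Scaled dot-product attention with an additive bias on the logits.

    q, k, v: (n_tokens, d) arrays; bias: broadcastable to (n_tokens, n_tokens).
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d) + bias
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

def confidence_bias(scores, eps=1e-6):
    """Hypothetical bias: log detection score of each key token, shared by all queries."""
    return np.log(np.clip(scores, eps, 1.0))[None, :]

# Toy data: 6 joint tokens with 8-dimensional features (illustrative only).
rng = np.random.default_rng(0)
n_tokens, d = 6, 8
x = rng.normal(size=(n_tokens, d))
scores = np.array([0.9, 0.8, 0.05, 0.95, 0.7, 0.85])  # token 2 is unreliable

out, w = biased_attention(x, x, x, confidence_bias(scores))
_, w_plain = biased_attention(x, x, x, np.zeros((n_tokens, n_tokens)))
```

With this bias, every query attends less to the low-score token (index 2) than it would without the bias. A geometric bias could be added in the same way, as a further additive term computed from the relationship between camera views.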
Related papers
- GEOcc: Geometrically Enhanced 3D Occupancy Network with Implicit-Explicit Depth Fusion and Contextual Self-Supervision [49.839374549646884]
This paper presents GEOcc, a Geometric-Enhanced Occupancy network tailored for vision-only surround-view perception.
Our approach achieves state-of-the-art performance on the Occ3D-nuScenes dataset with the lowest image resolution and the most lightweight image backbone.
arXiv Detail & Related papers (2024-05-17T07:31:20Z)
- UPose3D: Uncertainty-Aware 3D Human Pose Estimation with Cross-View and Temporal Cues [55.69339788566899]
UPose3D is a novel approach for multi-view 3D human pose estimation.
It improves robustness and flexibility without requiring direct 3D annotations.
arXiv Detail & Related papers (2024-04-23T00:18:00Z)
- Viewpoint Equivariance for Multi-View 3D Object Detection [35.4090127133834]
State-of-the-art methods focus on reasoning and decoding object bounding boxes from multi-view camera input.
We introduce VEDet, a novel 3D object detection framework that exploits 3D multi-view geometry.
arXiv Detail & Related papers (2023-03-25T19:56:41Z)
- Unsupervised 3D Keypoint Discovery with Multi-View Geometry [104.76006413355485]
We propose an algorithm that learns to discover 3D keypoints on human bodies from multiple-view images without supervision or labels.
Our approach discovers more interpretable and accurate 3D keypoints compared to other state-of-the-art unsupervised approaches.
arXiv Detail & Related papers (2022-11-23T10:25:12Z)
- IVT: An End-to-End Instance-guided Video Transformer for 3D Pose Estimation [6.270047084514142]
Video 3D human pose estimation aims to localize the 3D coordinates of human joints from videos.
IVT enables learning temporal contextual depth information from visual features and 3D poses directly from video frames.
Experiments on three widely-used 3D pose estimation benchmarks show that the proposed IVT achieves state-of-the-art performance.
arXiv Detail & Related papers (2022-08-06T02:36:33Z)
- A Dual-Masked Auto-Encoder for Robust Motion Capture with Spatial-Temporal Skeletal Token Completion [13.88656793940129]
We propose an adaptive, identity-aware triangulation module to reconstruct 3D joints and identify the missing joints for each identity.
We then propose a Dual-Masked Auto-Encoder (D-MAE) which encodes the joint status with both skeletal-structural and temporal position encoding for trajectory completion.
In order to demonstrate the proposed model's capability in dealing with severe data loss scenarios, we contribute a high-accuracy and challenging motion capture dataset.
arXiv Detail & Related papers (2022-07-15T10:00:43Z)
- Encoder-decoder with Multi-level Attention for 3D Human Shape and Pose Estimation [61.98690211671168]
We propose a Multi-level Attention-Decoder Network (MAED) to model multi-level attentions in a unified framework.
With the training set of 3DPW, MAED outperforms previous state-of-the-art methods by 6.2, 7.2, and 2.4 mm of PA-MPJPE.
arXiv Detail & Related papers (2021-09-06T09:06:17Z)
- Real-Time Multi-View 3D Human Pose Estimation using Semantic Feedback to Smart Edge Sensors [28.502280038100167]
2D joint detection for each camera view is performed locally on a dedicated embedded inference processor.
3D poses are recovered from 2D joints on a central backend, based on triangulation and a body model.
The whole pipeline is capable of real-time operation.
arXiv Detail & Related papers (2021-06-28T14:00:00Z)
- Multi-View Multi-Person 3D Pose Estimation with Plane Sweep Stereo [71.59494156155309]
Existing approaches for multi-view 3D pose estimation explicitly establish cross-view correspondences to group 2D pose detections from multiple camera views.
We present our multi-view 3D pose estimation approach based on plane sweep stereo to jointly address the cross-view fusion and 3D pose reconstruction in a single shot.
arXiv Detail & Related papers (2021-04-06T03:49:35Z)
- Fusing Wearable IMUs with Multi-View Images for Human Pose Estimation: A Geometric Approach [76.10879433430466]
We propose to estimate 3D human pose from multi-view images and a few IMUs attached to a person's limbs.
It operates by first detecting 2D poses from the two signals, and then lifting them to the 3D space.
The simple two-step approach reduces the error of the state-of-the-art by a large margin on a public dataset.
arXiv Detail & Related papers (2020-03-25T00:26:54Z)