TransFusion: Cross-view Fusion with Transformer for 3D Human Pose Estimation
- URL: http://arxiv.org/abs/2110.09554v1
- Date: Mon, 18 Oct 2021 18:08:18 GMT
- Title: TransFusion: Cross-view Fusion with Transformer for 3D Human Pose Estimation
- Authors: Haoyu Ma, Liangjian Chen, Deying Kong, Zhe Wang, Xingwei Liu, Hao Tang, Xiangyi Yan, Yusheng Xie, Shih-Yao Lin, Xiaohui Xie
- Abstract summary: We introduce a transformer framework for multi-view 3D pose estimation.
Inspired by previous multi-modal transformers, we design a unified transformer architecture, named TransFusion.
We propose the concept of epipolar field to encode 3D positional information into the transformer model.
- Score: 21.37032015978738
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Estimating the 2D human poses in each view is typically the first step in
calibrated multi-view 3D pose estimation. But the performance of 2D pose
detectors suffers from challenging situations such as occlusions and oblique
viewing angles. To address these challenges, previous works derive
point-to-point correspondences between different views from epipolar geometry
and utilize the correspondences to merge prediction heatmaps or feature
representations. Instead of post-prediction merge/calibration, here we
introduce a transformer framework for multi-view 3D pose estimation, aiming at
directly improving individual 2D predictors by integrating information from
different views. Inspired by previous multi-modal transformers, we design a
unified transformer architecture, named TransFusion, to fuse cues from both
current views and neighboring views. Moreover, we propose the concept of
epipolar field to encode 3D positional information into the transformer model.
The 3D position encoding guided by the epipolar field provides an efficient way
of encoding correspondences between pixels of different views. Experiments on
Human3.6M and Ski-Pose show that our method is more efficient and achieves
consistent improvements over other fusion methods. Specifically, we
achieve 25.8 mm MPJPE on Human3.6M with only 5M parameters at 256 x 256
resolution.
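The epipolar field rests on textbook two-view geometry: the fundamental matrix between two calibrated cameras maps a pixel in one view to its epipolar line in the other, and a per-pixel distance map to that line is the kind of quantity a 3D positional encoding can be built from. Below is a minimal NumPy sketch of that point-to-line correspondence; the intrinsics and camera placement are illustrative assumptions, not values from the paper.

```python
import numpy as np

def skew(v):
    """Skew-symmetric matrix so that skew(v) @ u == np.cross(v, u)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def fundamental_from_projections(P1, P2):
    """F maps a pixel in view 1 to its epipolar line in view 2: l2 = F @ x1."""
    _, _, Vt = np.linalg.svd(P1)
    C1 = Vt[-1]                                  # camera centre of view 1 (homogeneous)
    e2 = P2 @ C1                                 # epipole in view 2
    return skew(e2) @ P2 @ np.linalg.pinv(P1)

def epipolar_field(P1, P2, x1, h, w):
    """Distance of every view-2 pixel to the epipolar line of view-1 pixel x1."""
    l = fundamental_from_projections(P1, P2) @ np.array([x1[0], x1[1], 1.0])
    ys, xs = np.mgrid[0:h, 0:w]
    return np.abs(l[0] * xs + l[1] * ys + l[2]) / np.hypot(l[0], l[1])

# Illustrative setup: identical intrinsics, second camera shifted along x.
K = np.array([[1000.0, 0.0, 128.0],
              [0.0, 1000.0, 128.0],
              [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-0.5], [0.0], [0.0]])])

field = epipolar_field(P1, P2, x1=(100, 120), h=256, w=256)
print(field.shape, field.min())                  # (256, 256); ~0 on the line itself
```

In the paper this field guides the 3D positional encoding inside the transformer; the sketch shows only the underlying geometry.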
Related papers
- Enhancing 3D Human Pose Estimation Amidst Severe Occlusion with Dual Transformer Fusion [13.938406073551844]
This paper introduces the Dual Transformer Fusion (DTF) algorithm, a novel approach for obtaining holistic 3D pose estimates.
To enable precise 3D Human Pose Estimation, our approach leverages the innovative DTF architecture, which first generates a pair of intermediate views.
Our approach outperforms existing state-of-the-art methods on both datasets, yielding substantial improvements.
arXiv Detail & Related papers (2024-10-06T18:15:27Z)
- UPose3D: Uncertainty-Aware 3D Human Pose Estimation with Cross-View and Temporal Cues [55.69339788566899]
UPose3D is a novel approach for multi-view 3D human pose estimation.
It improves robustness and flexibility without requiring direct 3D annotations.
arXiv Detail & Related papers (2024-04-23T00:18:00Z)
- Geometry-Biased Transformer for Robust Multi-View 3D Human Pose Reconstruction [3.069335774032178]
We propose a novel encoder-decoder Transformer architecture to estimate 3D poses from multi-view 2D pose sequences.
We conduct experiments on three public benchmark datasets: Human3.6M, CMU Panoptic and Occlusion-Persons.
arXiv Detail & Related papers (2023-12-28T16:30:05Z)
- 3D Vision with Transformers: A Survey [114.86385193388439]
The success of the transformer architecture in natural language processing has drawn attention in the computer vision field.
We present a systematic and thorough review of more than 100 transformer-based methods for different 3D vision tasks.
We discuss transformer design in 3D vision, which allows it to process data with various 3D representations.
arXiv Detail & Related papers (2022-08-08T17:59:11Z)
- VTP: Volumetric Transformer for Multi-view Multi-person 3D Pose Estimation [4.603321798937854]
Volumetric Transformer Pose estimator (VTP) is the first 3D transformer framework for multi-view multi-person 3D human pose estimation.
VTP aggregates features from 2D keypoints in all camera views and learns the relationships in the 3D voxel space in an end-to-end fashion.
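To make "aggregating 2D features in 3D voxel space" concrete, here is a simplified sketch of the geometric gather step: project each voxel centre into every view and average the heatmap values found there. The function name and shapes are illustrative assumptions, not VTP's actual interface.

```python
import numpy as np

def unproject_heatmaps(heatmaps, Ps, centers):
    """Average each view's heatmap value at the projection of every voxel centre.

    heatmaps : (V, H, W) per-view score maps for one joint.
    Ps       : (V, 3, 4) camera projection matrices.
    centers  : (N, 3) voxel-centre world coordinates.
    Returns an (N,) per-voxel score; peaks suggest likely 3D joint locations.
    """
    V, H, W = heatmaps.shape
    X = np.hstack([centers, np.ones((len(centers), 1))])   # homogeneous (N, 4)
    scores = np.zeros(len(centers))
    for P, hm in zip(Ps, heatmaps):
        proj = X @ P.T                                     # (N, 3)
        z = np.maximum(proj[:, 2], 1e-9)                   # assume voxels in front
        u = np.round(proj[:, 0] / z).astype(int)
        v = np.round(proj[:, 1] / z).astype(int)
        inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
        scores[inside] += hm[v[inside], u[inside]]
    return scores / V
```

VTP replaces this fixed averaging with learned transformer attention over the voxel features; the sketch covers only the projection-and-gather step.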
arXiv Detail & Related papers (2022-05-25T09:26:42Z)
- CrossFormer: Cross Spatio-Temporal Transformer for 3D Human Pose Estimation [24.08170512746056]
3D human pose estimation can be handled by encoding the geometric dependencies between the body parts and enforcing the kinematic constraints.
Recent Transformer has been adopted to encode the long-range dependencies between the joints in the spatial and temporal domains.
We propose a novel pose estimation Transformer featuring rich representations of body joints critical for capturing subtle changes across frames.
arXiv Detail & Related papers (2022-03-24T23:40:11Z)
- Direct Multi-view Multi-person 3D Pose Estimation [138.48139701871213]
We present Multi-view Pose transformer (MvP) for estimating multi-person 3D poses from multi-view images.
MvP directly regresses the multi-person 3D poses in a clean and efficient way, without relying on intermediate tasks.
We show experimentally that our MvP model outperforms the state-of-the-art methods on several benchmarks while being much more efficient.
arXiv Detail & Related papers (2021-11-07T13:09:20Z)
- Multi-View Multi-Person 3D Pose Estimation with Plane Sweep Stereo [71.59494156155309]
Existing approaches for multi-view 3D pose estimation explicitly establish cross-view correspondences to group 2D pose detections from multiple camera views.
We present our multi-view 3D pose estimation approach based on plane sweep stereo to jointly address the cross-view fusion and 3D pose reconstruction in a single shot.
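The depth-sweep idea in miniature, for a single joint seen in two views: lift the view-1 detection to 3D at each candidate depth, project it into view 2, and keep the depth whose projection best matches the view-2 detection. All camera parameters and detections below are synthetic, and the paper sweeps whole score maps for all joints jointly rather than single points.

```python
import numpy as np

def backproject(K, uv, depth):
    """Back-project pixel uv at a hypothesised depth into camera coordinates."""
    x = (uv[0] - K[0, 2]) / K[0, 0]
    y = (uv[1] - K[1, 2]) / K[1, 1]
    return depth * np.array([x, y, 1.0])

def plane_sweep_depth(K1, K2, R, t, uv1, uv2, depths):
    """Score each depth hypothesis for a joint detected at uv1 (view 1) and uv2 (view 2)."""
    best_depth, best_err = None, np.inf
    for d in depths:
        X1 = backproject(K1, uv1, d)         # 3D point in the view-1 camera frame
        X2 = R @ X1 + t                      # same point in the view-2 frame
        proj = K2 @ X2
        uv = proj[:2] / proj[2]              # projected pixel in view 2
        err = np.linalg.norm(uv - np.asarray(uv2))
        if err < best_err:
            best_depth, best_err = d, err
    return best_depth, best_err

K = np.array([[1000.0, 0.0, 128.0], [0.0, 1000.0, 128.0], [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.array([-0.1, 0.0, 0.0])
d, err = plane_sweep_depth(K, K, R, t, uv1=(100, 120), uv2=(50, 120),
                           depths=np.linspace(1.0, 5.0, 201))
print(d, err)                                # ~2.0, ~0 for this synthetic setup
```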
arXiv Detail & Related papers (2021-04-06T03:49:35Z)
- 3D Human Pose Estimation with Spatial and Temporal Transformers [59.433208652418976]
We present PoseFormer, a purely transformer-based approach for 3D human pose estimation in videos.
Inspired by recent developments in vision transformers, we design a spatial-temporal transformer structure.
We quantitatively and qualitatively evaluate our method on two popular and standard benchmark datasets.
arXiv Detail & Related papers (2021-03-18T18:14:37Z)
- Epipolar Transformers [39.98487207625999]
A common approach to localizing 3D human joints in a synchronized and calibrated multi-view setup consists of two steps: 2D detection in each view, followed by triangulation.
However, the 2D detector struggles with challenging cases, such as occlusions and oblique viewing angles, that could potentially be better resolved in 3D.
We propose the differentiable "epipolar transformer", which enables the 2D detector to leverage 3D-aware features to improve 2D pose estimation.
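A minimal sketch of how such 3D-aware features can be gathered: sample intermediate features along the query pixel's epipolar line in the neighbouring view and combine them with dot-product attention. It uses nearest-neighbour rather than the paper's bilinear sampling and assumes a non-vertical line that crosses the image.

```python
import numpy as np

def fuse_along_epipolar_line(feat1, feat2, uv1, line, num_samples=64):
    """Attend over view-2 features sampled on the epipolar line of a view-1 pixel.

    feat1, feat2 : (C, H, W) feature maps of the two views.
    uv1          : query pixel (u, v) in view 1.
    line         : epipolar line (a, b, c) in view 2, with a*u + b*v + c = 0.
    """
    C, H, W = feat2.shape
    us = np.linspace(0, W - 1, num_samples)
    vs = (-line[2] - line[0] * us) / line[1]     # assumes b != 0 (non-vertical line)
    keep = (vs >= 0) & (vs <= H - 1)
    samples = feat2[:, vs[keep].astype(int), us[keep].astype(int)]   # (C, N)

    query = feat1[:, uv1[1], uv1[0]]             # (C,) feature at the query pixel
    scores = query @ samples / np.sqrt(C)        # dot-product attention scores
    attn = np.exp(scores - scores.max())
    attn /= attn.sum()
    return query + samples @ attn                # residual fusion of attended feature
```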
arXiv Detail & Related papers (2020-05-10T02:22:54Z)
- Fusing Wearable IMUs with Multi-View Images for Human Pose Estimation: A Geometric Approach [76.10879433430466]
We propose to estimate 3D human pose from multi-view images and a few IMUs attached to the person's limbs.
It operates by first detecting 2D poses from the two signals and then lifting them to 3D space.
The simple two-step approach reduces the error of the state-of-the-art by a large margin on a public dataset.
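The lifting step in such two-step pipelines is classically a linear triangulation of the per-view 2D detections. A minimal DLT sketch follows; note the paper's geometric approach additionally fuses IMU orientation cues, which this omits.

```python
import numpy as np

def triangulate(Ps, uvs):
    """Linear (DLT) triangulation of one joint from its 2D detections.

    Ps  : list of 3x4 camera projection matrices, one per view.
    uvs : list of (u, v) detections, one per view.
    Returns the 3D point that best satisfies uv ~ P @ X in a least-squares sense.
    """
    rows = []
    for P, (u, v) in zip(Ps, uvs):
        rows.append(u * P[2] - P[0])     # each view contributes two linear
        rows.append(v * P[2] - P[1])     # constraints on the homogeneous X
    _, _, Vt = np.linalg.svd(np.asarray(rows))
    X = Vt[-1]                           # right null vector = least-squares solution
    return X[:3] / X[3]
```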
arXiv Detail & Related papers (2020-03-25T00:26:54Z)