Multi-modal 3D Human Pose Estimation with 2D Weak Supervision in
Autonomous Driving
- URL: http://arxiv.org/abs/2112.12141v1
- Date: Wed, 22 Dec 2021 18:57:16 GMT
- Title: Multi-modal 3D Human Pose Estimation with 2D Weak Supervision in
Autonomous Driving
- Authors: Jingxiao Zheng, Xinwei Shi, Alexander Gorban, Junhua Mao, Yang Song,
Charles R. Qi, Ting Liu, Visesh Chari, Andre Cornman, Yin Zhou, Congcong Li,
Dragomir Anguelov
- Abstract summary: 3D human pose estimation (HPE) in autonomous vehicles (AV) differs from other use cases with respect to many factors.
Data collected for other use cases (such as virtual reality, gaming, and animation) may not be usable for AV applications.
We propose one of the first approaches to alleviate this problem in the AV setting.
- Score: 74.74519047735916
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: 3D human pose estimation (HPE) in autonomous vehicles (AV) differs from other
use cases with respect to many factors, including the 3D resolution and range of data,
absence of dense depth maps, failure modes for LiDAR, relative location between
the camera and LiDAR, and a high bar for estimation accuracy. Data collected
for other use cases (such as virtual reality, gaming, and animation) may
therefore not be usable for AV applications. This necessitates the collection
and annotation of a large amount of 3D data for HPE in AV, which is
time-consuming and expensive. In this paper, we propose one of the first
approaches to alleviate this problem in the AV setting. Specifically, we
propose a multi-modal approach which uses 2D labels on RGB images as weak
supervision to perform 3D HPE. The proposed multi-modal architecture
incorporates LiDAR and camera inputs with an auxiliary segmentation branch. On
the Waymo Open Dataset, our approach achieves a 22% relative improvement over a
camera-only 2D HPE baseline and a 6% improvement over a LiDAR-only model.
Finally, careful ablation studies and a parts-based analysis illustrate the
advantages of each of our contributions.
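To make the weak-supervision idea concrete, below is a minimal sketch of a 2D reprojection loss: predicted 3D keypoints are projected through the camera intrinsics and penalized by their pixel distance to the 2D labels, with invisible joints masked out. The function names, the pinhole model, and the plain-L1 penalty are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def project_to_image(joints_3d, K):
    """Project Nx3 camera-frame joints to Nx2 pixels with 3x3 pinhole intrinsics K."""
    uvw = joints_3d @ K.T              # (N, 3): rows are [u*z, v*z, z]
    return uvw[:, :2] / uvw[:, 2:3]    # perspective divide

def weak_2d_loss(pred_joints_3d, labels_2d, visibility, K):
    """Mean per-joint L1 reprojection error, averaged over visible joints only."""
    proj = project_to_image(pred_joints_3d, K)
    err = np.abs(proj - labels_2d).sum(axis=1)   # per-joint L1 in pixels
    mask = visibility.astype(float)
    return (err * mask).sum() / max(mask.sum(), 1.0)

# Toy usage: 3 joints about 5 m in front of a camera with 1000 px focal length.
K = np.array([[1000.0, 0.0, 640.0], [0.0, 1000.0, 480.0], [0.0, 0.0, 1.0]])
pred = np.array([[0.1, -0.2, 5.0], [0.0, 0.0, 5.0], [-0.1, 0.3, 5.2]])
labels = project_to_image(pred + 0.01, K)        # perturbed stand-in "labels"
print(weak_2d_loss(pred, labels, np.ones(3), K))
```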
Related papers
- UPose3D: Uncertainty-Aware 3D Human Pose Estimation with Cross-View and Temporal Cues [55.69339788566899]
UPose3D is a novel approach for multi-view 3D human pose estimation.
It improves robustness and flexibility without requiring direct 3D annotations.
arXiv Detail & Related papers (2024-04-23T00:18:00Z)
- Two Views Are Better than One: Monocular 3D Pose Estimation with Multiview Consistency [0.493599216374976]
We propose a novel loss function, multiview consistency, which enables adding training data with only 2D supervision.
Our experiments demonstrate that two views offset by 90 degrees are enough to obtain good performance, with only marginal improvements by adding more views.
This research introduces new possibilities for domain adaptation in 3D pose estimation, providing a practical and cost-effective solution to customize models for specific applications.
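A minimal sketch of what such a multiview consistency term could look like: 3D poses predicted independently from two views should agree, up to the known relative camera rotation, once root-centering removes the translation ambiguity. The names and the root-centering choice below are assumptions, not the paper's exact formulation.

```python
import numpy as np

def rotation_z(deg):
    """Rotation about the vertical axis by `deg` degrees (e.g., two cameras offset by 90)."""
    t = np.deg2rad(deg)
    return np.array([[np.cos(t), -np.sin(t), 0.0],
                     [np.sin(t),  np.cos(t), 0.0],
                     [0.0,        0.0,       1.0]])

def multiview_consistency(pose_a, pose_b, R_ab):
    """Mean per-joint distance between the view-a pose mapped into view b and the
    pose predicted from view b; joint 0 is treated as the root joint."""
    a_centered = pose_a - pose_a[0]
    b_centered = pose_b - pose_b[0]
    return np.linalg.norm(a_centered @ R_ab.T - b_centered, axis=1).mean()

pose_a = np.random.randn(17, 3)
R = rotation_z(90.0)
pose_b = pose_a @ R.T                 # perfectly consistent second view
print(multiview_consistency(pose_a, pose_b, R))  # ~0 for consistent predictions
```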
arXiv Detail & Related papers (2023-11-21T08:21:55Z)
- Weakly Supervised Multi-Modal 3D Human Body Pose Estimation for Autonomous Driving [0.5735035463793008]
3D human pose estimation is crucial for enabling autonomous vehicles (AVs) to make informed decisions and respond proactively in critical road scenarios.
We present a simple yet efficient weakly supervised approach for 3D HPE in the AV context by employing a high-level sensor fusion between camera and LiDAR data.
Our approach outperforms state-of-the-art results by up to ~13% on the Waymo Open Dataset in the weakly supervised setting.
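One common way to realize high-level camera-LiDAR fusion for weak supervision is to lift 2D keypoints into 3D by borrowing depth from nearby projected LiDAR returns. The sketch below illustrates that general idea with a nearest-neighbor depth lookup; the names and the lookup rule are assumptions, not this paper's exact method.

```python
import numpy as np

def lift_keypoints(kpts_2d, lidar_xyz, K):
    """Assign each 2D keypoint the depth of the nearest projected LiDAR return
    (in pixel space), then back-project it to a 3D camera-frame point.
    Assumes all LiDAR points are in front of the camera (z > 0)."""
    proj = lidar_xyz @ K.T
    uv = proj[:, :2] / proj[:, 2:3]              # LiDAR points in pixel coords
    depth = lidar_xyz[:, 2]
    lifted = []
    for kp in kpts_2d:
        nearest = np.argmin(np.linalg.norm(uv - kp, axis=1))
        z = depth[nearest]                       # borrowed depth for this keypoint
        x = (kp[0] - K[0, 2]) * z / K[0, 0]      # inverse pinhole projection
        y = (kp[1] - K[1, 2]) * z / K[1, 1]
        lifted.append([x, y, z])
    return np.array(lifted)                      # (N, 3) pseudo 3D keypoints
```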
arXiv Detail & Related papers (2023-07-27T14:28:50Z)
- DiffuPose: Monocular 3D Human Pose Estimation via Denoising Diffusion Probabilistic Model [25.223801390996435]
This paper focuses on reconstructing a 3D pose from a single 2D keypoint detection.
We build a novel diffusion-based framework to effectively sample diverse 3D poses from an off-the-shelf 2D detector.
We evaluate our method on the widely adopted Human3.6M and HumanEva-I datasets.
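As a rough illustration of how a diffusion model can sample 3D poses conditioned on 2D detections, here is schematic DDPM-style ancestral sampling. The noise schedule and the stand-in denoiser are toy assumptions; DiffuPose's actual network and conditioning are more involved.

```python
import numpy as np

T = 50
betas = np.linspace(1e-4, 0.02, T)   # linear noise schedule (assumed)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def toy_denoiser(x_t, t, cond_2d):
    """Stand-in for a learned network eps_theta(x_t, t, 2D keypoints). It predicts
    zero noise here; a trained model would predict the actual noise so that
    samples land on the 3D pose manifold consistent with the 2D condition."""
    return np.zeros_like(x_t)

def sample_pose(cond_2d, n_joints=17, rng=np.random.default_rng(0)):
    """DDPM ancestral sampling: start from Gaussian noise and iteratively denoise,
    conditioning every step on the off-the-shelf 2D detections."""
    x = rng.standard_normal((n_joints, 3))
    for t in reversed(range(T)):
        eps = toy_denoiser(x, t, cond_2d)
        mean = (x - betas[t] / np.sqrt(1 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        noise = rng.standard_normal(x.shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
    return x                                     # one sampled 3D pose hypothesis
```

Drawing several samples from `sample_pose` with the same 2D condition is what yields the diverse pose hypotheses the summary describes.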
arXiv Detail & Related papers (2022-12-06T07:22:20Z)
- Benchmarking the Robustness of LiDAR-Camera Fusion for 3D Object Detection [58.81316192862618]
Two critical sensors for 3D perception in autonomous driving are the camera and the LiDAR.
Fusing these two modalities can significantly boost the performance of 3D perception models.
We benchmark the robustness of state-of-the-art fusion methods for the first time.
arXiv Detail & Related papers (2022-05-30T09:35:37Z)
- PLUME: Efficient 3D Object Detection from Stereo Images [95.31278688164646]
Existing methods tackle the problem in two stages: first, depth estimation is performed and a pseudo-LiDAR point cloud representation is computed from the depth estimates; then, object detection is performed in 3D space.
We propose a model that unifies these two tasks in the same metric space.
Our approach achieves state-of-the-art performance on the challenging KITTI benchmark, with significantly reduced inference time compared with existing methods.
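The pseudo-LiDAR intermediate referenced above is simply a back-projection of a dense depth map through the camera intrinsics; a minimal sketch follows (function name and intrinsics layout are assumed).

```python
import numpy as np

def depth_to_pseudo_lidar(depth, K):
    """Back-project an HxW depth map into an (H*W, 3) camera-frame point cloud:
    X = (u - cx) * Z / fx,  Y = (v - cy) * Z / fy,  Z = depth."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinate grids
    x = (u - K[0, 2]) * depth / K[0, 0]
    y = (v - K[1, 2]) * depth / K[1, 1]
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)
```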
arXiv Detail & Related papers (2021-01-17T05:11:38Z)
- Synthetic Training for Monocular Human Mesh Recovery [100.38109761268639]
This paper aims to estimate 3D meshes of multiple body parts with large differences in scale from a single RGB image.
The main challenge is the lack of training data with complete 3D annotations of all body parts in 2D images.
We propose a depth-to-scale (D2S) projection to incorporate the depth difference into the projection function to derive per-joint scale variants.
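One plausible reading of a depth-to-scale projection, offered only as an illustration and not necessarily the paper's exact D2S definition: rather than one weak-perspective scale for the whole body, each joint receives a scale derived from its depth offset relative to the root joint.

```python
import numpy as np

def d2s_project(joints_3d, f, root_depth):
    """Per-joint scale s_j = f / (Z_root + dZ_j): joints closer to the camera
    receive a larger scale than the single weak-perspective scale f / Z_root.
    Joint 0 is treated as the root; `root_depth` is its assumed absolute depth."""
    dz = joints_3d[:, 2] - joints_3d[0, 2]     # per-joint depth offsets w.r.t. root
    scale = f / (root_depth + dz)              # per-joint scale variants
    return joints_3d[:, :2] * scale[:, None]   # projected 2D coordinates
```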
arXiv Detail & Related papers (2020-10-27T03:31:35Z)
- Multi-Person Absolute 3D Human Pose Estimation with Weak Depth Supervision [0.0]
We introduce a network that can be trained with additional RGB-D images in a weakly supervised fashion.
Our algorithm is a monocular, multi-person, absolute pose estimator.
We evaluate the algorithm on several benchmarks, showing a consistent improvement in error rates.
arXiv Detail & Related papers (2020-04-08T13:29:22Z)
- Lightweight Multi-View 3D Pose Estimation through Camera-Disentangled Representation [57.11299763566534]
We present a solution to recover 3D pose from multi-view images captured with spatially calibrated cameras.
We exploit 3D geometry to fuse input images into a unified latent representation of pose, which is disentangled from camera view-points.
Our architecture then conditions the learned representation on camera projection operators to produce accurate per-view 2d detections.
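A very rough sketch of the described pattern, with toy stand-ins for the learned modules (translation is ignored and all names are assumptions): per-view features are mapped into a shared, camera-agnostic latent, fused by averaging, and then re-conditioned on each camera's rotation to emit per-view 2D outputs.

```python
import numpy as np

def encode_view(view_feat, R):
    """Stand-in encoder: map an (N, 3) per-view feature into a view-independent
    latent by undoing the camera rotation (a real model learns this mapping)."""
    return view_feat @ R          # rows become R^T @ feat: shared-frame coords

def fuse_and_decode(view_feats, rotations):
    """Average the camera-disentangled latents into one unified pose code, then
    condition decoding on each camera's rotation to produce per-view 2D outputs."""
    latents = [encode_view(f, R) for f, R in zip(view_feats, rotations)]
    pose_code = np.mean(latents, axis=0)                   # unified, view-agnostic
    return [(pose_code @ R.T)[:, :2] for R in rotations]   # per-view 2D detections
```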
arXiv Detail & Related papers (2020-04-05T12:52:29Z)