Pre-training a Density-Aware Pose Transformer for Robust LiDAR-based 3D Human Pose Estimation
- URL: http://arxiv.org/abs/2412.13454v1
- Date: Wed, 18 Dec 2024 02:54:30 GMT
- Title: Pre-training a Density-Aware Pose Transformer for Robust LiDAR-based 3D Human Pose Estimation
- Authors: Xiaoqi An, Lin Zhao, Chen Gong, Jun Li, Jian Yang
- Abstract summary: LiDAR-based 3D Human Pose Estimation is becoming a research focus.
Most of the existing methods use temporal information, multi-modal fusion, or SMPL optimization to correct biased results.
We propose a simple yet powerful method, which provides insights both on modeling and augmentation of point clouds.
- Score: 27.25933965875881
- Abstract: With the rapid development of autonomous driving, LiDAR-based 3D Human Pose Estimation (3D HPE) is becoming a research focus. However, due to the noise and sparsity of LiDAR-captured point clouds, robust human pose estimation remains challenging. Most of the existing methods use temporal information, multi-modal fusion, or SMPL optimization to correct biased results. In this work, we try to obtain sufficient information for 3D HPE only by modeling the intrinsic properties of low-quality point clouds. Hence, a simple yet powerful method is proposed, which provides insights on both the modeling and the augmentation of point clouds. Specifically, we first propose a concise and effective density-aware pose transformer (DAPT) to get stable keypoint representations. By using a set of joint anchors and a carefully designed exchange module, valid information is extracted from point clouds with different densities. Then 1D heatmaps are utilized to represent the precise locations of the keypoints. Secondly, a comprehensive LiDAR human synthesis and augmentation method is proposed to pre-train the model, enabling it to acquire a better human body prior. We increase the diversity of point clouds by randomly sampling human positions and orientations and by simulating occlusions through the addition of laser-level masks. Extensive experiments have been conducted on multiple datasets, including the IMU-annotated LidarHuman26M and SLOPER4D, and the manually annotated Waymo Open Dataset v2.0 (Waymo) and HumanM3. Our method demonstrates SOTA performance in all scenarios. In particular, compared with LPFormer on Waymo, we reduce the average MPJPE by 10.0 mm. Compared with PRN on SLOPER4D, we notably reduce the average MPJPE by 20.7 mm.
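The abstract mentions 1D heatmaps for keypoint localization and reports results as MPJPE (mean per-joint position error). The sketch below illustrates both ideas in plain Python. It is not the authors' implementation: the function names, the argmax-based decoding, the shared coordinate range `lo`/`hi`, and the bin layout are all assumptions made for illustration, and the paper may decode its heatmaps differently.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def decode_axis(heatmap: Sequence[float], lo: float, hi: float) -> float:
    """Map a 1D heatmap over [lo, hi] to a coordinate (hypothetical decoder).

    Assumption: the axis is split into len(heatmap) equal bins and the
    coordinate is the centre of the highest-scoring bin.
    """
    n = len(heatmap)
    if n == 0:
        raise ValueError("heatmap must not be empty")
    best = max(range(n), key=lambda i: heatmap[i])
    return lo + (best + 0.5) * (hi - lo) / n


def decode_joint(hx: Sequence[float], hy: Sequence[float],
                 hz: Sequence[float], lo: float, hi: float) -> Point:
    """Decode one joint from three independent per-axis heatmaps."""
    return (decode_axis(hx, lo, hi),
            decode_axis(hy, lo, hi),
            decode_axis(hz, lo, hi))


def mpjpe(pred: Sequence[Point], gt: Sequence[Point]) -> float:
    """Mean Euclidean distance between predicted and ground-truth joints.

    The result is in the same unit as the inputs (multiply by 1000 to go
    from metres to millimetres).
    """
    if len(pred) != len(gt) or len(pred) == 0:
        raise ValueError("pred and gt must be non-empty and the same length")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)


if __name__ == "__main__":
    # Toy example: two joints, ground truth and slightly offset predictions.
    gt: List[Point] = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    pred: List[Point] = [(0.0, 0.0, 0.003), (1.0, 0.004, 0.0)]
    print(f"MPJPE: {mpjpe(pred, gt) * 1000:.1f} mm")
```

With this layout the resolution of a decoded coordinate is bounded by the bin width `(hi - lo) / n`, which is one reason finer bins or a soft-argmax are often used in practice.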
Related papers
- CameraHMR: Aligning People with Perspective [54.05758012879385]
We address the challenge of accurate 3D human pose and shape estimation from monocular images.
Existing training datasets containing real images with pseudo ground truth (pGT) use SMPLify to fit SMPL to sparse 2D joint locations.
We make two contributions that improve pGT accuracy.
arXiv Detail & Related papers (2024-11-12T19:12:12Z)
- Neural Localizer Fields for Continuous 3D Human Pose and Shape Estimation [32.30055363306321]
We propose a paradigm for seamlessly unifying different human pose and shape-related tasks and datasets.
Our formulation is centered on the ability to query any arbitrary point of the human volume, and obtain its estimated location in 3D.
arXiv Detail & Related papers (2024-07-10T10:44:18Z)
- 4D Contrastive Superflows are Dense 3D Representation Learners [62.433137130087445]
We introduce SuperFlow, a novel framework designed to harness consecutive LiDAR-camera pairs for establishing pretraining objectives.
To further boost learning efficiency, we incorporate a plug-and-play view consistency module that enhances alignment of the knowledge distilled from camera views.
arXiv Detail & Related papers (2024-07-08T17:59:54Z)
- Weakly Supervised 3D Multi-person Pose Estimation for Large-scale Scenes based on Monocular Camera and Single LiDAR [41.39277657279448]
We propose a monocular camera and single LiDAR-based method for 3D multi-person pose estimation in large-scale scenes.
Specifically, we design an effective fusion strategy to take advantage of multi-modal input data, including images and point cloud.
Our method exploits the inherent geometry constraints of point cloud for self-supervision and utilizes 2D keypoints on images for weak supervision.
arXiv Detail & Related papers (2022-11-30T12:50:40Z)
- LiDAR-aid Inertial Poser: Large-scale Human Motion Capture by Sparse Inertial and LiDAR Sensors [38.60837840737258]
We propose a multi-sensor fusion method for capturing 3D human motions with accurate consecutive local poses and global trajectories in large-scale scenarios.
We design a two-stage pose estimator in a coarse-to-fine manner, where point clouds provide the coarse body shape and IMU measurements optimize the local actions.
We collect a LiDAR-IMU multi-modal mocap dataset, LIPD, with diverse human actions in long-range scenarios.
arXiv Detail & Related papers (2022-05-30T20:15:11Z)
- P-STMO: Pre-Trained Spatial Temporal Many-to-One Model for 3D Human Pose Estimation [78.83305967085413]
This paper introduces a novel Pre-trained Spatial Temporal Many-to-One (P-STMO) model for 2D-to-3D human pose estimation task.
Our method outperforms state-of-the-art methods with fewer parameters and less computational overhead.
arXiv Detail & Related papers (2022-03-15T04:00:59Z)
- Multi-modal 3D Human Pose Estimation with 2D Weak Supervision in Autonomous Driving [74.74519047735916]
3D human pose estimation (HPE) in autonomous vehicles (AV) differs from other use cases in many factors.
Data collected for other use cases (such as virtual reality, gaming, and animation) may not be usable for AV applications.
We propose one of the first approaches to alleviate this problem in the AV setting.
arXiv Detail & Related papers (2021-12-22T18:57:16Z)
- PLUME: Efficient 3D Object Detection from Stereo Images [95.31278688164646]
Existing methods tackle the problem in two steps: first depth estimation is performed, a pseudo LiDAR point cloud representation is computed from the depth estimates, and then object detection is performed in 3D space.
We propose a model that unifies these two tasks in the same metric space.
Our approach achieves state-of-the-art performance on the challenging KITTI benchmark, with significantly reduced inference time compared with existing methods.
arXiv Detail & Related papers (2021-01-17T05:11:38Z)
- Synthetic Training for Monocular Human Mesh Recovery [100.38109761268639]
This paper aims to estimate 3D mesh of multiple body parts with large-scale differences from a single RGB image.
The main challenge is lacking training data that have complete 3D annotations of all body parts in 2D images.
We propose a depth-to-scale (D2S) projection to incorporate the depth difference into the projection function to derive per-joint scale variants.
arXiv Detail & Related papers (2020-10-27T03:31:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.