3D Human Mesh Estimation from Single View RGBD
- URL: http://arxiv.org/abs/2508.08178v2
- Date: Tue, 12 Aug 2025 16:25:31 GMT
- Title: 3D Human Mesh Estimation from Single View RGBD
- Authors: Ozhan Suat, Bedirhan Uguz, Batuhan Karagoz, Muhammed Can Keles, Emre Akbas
- Abstract summary: We present a method for accurate 3D human mesh estimation from a single RGBD view. We leverage existing Motion Capture (MoCap) datasets to overcome data scarcity. We obtain a competitive 70.9 mm PVE on the BEHAVE dataset, outperforming a recently published RGB-based method by 18.4 mm.
- Score: 7.835177716421862
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite significant progress in 3D human mesh estimation from RGB images, RGBD cameras, which offer additional depth data, remain underutilized. In this paper, we present a method for accurate 3D human mesh estimation from a single RGBD view, leveraging the affordability and widespread adoption of RGBD cameras for real-world applications. A fully supervised approach to this problem requires a dataset with paired RGBD images and 3D mesh labels. However, collecting such a dataset is costly and challenging; hence, existing datasets are small and limited in pose and shape diversity. To overcome this data scarcity, we leverage existing Motion Capture (MoCap) datasets. We first obtain complete 3D meshes from the body models found in MoCap datasets, and create partial, single-view versions of them by projecting them to a virtual camera. This simulates the depth data provided by an RGBD camera from a single viewpoint. Then, we train a masked autoencoder to complete the partial, single-view mesh. During inference, our method, which we name M$^3$ for ``Masked Mesh Modeling'', matches the depth values coming from the sensor to the vertices of a template human mesh, creating a partial, single-view mesh. We effectively recover the parts of the 3D human body mesh that are not visible, resulting in a full-body mesh. M$^3$ achieves 16.8 mm and 22.0 mm per-vertex error (PVE) on the SURREAL and CAPE datasets, respectively, outperforming existing methods that use full-body point clouds as input. We obtain a competitive 70.9 mm PVE on the BEHAVE dataset, outperforming a recently published RGB-based method by 18.4 mm, highlighting the usefulness of depth data. Code will be released.
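As a concrete illustration of the pipeline described above, here is a minimal sketch (assumptions throughout, not the authors' released code) of the data-generation step: a full-body mesh is projected into a virtual pinhole camera and z-buffered so that only visible vertices survive, simulating a single-view RGBD capture; the resulting partial mesh would then be handed to the masked completion model, left here as a hypothetical `complete_mesh` call.

```python
# Sketch: simulate a single-view RGBD capture from a complete body mesh.
import numpy as np

def visible_vertex_mask(verts, fx=500.0, fy=500.0, cx=320.0, cy=240.0,
                        hw=(480, 640), depth_tol=0.02):
    """Project vertices with a pinhole camera and z-buffer them.

    verts: (N, 3) vertices in camera coordinates with z > 0.
    Returns a boolean mask of vertices that win the depth test in
    their pixel, i.e. are visible from this virtual viewpoint.
    """
    h, w = hw
    z = verts[:, 2]
    u = np.clip((fx * verts[:, 0] / z + cx).astype(int), 0, w - 1)
    v = np.clip((fy * verts[:, 1] / z + cy).astype(int), 0, h - 1)
    zbuf = np.full((h, w), np.inf)
    np.minimum.at(zbuf, (v, u), z)          # nearest depth per pixel
    return z <= zbuf[v, u] + depth_tol      # visible if close to the nearest

# Usage: build a (partial, full) training pair from a stand-in mesh.
verts = np.random.rand(6890, 3) + [0.0, 0.0, 2.0]   # placeholder for a SMPL mesh
mask = visible_vertex_mask(verts)
partial = np.where(mask[:, None], verts, 0.0)       # hidden vertices masked out
# full = complete_mesh(partial, mask)   # hypothetical masked-autoencoder completion
```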
Related papers
- CameraHMR: Aligning People with Perspective [54.05758012879385]
We address the challenge of accurate 3D human pose and shape estimation from monocular images.
Existing training datasets containing real images with pseudo ground truth (pGT) use SMPLify to fit SMPL to sparse 2D joint locations.
We make two contributions that improve pGT accuracy.
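For context, the SMPLify-style fitting mentioned above minimizes the 2D reprojection error of posed 3D joints against detected keypoints. The sketch below illustrates that objective on a toy sub-problem, optimizing only a global translation so the example stays self-contained; it is an illustrative assumption, not CameraHMR's or SMPLify's actual code.

```python
# Sketch: fit a global translation by minimizing 2D reprojection error.
import numpy as np
from scipy.optimize import minimize

def project(joints3d, t, f=1000.0, c=(500.0, 500.0)):
    """Pinhole projection of (J, 3) joints translated by t."""
    p = joints3d + t
    return f * p[:, :2] / p[:, 2:3] + np.asarray(c)

def reprojection_loss(t, joints3d, keypoints2d, conf):
    err = project(joints3d, t) - keypoints2d
    return np.sum(conf[:, None] * err ** 2)    # confidence-weighted squared error

joints3d = np.random.randn(24, 3) * 0.3        # stand-in posed 3D joints
gt_t = np.array([0.05, -0.02, 3.0])
keypoints2d = project(joints3d, gt_t)          # synthetic "detections"
conf = np.ones(24)                             # per-keypoint confidences

res = minimize(reprojection_loss, x0=np.array([0.0, 0.0, 2.5]),
               args=(joints3d, keypoints2d, conf))
print("recovered translation:", res.x)         # should approach gt_t
```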
arXiv Detail & Related papers (2024-11-12T19:12:12Z) - FAMOUS: High-Fidelity Monocular 3D Human Digitization Using View Synthesis [51.193297565630886]
The challenge of accurately inferring texture remains, particularly in obscured areas such as the back of a person in frontal-view images.
This limitation in texture prediction largely stems from the scarcity of large-scale and diverse 3D datasets.
We propose leveraging extensive 2D fashion datasets to enhance both texture and shape prediction in 3D human digitization.
arXiv Detail & Related papers (2024-10-13T01:25:05Z) - MoCap-to-Visual Domain Adaptation for Efficient Human Mesh Estimation from 2D Keypoints [8.405938712823563]
Key2Mesh is a model that takes a set of 2D human pose keypoints as input and estimates the corresponding body mesh.
Our results show that Key2Mesh sets a new state of the art, outperforming other models in PA-MPJPE on the 3DPW dataset.
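As a rough illustration of the input/output contract described above (2D keypoints in, body mesh parameters out), the following hypothetical MLP lifter shows the shape of the problem; the layer sizes and SMPL-style pose/shape parameterization are assumptions, not Key2Mesh's architecture.

```python
# Sketch: lift flattened 2D keypoints to SMPL-style pose/shape parameters.
import torch
import torch.nn as nn

class KeypointLifter(nn.Module):
    def __init__(self, n_kpts=17, pose_dim=72, shape_dim=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_kpts * 2, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, pose_dim + shape_dim),
        )
        self.pose_dim = pose_dim

    def forward(self, kpts2d):                  # kpts2d: (B, n_kpts, 2)
        out = self.net(kpts2d.flatten(1))
        return out[:, :self.pose_dim], out[:, self.pose_dim:]   # pose, shape

pose, shape = KeypointLifter()(torch.randn(4, 17, 2))
print(pose.shape, shape.shape)                  # (4, 72) and (4, 10)
```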
arXiv Detail & Related papers (2024-04-10T15:34:10Z) - 3D Human Reconstruction in the Wild with Synthetic Data Using Generative Models [52.96248836582542]
We propose an effective approach based on recent diffusion models, termed HumanWild, which can effortlessly generate human images and corresponding 3D mesh annotations.
By exclusively employing generative models, we generate large-scale in-the-wild human images and high-quality annotations, eliminating the need for real-world data collection.
arXiv Detail & Related papers (2024-03-17T06:31:16Z) - LiCamPose: Combining Multi-View LiDAR and RGB Cameras for Robust Single-frame 3D Human Pose Estimation [31.651300414497822]
LiCamPose is a pipeline that integrates multi-view RGB and sparse point cloud information to estimate robust 3D human poses from a single frame.
LiCamPose is evaluated on four datasets, including two public datasets, one synthetic dataset, and one challenging self-collected dataset.
arXiv Detail & Related papers (2023-12-11T14:30:11Z) - Pyramid Deep Fusion Network for Two-Hand Reconstruction from RGB-D Images [11.100398985633754]
We propose an end-to-end framework for recovering dense meshes for both hands.
Our framework employs ResNet50 and PointNet++ to derive features from the RGB image and the point cloud, respectively.
We also introduce a novel pyramid deep fusion network (PDFNet) to aggregate features at different scales.
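The sketch below shows one simple way such multi-scale image/point fusion can be wired up: per-scale features from the two backbones are concatenated, mixed, and aggregated across scales. The structure and dimensions are illustrative assumptions, not PDFNet itself.

```python
# Sketch: fuse per-scale image and point-cloud features into one descriptor.
import torch
import torch.nn as nn

class PyramidFusion(nn.Module):
    def __init__(self, img_dims=(256, 512, 1024), pt_dims=(128, 256, 512), out=256):
        super().__init__()
        self.mix = nn.ModuleList(
            nn.Linear(i + p, out) for i, p in zip(img_dims, pt_dims)
        )

    def forward(self, img_feats, pt_feats):
        # img_feats / pt_feats: lists of (B, C_s) pooled features per scale,
        # e.g. from ResNet50 stages and PointNet++ set-abstraction levels.
        fused = [torch.relu(m(torch.cat([i, p], dim=-1)))
                 for m, i, p in zip(self.mix, img_feats, pt_feats)]
        return torch.stack(fused, dim=0).mean(0)    # aggregate across scales

img = [torch.randn(2, c) for c in (256, 512, 1024)]
pts = [torch.randn(2, c) for c in (128, 256, 512)]
print(PyramidFusion()(img, pts).shape)              # torch.Size([2, 256])
```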
arXiv Detail & Related papers (2023-07-12T09:33:21Z) - Sampling is Matter: Point-guided 3D Human Mesh Reconstruction [0.0]
This paper presents a simple yet powerful method for 3D human mesh reconstruction from a single RGB image.
Experimental results on benchmark datasets show that the proposed method efficiently improves the performance of 3D human mesh reconstruction.
arXiv Detail & Related papers (2023-04-19T08:45:26Z) - MPT: Mesh Pre-Training with Transformers for Human Pose and Mesh Reconstruction [56.80384196339199]
Mesh Pre-Training (MPT) is a new pre-training framework that leverages 3D mesh data such as MoCap data for human pose and mesh reconstruction from a single image.
MPT enables transformer models to have zero-shot capability of human mesh reconstruction from real images.
arXiv Detail & Related papers (2022-11-24T00:02:13Z) - VPFNet: Improving 3D Object Detection with Virtual Point based LiDAR and Stereo Data Fusion [62.24001258298076]
VPFNet is a new architecture that cleverly aligns and aggregates the point cloud and image data at 'virtual' points.
Our VPFNet achieves 83.21% moderate 3D AP and 91.86% moderate BEV AP on the KITTI test set, ranking 1st since May 21st, 2021.
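One plausible reading of aligning the two modalities at projected points is sketched below: each 3D point is projected into the image plane and an image feature is gathered at that pixel, then concatenated with the point itself. This is a generic illustration under assumed intrinsics, not VPFNet's actual module.

```python
# Sketch: gather image features at the pixels where 3D points project.
import numpy as np

def gather_image_features(points, feat_map, K):
    """points: (N, 3) in camera frame; feat_map: (H, W, C); K: 3x3 intrinsics."""
    uvw = points @ K.T
    uv = (uvw[:, :2] / uvw[:, 2:3]).astype(int)   # pixel coordinates
    h, w, _ = feat_map.shape
    uv[:, 0] = np.clip(uv[:, 0], 0, w - 1)
    uv[:, 1] = np.clip(uv[:, 1], 0, h - 1)
    return feat_map[uv[:, 1], uv[:, 0]]           # (N, C) sampled features

K = np.array([[40.0, 0, 80], [0, 40.0, 24], [0, 0, 1]])
pts = np.random.rand(100, 3) * [20, 4, 1] + [0, -2, 5]   # points in front of camera
feat_map = np.random.randn(48, 160, 16)                  # stand-in feature map
fused = np.concatenate([pts, gather_image_features(pts, feat_map, K)], axis=1)
print(fused.shape)                                        # (100, 19)
```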
arXiv Detail & Related papers (2021-11-29T08:51:20Z) - Synthetic Training for Monocular Human Mesh Recovery [100.38109761268639]
This paper aims to estimate the 3D mesh of multiple body parts with large differences in scale from a single RGB image.
The main challenge is the lack of training data with complete 3D annotations of all body parts in 2D images.
We propose a depth-to-scale (D2S) projection to incorporate the depth difference into the projection function to derive per-joint scale variants.
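One way to read the depth-to-scale idea is the pinhole relation that a joint's 2D scale is proportional to f divided by its depth, so per-joint depth offsets from the root yield per-joint scale variants in the projection. The sketch below illustrates that reading; it is not necessarily the paper's exact formulation.

```python
# Sketch: per-joint projection scales from per-joint depths (pinhole model).
import numpy as np

def d2s_project(joints3d, root_depth, f=1000.0, c=(500.0, 500.0)):
    """joints3d: (J, 3) root-relative; returns 2D joints and per-joint scales."""
    z = root_depth + joints3d[:, 2]          # absolute depth of each joint
    scale = f / z                            # depth-to-scale factor per joint
    uv = joints3d[:, :2] * scale[:, None] + np.asarray(c)
    return uv, scale

joints = np.random.randn(24, 3) * 0.3
uv, scale = d2s_project(joints, root_depth=3.0)
print(uv.shape, scale.min(), scale.max())    # closer joints get larger scales
```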
arXiv Detail & Related papers (2020-10-27T03:31:35Z) - EPOS: Estimating 6D Pose of Objects with Symmetries [57.448933686429825]
We present a new method for estimating the 6D pose of rigid objects with available 3D models from a single RGB input.
An object is represented by compact surface fragments, which allow handling symmetries in a systematic manner.
Correspondences between densely sampled pixels and the fragments are predicted using an encoder-decoder network.
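Correspondence-based pose estimators of this kind typically finish with a robust PnP solver once pixels are linked to 3D surface coordinates; the sketch below shows that final stage using OpenCV's RANSAC PnP on synthetic 2D-3D matches. EPOS uses its own robust estimator, so this is only an analogous illustration.

```python
# Sketch: recover a 6D pose from 2D-3D correspondences with RANSAC PnP.
import numpy as np
import cv2

obj_pts = np.random.rand(50, 3).astype(np.float32)       # 3D surface coordinates
K = np.array([[600, 0, 320], [0, 600, 240], [0, 0, 1]], dtype=np.float32)
rvec_gt = np.array([[0.1], [0.2], [0.0]], dtype=np.float32)
tvec_gt = np.array([[0.0], [0.0], [2.0]], dtype=np.float32)
img_pts, _ = cv2.projectPoints(obj_pts, rvec_gt, tvec_gt, K, None)

ok, rvec, tvec, inliers = cv2.solvePnPRansac(obj_pts, img_pts, K, None)
print(ok, rvec.ravel(), tvec.ravel())        # recovers the synthetic pose
```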
arXiv Detail & Related papers (2020-04-01T17:41:08Z) - PeeledHuman: Robust Shape Representation for Textured 3D Human Body
Reconstruction [7.582064461041252]
PeeledHuman encodes the human body as a set of Peeled Depth and RGB maps in 2D.
We train PeelGAN using a 3D Chamfer loss and other 2D losses to generate multiple depth values per pixel and a corresponding RGB field per vertex.
In our simple non-parametric solution, the generated Peeled Depth maps are back-projected to 3D space to obtain a complete textured 3D shape.
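The back-projection step mentioned above follows directly from the pinhole camera model; below is a minimal sketch under assumed intrinsics, lifting each peeled depth layer to 3D and concatenating the layers into one point set.

```python
# Sketch: back-project peeled depth maps into a single 3D point cloud.
import numpy as np

def backproject(depth, fx=500.0, fy=500.0, cx=128.0, cy=128.0):
    """depth: (H, W) map; returns (M, 3) points, skipping zero-depth pixels."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    z = depth.ravel()
    valid = z > 0
    x = (u.ravel() - cx) * z / fx
    y = (v.ravel() - cy) * z / fy
    return np.stack([x, y, z], axis=1)[valid]

layers = [np.random.rand(256, 256) + d for d in (1.0, 1.5)]   # stand-in peel maps
cloud = np.concatenate([backproject(d) for d in layers], axis=0)
print(cloud.shape)                                            # (131072, 3)
```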
arXiv Detail & Related papers (2020-02-16T20:03:24Z)