Snap-Snap: Taking Two Images to Reconstruct 3D Human Gaussians in Milliseconds
- URL: http://arxiv.org/abs/2508.14892v1
- Date: Wed, 20 Aug 2025 17:59:11 GMT
- Title: Snap-Snap: Taking Two Images to Reconstruct 3D Human Gaussians in Milliseconds
- Authors: Jia Lu, Taoran Yi, Jiemin Fang, Chen Yang, Chuiyun Wu, Wei Shen, Wenyu Liu, Qi Tian, Xinggang Wang
- Abstract summary: We propose a challenging but valuable task to reconstruct the human body from only two images. The main challenges lie in the difficulty of building 3D consistency and recovering missing information from the highly sparse input. Experiments show that our method can reconstruct the entire human in 190 ms on a single NVIDIA RTX 4090.
- Score: 71.22182851672314
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reconstructing 3D human bodies from sparse views has been an appealing topic, and is crucial to broadening related applications. In this paper, we propose a quite challenging but valuable task: reconstructing the human body from only two images, i.e., the front and back views, which can largely lower the barrier for users to create their own 3D digital humans. The main challenges lie in the difficulty of building 3D consistency and recovering missing information from the highly sparse input. Building on foundation reconstruction models and training on extensive human data, we redesign a geometry reconstruction model to predict consistent point clouds even when the input images have scarce overlap. Furthermore, an enhancement algorithm supplements the missing color information, yielding complete colored human point clouds that are directly transformed into 3D Gaussians for better rendering quality. Experiments show that our method can reconstruct the entire human in 190 ms on a single NVIDIA RTX 4090 from two images at a resolution of 1024x1024, demonstrating state-of-the-art performance on the THuman2.0 and cross-domain datasets. Additionally, our method can complete human reconstruction even with images captured by low-cost mobile devices, reducing the requirements for data collection. Demos and code are available at https://hustvl.github.io/Snap-Snap/.
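As a rough illustration of the final stage described above (turning a complete colored point cloud directly into 3D Gaussians), the sketch below initializes standard Gaussian-splatting attributes from point positions and colors. The function name, attribute layout, and the k-nearest-neighbor scale heuristic are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def init_gaussians_from_points(points, colors, k=3):
    """Initialize 3D Gaussian attributes from a colored point cloud.

    points: (N, 3) xyz positions; colors: (N, 3) RGB in [0, 1].
    """
    n = len(points)
    # Scale each Gaussian by the mean distance to its k nearest neighbors so
    # splats roughly cover the local point spacing (brute force for clarity).
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    d2.ravel()[:: n + 1] = np.inf                      # ignore self-distances
    knn = np.sqrt(np.sort(d2, axis=1)[:, :k]).mean(axis=1)
    return {
        "means": points,                               # Gaussian centers
        "colors": colors,                              # per-Gaussian base color
        "scales": np.repeat(knn[:, None], 3, axis=1),  # isotropic scales
        "rotations": np.tile([1.0, 0, 0, 0], (n, 1)),  # identity quaternions
        "opacities": np.full((n, 1), 0.8),             # initial opacity
    }

# Toy usage with random points standing in for the predicted human point cloud.
pts = np.random.rand(1000, 3).astype(np.float32)
rgb = np.random.rand(1000, 3).astype(np.float32)
gaussians = init_gaussians_from_points(pts, rgb)
print(gaussians["means"].shape, gaussians["scales"].shape)
```

Initializing means, colors, and scales from the points this way is a common starting point for Gaussian-splat optimization; the paper's speed comes from producing render-ready Gaussians directly rather than optimizing per scene.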
Related papers
- SAT: Supervisor Regularization and Animation Augmentation for Two-process Monocular Texture 3D Human Reconstruction [7.584417190255802]
Monocular texture 3D human reconstruction aims to create a complete 3D digital avatar from just a single front-view human RGB image. We propose a two-process 3D human reconstruction framework, SAT, which seamlessly learns various prior geometries in a unified manner. We also propose an Online Animation Augmentation module to tackle data scarcity and improve reconstruction quality.
arXiv Detail & Related papers (2025-08-27T08:52:35Z)
- Canonical Pose Reconstruction from Single Depth Image for 3D Non-rigid Pose Recovery on Limited Datasets [55.84702107871358]
3D reconstruction from 2D inputs, especially for non-rigid objects like humans, presents unique challenges. Traditional methods often struggle with non-rigid shapes, which require extensive training data to cover the entire deformation space. This study proposes a canonical pose reconstruction model that transforms single-view depth images of deformable shapes into a canonical form.
arXiv Detail & Related papers (2025-05-23T14:58:34Z)
- FAMOUS: High-Fidelity Monocular 3D Human Digitization Using View Synthesis [51.193297565630886]
The challenge of accurately inferring texture remains, particularly in obscured areas such as the back of a person in frontal-view images.
This limitation in texture prediction largely stems from the scarcity of large-scale and diverse 3D datasets.
We propose leveraging extensive 2D fashion datasets to enhance both texture and shape prediction in 3D human digitization.
arXiv Detail & Related papers (2024-10-13T01:25:05Z)
- UV Gaussians: Joint Learning of Mesh Deformation and Gaussian Textures for Human Avatar Modeling [71.87807614875497]
We propose UV Gaussians, which models the 3D human body by jointly learning mesh deformations and 2D UV-space Gaussian textures.
We collect and process a new dataset of human motion, which includes multi-view images, scanned models, parametric model registrations, and corresponding texture maps. Experimental results demonstrate that our method achieves state-of-the-art novel-view and novel-pose synthesis.
arXiv Detail & Related papers (2024-03-18T09:03:56Z)
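To make the "Gaussian textures in UV space" idea above concrete, here is a minimal PyTorch sketch that stores per-Gaussian attributes in a learnable UV map and reads them out at mesh-vertex UV coordinates with bilinear sampling. The class name, channel layout, and resolution are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

class UVGaussianTexture(torch.nn.Module):
    """Toy version of storing per-Gaussian attributes in a learnable UV map.

    Assumed channel layout: 3 (color) + 3 (log scale) + 4 (rotation quaternion)
    + 1 (opacity logit) = 11. Gaussians attached to mesh vertices read their
    attributes by bilinearly sampling the map at the vertices' UV coordinates.
    """
    def __init__(self, resolution=256, channels=11):
        super().__init__()
        self.texture = torch.nn.Parameter(
            torch.zeros(1, channels, resolution, resolution))

    def forward(self, uv):                        # uv: (N, 2) in [0, 1]
        grid = uv.view(1, -1, 1, 2) * 2 - 1       # map to grid_sample's [-1, 1]
        feats = F.grid_sample(self.texture, grid, align_corners=False)
        return feats.view(self.texture.shape[1], -1).T  # (N, channels)

tex = UVGaussianTexture()
uv = torch.rand(6890, 2)        # e.g. one UV coordinate per SMPL vertex
attrs = tex(uv)                 # (6890, 11) per-Gaussian attributes
print(attrs.shape)
```

Because the attributes live in an ordinary 2D texture, they can be predicted or refined by standard image networks, which is the appeal of the UV-space parameterization.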
- SiTH: Single-view Textured Human Reconstruction with Image-Conditioned Diffusion [35.73448283467723]
SiTH is a novel pipeline that integrates an image-conditioned diffusion model into a 3D mesh reconstruction workflow.
We employ a powerful generative diffusion model to hallucinate unseen back-view appearance based on the input images.
For the mesh reconstruction step, we leverage skinned body meshes as guidance to recover full-body textured meshes from the input and back-view images.
arXiv Detail & Related papers (2023-11-27T14:22:07Z)
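For intuition about back-view hallucination, the snippet below runs a stock image-to-image diffusion pipeline on a front-view photo. SiTH trains its own image-conditioned diffusion model; the checkpoint, prompt, and strength here are stand-in assumptions, not the paper's method.

```python
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

# Generic img2img stand-in; SiTH conditions a dedicated diffusion model on the
# input image rather than prompting a stock checkpoint like this.
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",    # illustrative checkpoint choice
    torch_dtype=torch.float16,
).to("cuda")

front = Image.open("front_view.png").convert("RGB").resize((512, 512))
back = pipe(
    prompt="back view of the same person, same clothing, studio lighting",
    image=front,            # start denoising from the front view
    strength=0.75,          # how far to depart from the input image
    guidance_scale=7.5,
).images[0]
back.save("hallucinated_back_view.png")
```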
- High-fidelity 3D Human Digitization from Single 2K Resolution Images [16.29087820634057]
We propose 2K2K, which constructs a large-scale 2K human dataset and infers 3D human models from 2K resolution images.
We also provide 2,050 3D human models, including texture maps, 3D joints, and SMPL parameters for research purposes.
arXiv Detail & Related papers (2023-03-27T11:22:54Z)
- NeuralReshaper: Single-image Human-body Retouching with Deep Neural Networks [50.40798258968408]
We present NeuralReshaper, a novel method for semantic reshaping of human bodies in single images using deep generative networks.
Our approach follows a fit-then-reshape pipeline, which first fits a parametric 3D human model to a source human image.
To deal with the lack of paired training data, we introduce a novel self-supervised strategy to train our network.
arXiv Detail & Related papers (2022-03-20T09:02:13Z)
- RIN: Textured Human Model Recovery and Imitation with a Single Image [4.87676530016726]
We propose a novel volume-based framework for reconstructing a textured 3D model from a single picture.
Specifically, to estimate most of the human texture, we propose a U-Net-like front-to-back translation network.
Our experiments demonstrate that our volume-based model is adequate for human imitation, and the back view can be estimated reliably using our network.
arXiv Detail & Related papers (2020-11-24T11:04:35Z)
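A front-to-back translation network of the kind described above can be sketched as a small image-to-image U-Net. The layer widths and depth below are arbitrary illustrative choices, not RIN's actual architecture.

```python
import torch
import torch.nn as nn

def block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1),
                         nn.ReLU(inplace=True))

class FrontToBackUNet(nn.Module):
    """Tiny U-Net mapping a front-view RGB image to a predicted back view."""
    def __init__(self):
        super().__init__()
        self.enc1, self.enc2 = block(3, 32), block(32, 64)
        self.pool = nn.MaxPool2d(2)
        self.up = nn.Upsample(scale_factor=2, mode="bilinear",
                              align_corners=False)
        self.dec = block(64 + 32, 32)            # skip connection from enc1
        self.out = nn.Conv2d(32, 3, 1)

    def forward(self, x):
        e1 = self.enc1(x)                        # full-resolution features
        e2 = self.enc2(self.pool(e1))            # half-resolution features
        d = self.dec(torch.cat([self.up(e2), e1], dim=1))
        return torch.sigmoid(self.out(d))        # RGB back view in [0, 1]

net = FrontToBackUNet()
front = torch.rand(1, 3, 256, 256)               # toy front-view image
back = net(front)
print(back.shape)                                # torch.Size([1, 3, 256, 256])
```

The skip connection lets the predicted back view inherit the silhouette and clothing layout of the front view while the bottleneck hallucinates the unseen appearance.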
- Pose2Mesh: Graph Convolutional Network for 3D Human Pose and Mesh Recovery from a 2D Human Pose [70.23652933572647]
We propose a novel graph convolutional neural network (GraphCNN)-based system that estimates the 3D coordinates of human mesh vertices directly from the 2D human pose.
We show that our Pose2Mesh outperforms the previous 3D human pose and mesh estimation methods on various benchmark datasets.
arXiv Detail & Related papers (2020-08-20T16:01:56Z)
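As a minimal sketch of regressing mesh vertices from a 2D pose with graph convolutions, the toy model below broadcasts a pose embedding to every vertex and refines it over the mesh adjacency. Pose2Mesh's actual MeshNet uses spectral graph convolutions in a coarse-to-fine cascade; the names, sizes, and mean-aggregation GCN here are assumptions for illustration.

```python
import torch
import torch.nn as nn

class GraphConv(nn.Module):
    """One graph convolution: average neighbor features, then a linear map."""
    def __init__(self, cin, cout):
        super().__init__()
        self.linear = nn.Linear(cin, cout)

    def forward(self, x, adj):                   # x: (V, cin), adj: (V, V)
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        return self.linear(adj @ x / deg)        # mean over neighbors

class Pose2MeshSketch(nn.Module):
    """Regress per-vertex 3D coordinates from 2D joint positions."""
    def __init__(self, num_joints=17, hidden=64):
        super().__init__()
        self.embed = nn.Linear(num_joints * 2, hidden)
        self.gc1 = GraphConv(hidden, hidden)
        self.gc2 = GraphConv(hidden, 3)

    def forward(self, pose2d, adj):
        # Broadcast a global pose embedding to every mesh vertex, then let
        # graph convolutions over mesh edges refine per-vertex coordinates.
        v = adj.shape[0]
        h = torch.relu(self.embed(pose2d.flatten())).expand(v, -1)
        h = torch.relu(self.gc1(h, adj))
        return self.gc2(h, adj)                  # (V, 3) vertex coordinates

# Toy usage: a 4-vertex "mesh" ring (with self-loops) and a 17-joint 2D pose.
adj = torch.tensor([[1., 1, 0, 1], [1, 1, 1, 0], [0, 1, 1, 1], [1, 0, 1, 1]])
pose2d = torch.rand(17, 2)
verts = Pose2MeshSketch()(pose2d, adj)
print(verts.shape)                               # torch.Size([4, 3])
```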
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.