UP2You: Fast Reconstruction of Yourself from Unconstrained Photo Collections
- URL: http://arxiv.org/abs/2509.24817v1
- Date: Mon, 29 Sep 2025 14:06:00 GMT
- Title: UP2You: Fast Reconstruction of Yourself from Unconstrained Photo Collections
- Authors: Zeyu Cai, Ziyang Li, Xiaoben Li, Boqian Li, Zeyu Wang, Zhenyu Zhang, Yuliang Xiu
- Abstract summary: UP2You is a tuning-free solution for reconstructing high-fidelity 3D portraits from unconstrained in-the-wild 2D photos.
Central to UP2You is a pose-correlated feature aggregation module.
Experiments on 4D-Dress, PuzzleIOI, and in-the-wild captures demonstrate that UP2You consistently surpasses previous methods in both geometric accuracy and texture fidelity.
- Score: 21.55668740343458
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present UP2You, the first tuning-free solution for reconstructing high-fidelity 3D clothed portraits from extremely unconstrained in-the-wild 2D photos. Unlike previous approaches that require "clean" inputs (e.g., full-body images with minimal occlusions, or well-calibrated cross-view captures), UP2You directly processes raw, unstructured photographs, which may vary significantly in pose, viewpoint, cropping, and occlusion. Instead of compressing data into tokens for slow online text-to-3D optimization, we introduce a data rectifier paradigm that efficiently converts unconstrained inputs into clean, orthogonal multi-view images in a single forward pass within seconds, simplifying the 3D reconstruction. Central to UP2You is a pose-correlated feature aggregation module (PCFA), which selectively fuses information from multiple reference images w.r.t. target poses, enabling better identity preservation and a nearly constant memory footprint as more observations are added. We also introduce a perceiver-based multi-reference shape predictor that removes the need for pre-captured body templates. Extensive experiments on 4D-Dress, PuzzleIOI, and in-the-wild captures demonstrate that UP2You consistently surpasses previous methods in both geometric accuracy (15% lower Chamfer, 18% lower P2S on PuzzleIOI) and texture fidelity (21% higher PSNR, 46% lower LPIPS on 4D-Dress). UP2You is efficient (1.5 minutes per person) and versatile (supporting arbitrary pose control and training-free multi-garment 3D virtual try-on), making it practical for real-world scenarios where humans are casually captured. Both models and code will be released to facilitate future research on this underexplored task. Project Page: https://zcai0612.github.io/UP2You
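The PCFA description suggests an attention-style fusion in which each reference photo contributes according to how well its pose correlates with the target pose, while references are folded in one at a time so memory does not grow with the photo count. Below is a minimal sketch of one plausible form; the function name, token-shaped features, and max-based relevance score are illustrative assumptions, not the released module.

```python
import torch
import torch.nn.functional as F

def pose_correlated_aggregate(target_pose_feat, ref_feats, ref_pose_feats):
    """Hypothetical PCFA-like fusion (a sketch, not the paper's code).

    target_pose_feat: (T, C) tokens describing the target pose.
    ref_feats:        list of (S, C) image features, one per reference photo.
    ref_pose_feats:   list of (S, C) pose features aligned with each reference.
    """
    fused = torch.zeros_like(target_pose_feat)
    weight_sum = target_pose_feat.new_zeros(target_pose_feat.shape[0], 1)
    for feat, pose in zip(ref_feats, ref_pose_feats):
        # Correlate target-pose tokens with this reference's pose tokens.
        attn = F.softmax(target_pose_feat @ pose.t() / pose.shape[-1] ** 0.5, dim=-1)
        contrib = attn @ feat  # (T, C) features pulled from this reference
        # Per-token relevance of this reference: peak correlation strength.
        score = attn.max(dim=-1, keepdim=True).values
        fused += score * contrib
        weight_sum += score
    # Normalize so adding more references reweights rather than inflates features.
    return fused / weight_sum.clamp_min(1e-6)
```

Because references are consumed sequentially, peak memory depends on one reference's token count rather than on how many photos are supplied, consistent with the abstract's near-constant-memory claim.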
Related papers
- RUMPL: Ray-Based Transformers for Universal Multi-View 2D to 3D Human Pose Lifting [81.66201044236321]
Estimating 3D human poses from 2D images remains challenging.
Recent methods rely on 2D pose estimation combined with 2D-to-3D pose lifting trained on synthetic data.
We propose RUMPL, a transformer-based 3D pose lifter that introduces a 3D ray-based representation of 2D keypoints.
arXiv Detail & Related papers (2025-12-17T14:37:27Z)
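A ray-based representation of 2D keypoints, as named in the RUMPL entry above, plausibly amounts to back-projecting each keypoint through the camera center so the encoding no longer depends on a specific camera's pixel grid. A hedged sketch of that idea (the intrinsics-based form is an assumption; the paper's exact parameterization may differ):

```python
import numpy as np

def keypoints_to_rays(kpts_2d, K):
    """Turn (J, 2) pixel keypoints into unit viewing rays via intrinsics K (3, 3)."""
    homo = np.concatenate([kpts_2d, np.ones((kpts_2d.shape[0], 1))], axis=1)
    rays = homo @ np.linalg.inv(K).T  # directions in the camera frame
    return rays / np.linalg.norm(rays, axis=1, keepdims=True)
```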
- SPAR3D: Stable Point-Aware Reconstruction of 3D Objects from Single Images [49.7344030427291]
We study the problem of single-image 3D object reconstruction.
Recent works have diverged into two directions: regression-based modeling and generative modeling.
We present SPAR3D, a novel two-stage approach aiming to take the best of both directions.
arXiv Detail & Related papers (2025-01-08T18:52:03Z)
- Dust to Tower: Coarse-to-Fine Photo-Realistic Scene Reconstruction from Sparse Uncalibrated Images [11.1039786318131]
Dust to Tower (D2T) is an efficient framework to optimize 3DGS and image poses simultaneously from sparse and uncalibrated images.
Our key idea is to first construct a coarse model efficiently and subsequently refine it using warped and inpainted images at novel viewpoints.
Experiments and ablation studies demonstrate the validity of D2T and its design choices.
arXiv Detail & Related papers (2024-12-27T08:19:34Z)
- UPose3D: Uncertainty-Aware 3D Human Pose Estimation with Cross-View and Temporal Cues [55.69339788566899]
UPose3D is a novel approach for multi-view 3D human pose estimation.
It improves robustness and flexibility without requiring direct 3D annotations.
arXiv Detail & Related papers (2024-04-23T00:18:00Z)
- Lifting by Image -- Leveraging Image Cues for Accurate 3D Human Pose Estimation [10.374944534302234]
"lifting from 2D pose" method has been the dominant approach to 3D Human Pose Estimation (3DHPE)
Rich semantic and texture information in images can contribute to a more accurate "lifting" procedure.
In this paper, we offer new insight into the causes of poor generalization and the effectiveness of image features.
arXiv Detail & Related papers (2023-12-25T07:50:58Z)
- Two Views Are Better than One: Monocular 3D Pose Estimation with Multiview Consistency [0.493599216374976]
We introduce a novel loss function, consistency loss, which operates on two synchronized views.
Our consistency loss substantially improves performance for fine-tuning without requiring 3D data.
We show that using our consistency loss can yield state-of-the-art performance when training models from scratch in a semi-supervised manner.
arXiv Detail & Related papers (2023-11-21T08:21:55Z)
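The two-view consistency idea in the entry above can be made concrete: monocular 3D predictions from two synchronized cameras should agree once expressed in a common frame, which requires no 3D ground truth. One plausible form, assuming known relative extrinsics (the paper's exact loss may differ):

```python
import torch

def consistency_loss(pose_cam1, pose_cam2, R_12, t_12):
    """Penalize disagreement between two monocular (J, 3) predictions.

    R_12 (3, 3) and t_12 (3,) map camera-1 coordinates into camera-2's frame.
    """
    pose_1_in_2 = pose_cam1 @ R_12.t() + t_12  # re-express view-1 prediction
    return torch.norm(pose_1_in_2 - pose_cam2, dim=-1).mean()
```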
- Hi-LASSIE: High-Fidelity Articulated Shape and Skeleton Discovery from Sparse Image Ensemble [72.3681707384754]
Hi-LASSIE performs 3D articulated reconstruction from only 20-30 online images in the wild without any user-defined shape or skeleton templates.
First, instead of relying on a manually annotated 3D skeleton, we automatically estimate a class-specific skeleton from the selected reference image.
Second, we improve the shape reconstructions with novel instance-specific optimization strategies that allow the reconstructions to faithfully fit each instance.
arXiv Detail & Related papers (2022-12-21T14:31:33Z)
- VirtualPose: Learning Generalizable 3D Human Pose Models from Virtual Data [69.64723752430244]
We introduce VirtualPose, a two-stage learning framework to exploit the hidden "free lunch" specific to this task.
The first stage transforms images to abstract geometry representations (AGR), and then the second maps them to 3D poses.
It addresses the generalization issue from two aspects: (1) the first stage can be trained on diverse 2D datasets to reduce the risk of over-fitting to limited appearance; (2) the second stage can be trained on diverse AGR synthesized from a large number of virtual cameras and poses.
arXiv Detail & Related papers (2022-07-20T14:47:28Z)
- Shape-aware Multi-Person Pose Estimation from Multi-View Images [47.13919147134315]
Our proposed coarse-to-fine pipeline first aggregates noisy 2D observations from multiple camera views into 3D space.
The final pose estimates are attained from a novel optimization scheme which links high-confidence multi-view 2D observations and 3D joint candidates.
arXiv Detail & Related papers (2021-10-05T20:04:21Z)
- Synthetic Training for Monocular Human Mesh Recovery [100.38109761268639]
This paper aims to estimate 3D meshes of multiple body parts with large scale differences from a single RGB image.
The main challenge is the lack of training data with complete 3D annotations of all body parts in 2D images.
We propose a depth-to-scale (D2S) projection to incorporate the depth difference into the projection function to derive per-joint scale variants.
arXiv Detail & Related papers (2020-10-27T03:31:35Z)
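The depth-to-scale projection in the last entry can be read as a pinhole-camera fact: a point at depth d projects with image scale f/d, so per-joint depth offsets from the root yield per-joint scale variants. A hedged sketch under that reading (names and shapes are assumptions):

```python
import numpy as np

def d2s_project(joints_3d, focal, root_depth):
    """Project (J, 3) root-relative joints with a per-joint scale f / depth."""
    depths = root_depth + joints_3d[:, 2]     # per-joint absolute depth
    scales = focal / depths                   # depth-dependent scale variants
    uv = joints_3d[:, :2] * scales[:, None]   # scale x/y by each joint's factor
    return uv, scales
```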