HumanRAM: Feed-forward Human Reconstruction and Animation Model using Transformers
- URL: http://arxiv.org/abs/2506.03118v1
- Date: Tue, 03 Jun 2025 17:50:05 GMT
- Title: HumanRAM: Feed-forward Human Reconstruction and Animation Model using Transformers
- Authors: Zhiyuan Yu, Zhe Li, Hujun Bao, Can Yang, Xiaowei Zhou
- Abstract summary: HumanRAM is a novel feed-forward approach for generalizable human reconstruction and animation from monocular or sparse human images. Our approach integrates human reconstruction and animation into a unified framework by introducing explicit pose conditions. Experiments show that HumanRAM significantly surpasses previous methods in terms of reconstruction accuracy, animation fidelity, and generalization performance on real-world datasets.
- Score: 60.86393841247567
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: 3D human reconstruction and animation are long-standing topics in computer graphics and vision. However, existing methods typically rely on sophisticated dense-view capture and/or time-consuming per-subject optimization procedures. To address these limitations, we propose HumanRAM, a novel feed-forward approach for generalizable human reconstruction and animation from monocular or sparse human images. Our approach integrates human reconstruction and animation into a unified framework by introducing explicit pose conditions, parameterized by a shared SMPL-X neural texture, into transformer-based large reconstruction models (LRM). Given monocular or sparse input images with associated camera parameters and SMPL-X poses, our model employs scalable transformers and a DPT-based decoder to synthesize realistic human renderings under novel viewpoints and novel poses. By leveraging the explicit pose conditions, our model simultaneously enables high-quality human reconstruction and high-fidelity pose-controlled animation. Experiments show that HumanRAM significantly surpasses previous methods in terms of reconstruction accuracy, animation fidelity, and generalization performance on real-world datasets. Video results are available at https://zju3dv.github.io/humanram/.
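The abstract describes a token-based pipeline: patchified input-image tokens and SMPL-X pose-condition tokens pass through scalable transformers, and a DPT-based decoder regresses the target rendering. Below is a minimal, hedged sketch of that design in PyTorch; every module name, dimension, and the simplified convolutional head (standing in for the DPT decoder) are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class HumanRAMSketch(nn.Module):
    """Illustrative stand-in for the feed-forward pipeline in the abstract."""

    def __init__(self, patch=16, dim=512, depth=6, heads=8):
        super().__init__()
        # Patch embeddings for source views and for the pose-condition map
        # (e.g. the shared SMPL-X neural texture rasterized at the target pose).
        self.img_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pose_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, depth)
        # Simple upsampling head standing in for the DPT-based decoder.
        self.decode = nn.Sequential(
            nn.ConvTranspose2d(dim, dim // 2, 4, stride=4), nn.GELU(),
            nn.ConvTranspose2d(dim // 2, 3, 4, stride=4),
        )

    def forward(self, src_imgs, pose_cond):
        # src_imgs:  (B, V, 3, H, W) monocular (V=1) or sparse input views
        # pose_cond: (B, 3, H, W) condition rendered for the target pose/view
        B, V = src_imgs.shape[:2]
        img_tok = self.img_embed(src_imgs.flatten(0, 1))  # (B*V, D, h, w)
        D, h, w = img_tok.shape[1:]
        img_tok = img_tok.flatten(2).transpose(1, 2).reshape(B, V * h * w, D)
        pose_tok = self.pose_embed(pose_cond).flatten(2).transpose(1, 2)
        # Joint attention over pose tokens and all source-view tokens
        # (positional encodings omitted for brevity).
        tokens = self.backbone(torch.cat([pose_tok, img_tok], dim=1))
        # Decode the pose-condition tokens into the target rendering.
        out = tokens[:, : h * w].transpose(1, 2).reshape(B, D, h, w)
        return self.decode(out)  # (B, 3, H, W)

model = HumanRAMSketch()
views = torch.randn(1, 2, 3, 256, 256)   # two sparse input views
target_pose = torch.randn(1, 3, 256, 256)
print(model(views, target_pose).shape)   # torch.Size([1, 3, 256, 256])
```

The key property this sketch mirrors is that reconstruction and animation share one forward pass: changing only the pose-condition input re-renders the same subject in a new pose or view, with no per-subject optimization.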
Related papers
- PF-LHM: 3D Animatable Avatar Reconstruction from Pose-free Articulated Human Images [23.745241278910946]
PF-LHM is a large human reconstruction model that generates high-quality 3D avatars in seconds from one or multiple casually captured pose-free images. Our method unifies single- and multi-image 3D human reconstruction, achieving high-fidelity and animatable 3D human avatars without requiring camera and human pose annotations.
arXiv Detail & Related papers (2025-06-16T17:59:56Z) - ODHSR: Online Dense 3D Reconstruction of Humans and Scenes from Monocular Videos [18.73641648585445]
Recent neural rendering advances have enabled holistic human-scene reconstruction but require pre-calibrated camera and human poses. We introduce a novel unified framework that simultaneously performs camera tracking, human pose estimation, and human-scene reconstruction in an online fashion. Specifically, we design a human deformation module to faithfully reconstruct details and enhance generalizability to out-of-distribution poses.
arXiv Detail & Related papers (2025-04-17T17:59:02Z) - FRESA: Feedforward Reconstruction of Personalized Skinned Avatars from Few Images [74.86864398919467]
We present a novel method for reconstructing personalized 3D human avatars with realistic animation from only a few images. We learn a universal prior from over a thousand clothed humans to achieve instant feedforward generation and zero-shot generalization. Our method generates more authentic reconstructions and animations than state-of-the-art methods and can be directly generalized to inputs from casually taken phone photos.
arXiv Detail & Related papers (2025-03-24T23:20:47Z) - LHM: Large Animatable Human Reconstruction Model from a Single Image in Seconds [21.99354901986186]
We propose LHM (Large Animatable Human Reconstruction Model) to infer high-fidelity avatars represented as 3D Gaussian splatting in a feed-forward pass. Our model leverages a multimodal transformer architecture to effectively encode human body positional features and image features with an attention mechanism. Our LHM generates plausible animatable humans in seconds without post-processing for the face and hands, outperforming existing methods in both reconstruction accuracy and generalization ability.
arXiv Detail & Related papers (2025-03-13T17:59:21Z) - HumanGif: Single-View Human Diffusion with Generative Prior [25.516544735593087]
We propose HumanGif, a single-view human diffusion model with generative priors. Specifically, we formulate single-view-based 3D human novel view and pose synthesis as a single-view-conditioned human diffusion process. We show that HumanGif achieves the best perceptual performance, with better generalizability for novel view and pose synthesis.
arXiv Detail & Related papers (2025-02-17T17:55:27Z) - GAS: Generative Avatar Synthesis from a Single Image [54.95198111659466]
We introduce a generalizable and unified framework to synthesize view-consistent and temporally coherent avatars from a single image. Our approach combines the reconstruction power of regression-based 3D human reconstruction with the generative capabilities of a diffusion model, bridging the gap between the two.
arXiv Detail & Related papers (2025-02-10T19:00:39Z) - WonderHuman: Hallucinating Unseen Parts in Dynamic 3D Human Reconstruction [51.22641018932625]
We present WonderHuman to reconstruct dynamic human avatars from a monocular video for high-fidelity novel view synthesis. Our method achieves SOTA performance in producing photorealistic renderings from the given monocular video.
arXiv Detail & Related papers (2025-02-03T04:43:41Z) - AniGS: Animatable Gaussian Avatar from a Single Image with Inconsistent Gaussian Reconstruction [26.82525451095629]
We propose a robust method for 3D reconstruction from inconsistent images, enabling real-time rendering during inference. We recast the reconstruction problem as a 4D task and introduce an efficient 3D modeling approach using 4D Gaussian Splatting. Experiments demonstrate that our method achieves photorealistic, real-time animation of 3D human avatars from in-the-wild images.
arXiv Detail & Related papers (2024-12-03T18:55:39Z) - Deformable 3D Gaussian Splatting for Animatable Human Avatars [50.61374254699761]
We propose a fully explicit approach to construct a digital avatar from as little as a single monocular sequence.
ParDy-Human constitutes an explicit model for realistic dynamic human avatars which requires significantly fewer training views and images.
Our avatar learning is free of additional annotations such as Splat masks and can be trained with variable backgrounds while inferring full-resolution images efficiently even on consumer hardware.
arXiv Detail & Related papers (2023-12-22T20:56:46Z) - S3: Neural Shape, Skeleton, and Skinning Fields for 3D Human Modeling [103.65625425020129]
We represent the pedestrian's shape, pose and skinning weights as neural implicit functions that are directly learned from data.
We demonstrate the effectiveness of our approach on various datasets and show that our reconstructions outperform existing state-of-the-art methods.
arXiv Detail & Related papers (2021-01-17T02:16:56Z) - SMPLpix: Neural Avatars from 3D Human Models [56.85115800735619]
We bridge the gap between classic rendering and the latest generative networks operating in pixel space.
We train a network that directly converts a sparse set of 3D mesh vertices into photorealistic images.
We show the advantage over conventional differentiable renderers both in terms of the level of photorealism and rendering efficiency, as sketched below.
arXiv Detail & Related papers (2020-08-16T10:22:00Z)
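As a hedged illustration of the vertex-to-image idea the SMPLpix summary describes, here is a minimal sketch: project mesh vertices to the image plane, splat their features into a sparse 2D map, and let a small convolutional net render it densely. All names, sizes, and the tiny refinement net are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class VertexToImageSketch(nn.Module):
    """Hypothetical SMPLpix-style renderer: sparse vertex splats -> dense RGB."""

    def __init__(self, feat_dim=16, res=128):
        super().__init__()
        self.res = res
        # Stand-in for the paper's image-space rendering network.
        self.refine = nn.Sequential(
            nn.Conv2d(feat_dim, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1),
        )

    def forward(self, verts_2d, verts_feat):
        # verts_2d:   (N, 2) pixel coordinates of projected mesh vertices
        # verts_feat: (N, F) per-vertex features (e.g. RGB-d or learned)
        canvas = torch.zeros(verts_feat.shape[1], self.res, self.res)
        ix = verts_2d[:, 0].long().clamp(0, self.res - 1)
        iy = verts_2d[:, 1].long().clamp(0, self.res - 1)
        canvas[:, iy, ix] = verts_feat.t()        # nearest-pixel splatting
        return self.refine(canvas.unsqueeze(0))   # (1, 3, res, res)

model = VertexToImageSketch()
pts = torch.rand(6890, 2) * 128    # e.g. projected SMPL vertex positions
feats = torch.rand(6890, 16)       # per-vertex features
print(model(pts, feats).shape)     # torch.Size([1, 3, 128, 128])
```

The efficiency claim in the summary follows from this structure: rasterizing a few thousand vertices and running one 2D network is far cheaper than full differentiable mesh rendering.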