GST: Precise 3D Human Body from a Single Image with Gaussian Splatting Transformers
- URL: http://arxiv.org/abs/2409.04196v2
- Date: Wed, 16 Apr 2025 14:37:31 GMT
- Title: GST: Precise 3D Human Body from a Single Image with Gaussian Splatting Transformers
- Authors: Lorenza Prospero, Abdullah Hamdi, Joao F. Henriques, Christian Rupprecht
- Abstract summary: Reconstructing posed 3D human models from monocular images has important applications in the sports industry. We combine 3D human pose and shape estimation with 3D Gaussian Splatting (3DGS), a representation of the scene composed of a mixture of Gaussians. We show that this combination can achieve near real-time inference of 3D human models from a single image without expensive diffusion models or 3D point supervision.
- Score: 23.96688843662126
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reconstructing posed 3D human models from monocular images has important applications in the sports industry, including performance tracking, injury prevention and virtual training. In this work, we combine 3D human pose and shape estimation with 3D Gaussian Splatting (3DGS), a representation of the scene composed of a mixture of Gaussians. This allows training or fine-tuning a human model predictor on multi-view images alone, without 3D ground truth. Predicting such mixtures for a human from a single input image is challenging due to self-occlusions and dependence on articulations, while also needing to retain enough flexibility to accommodate a variety of clothes and poses. Our key observation is that the vertices of standardized human meshes (such as SMPL) can provide an adequate spatial density and approximate initial position for the Gaussians. We can then train a transformer model to jointly predict comparatively small adjustments to these positions, as well as the other 3DGS attributes and the SMPL parameters. We show empirically that this combination (using only multi-view supervision) can achieve near real-time inference of 3D human models from a single image without expensive diffusion models or 3D point supervision, thus making it ideal for the sports industry at any level. More importantly, rendering is an effective auxiliary objective to refine 3D pose estimation by accounting for clothes and other geometric variations. The code is available at https://github.com/prosperolo/GST.
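As a hedged illustration of this design (a minimal sketch, not the authors' released code; the feature backbone, dimensions, offset scale, and all names below are assumptions), one Gaussian can be anchored at each SMPL vertex while a transformer decoder predicts positional offsets plus the remaining 3DGS attributes from image tokens:

```python
# Minimal sketch of the GST idea described in the abstract, under assumed
# shapes and names: one 3D Gaussian per SMPL vertex, with a transformer
# predicting small positional offsets plus the remaining 3DGS attributes.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_VERTS = 6890  # SMPL template vertex count
DIM = 256         # assumed token width

class GSTSketch(nn.Module):
    def __init__(self, dim=DIM, layers=2, heads=8):
        super().__init__()
        self.embed = nn.Linear(3, dim)  # lift vertex xyz to a query token
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=layers)
        # Per-Gaussian heads for the 3DGS attributes.
        self.offset = nn.Linear(dim, 3)    # small xyz adjustment
        self.scale = nn.Linear(dim, 3)     # log-scales
        self.rot = nn.Linear(dim, 4)       # unnormalized quaternion
        self.opacity = nn.Linear(dim, 1)
        self.color = nn.Linear(dim, 3)     # RGB (SH coefficients in general)

    def forward(self, image_tokens, smpl_verts):
        # image_tokens: (B, N, DIM) patch features from any image encoder.
        # smpl_verts:   (B, 6890, 3) vertices of the posed SMPL estimate,
        #               which give the Gaussians a good initial position.
        tokens = self.decoder(tgt=self.embed(smpl_verts), memory=image_tokens)
        # Bounded offsets keep each Gaussian near its anchor vertex.
        means = smpl_verts + 0.05 * torch.tanh(self.offset(tokens))
        return {
            "means": means,
            "scales": torch.exp(self.scale(tokens)),
            "rotations": F.normalize(self.rot(tokens), dim=-1),
            "opacities": torch.sigmoid(self.opacity(tokens)),
            "colors": torch.sigmoid(self.color(tokens)),
        }

# Training would rasterize these Gaussians with a differentiable 3DGS
# renderer and compare against held-out views (multi-view supervision only);
# the full model also jointly refines the SMPL parameters, omitted here.
model = GSTSketch()
feats = torch.randn(1, 196, DIM)       # e.g. ViT patch tokens (assumed)
verts = torch.randn(1, NUM_VERTS, 3)   # stand-in for real SMPL vertices
gaussians = model(feats, verts)
```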
Related papers
- SIGMAN:Scaling 3D Human Gaussian Generation with Millions of Assets [72.26350984924129]
We propose a latent space generation paradigm for 3D human digitization.
We transform the ill-posed low-to-high-dimensional mapping problem into a learnable distribution shift.
We employ a multi-view optimization approach combined with synthetic data to construct the HGS-1M dataset.
arXiv Detail & Related papers (2025-04-09T15:38:18Z)
- HuGDiffusion: Generalizable Single-Image Human Rendering via 3D Gaussian Diffusion [50.02316409061741]
HuGDiffusion is a learning pipeline to achieve novel view synthesis (NVS) of human characters from single-view input images.
We aim to generate the set of 3DGS attributes via a diffusion-based framework conditioned on human priors extracted from a single image.
Our HuGDiffusion shows significant performance improvements over the state-of-the-art methods.
arXiv Detail & Related papers (2025-01-25T01:00:33Z)
- iHuman: Instant Animatable Digital Humans From Monocular Videos [16.98924995658091]
We present a fast, simple, yet effective method for creating animatable 3D digital humans from monocular videos.
This work demonstrates the need for accurate 3D mesh-based modelling of the human body.
Our method is faster by an order of magnitude (in terms of training time) than its closest competitor.
arXiv Detail & Related papers (2024-07-15T18:51:51Z)
- Neural Localizer Fields for Continuous 3D Human Pose and Shape Estimation [32.30055363306321]
We propose a paradigm for seamlessly unifying different human pose and shape-related tasks and datasets.
Our formulation is centered on the ability to query any point of the human volume and obtain its estimated location in 3D.
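As a loose sketch of such a queryable formulation (purely illustrative; the actual architecture, conditioning, and dimensions are not taken from the paper), a small MLP conditioned on pooled image features can map any continuous query point of a canonical human volume to its predicted 3D location:

```python
# Illustrative sketch only: a field that, conditioned on image features,
# maps any continuous query point of a canonical human volume to its
# estimated 3D location. All names and sizes here are assumptions.
import torch
import torch.nn as nn

class LocalizerFieldSketch(nn.Module):
    def __init__(self, feat_dim=256, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),  # predicted 3D location of the query
        )

    def forward(self, query_points, image_feat):
        # query_points: (B, Q, 3) continuous points in the canonical volume.
        # image_feat:   (B, feat_dim) pooled features of the input image.
        feat = image_feat.unsqueeze(1).expand(-1, query_points.shape[1], -1)
        return self.mlp(torch.cat([query_points, feat], dim=-1))

# Any landmark convention (joints, mesh vertices, surface points) can be
# expressed as a particular set of query points, which is what would let
# one model unify heterogeneous pose/shape datasets.
field = LocalizerFieldSketch()
out = field(torch.rand(1, 24, 3), torch.randn(1, 256))  # -> (1, 24, 3)
```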
arXiv Detail & Related papers (2024-07-10T10:44:18Z)
- Generalizable Human Gaussians from Single-View Image [52.100234836129786]
We introduce a single-view generalizable Human Gaussian Model (HGM).
Our approach uses a ControlNet to refine rendered back-view images from coarse predicted human Gaussians.
To mitigate the potential generation of unrealistic human poses and shapes, we incorporate human priors from the SMPL-X model as a dual branch.
arXiv Detail & Related papers (2024-06-10T06:38:11Z)
- 3D Human Reconstruction in the Wild with Synthetic Data Using Generative Models [52.96248836582542]
We propose an effective approach based on recent diffusion models, termed HumanWild, which can effortlessly generate human images and corresponding 3D mesh annotations.
By exclusively employing generative models, we generate large-scale in-the-wild human images and high-quality annotations, eliminating the need for real-world data collection.
arXiv Detail & Related papers (2024-03-17T06:31:16Z)
- Deformable 3D Gaussian Splatting for Animatable Human Avatars [50.61374254699761]
We propose a fully explicit approach to construct a digital avatar from as little as a single monocular sequence.
ParDy-Human constitutes an explicit model for realistic dynamic human avatars which requires significantly fewer training views and images.
Our avatar learning is free of additional annotations such as Splat masks, can be trained with variable backgrounds, and infers full-resolution images efficiently even on consumer hardware.
arXiv Detail & Related papers (2023-12-22T20:56:46Z)
- GauHuman: Articulated Gaussian Splatting from Monocular Human Videos [58.553979884950834]
GauHuman is a 3D human model with Gaussian Splatting for both fast training (1~2 minutes) and real-time rendering (up to 189 FPS).
GauHuman encodes Gaussian Splatting in the canonical space and transforms 3D Gaussians from canonical space to posed space with linear blend skinning (LBS), as sketched below.
Experiments on ZJU-MoCap and MonoCap datasets demonstrate that GauHuman achieves state-of-the-art performance quantitatively and qualitatively with fast training and real-time rendering speed.
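A minimal LBS sketch for moving canonical Gaussian centers into posed space, assuming per-Gaussian skinning weights over K joints; GauHuman's exact formulation (e.g. how the weights are obtained and how covariances are handled) may differ, this only shows the core map:

```python
# Linear blend skinning (LBS) sketch: blend per-joint rigid transforms with
# per-Gaussian weights, then map canonical centers to posed space.
import torch

def lbs_points(means_canon, weights, joint_tfms):
    """means_canon: (N, 3) canonical Gaussian centers.
    weights:     (N, K) convex skinning weights per Gaussian.
    joint_tfms:  (K, 4, 4) rigid canonical-to-posed joint transforms.
    Returns (N, 3) posed centers."""
    # One blended 4x4 matrix per Gaussian.
    blended = torch.einsum("nk,kij->nij", weights, joint_tfms)  # (N, 4, 4)
    homo = torch.cat([means_canon,
                      torch.ones(means_canon.shape[0], 1)], dim=-1)
    posed = torch.einsum("nij,nj->ni", blended, homo)           # (N, 4)
    return posed[:, :3]

# Each Gaussian's rotation/covariance should also be transformed by the
# rotation part of its blended matrix; omitted here for brevity.
pts = lbs_points(torch.rand(100, 3),
                 torch.softmax(torch.rand(100, 24), dim=-1),
                 torch.eye(4).expand(24, 4, 4))
```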
arXiv Detail & Related papers (2023-12-05T18:59:14Z)
- HumanGaussian: Text-Driven 3D Human Generation with Gaussian Splatting [113.37908093915837]
Existing methods optimize 3D representations like mesh or neural fields via score distillation sampling (SDS), which suffers from inadequate fine details or excessive training time.
In this paper, we propose an efficient yet effective framework, HumanGaussian, that generates high-quality 3D humans with fine-grained geometry and realistic appearance.
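For context, the SDS objective referenced here (in standard DreamFusion-style notation, not taken from the HumanGaussian paper itself) backpropagates a frozen denoiser's residual into the 3D parameters:

```latex
\nabla_{\theta}\,\mathcal{L}_{\mathrm{SDS}}
  = \mathbb{E}_{t,\epsilon}\!\left[\, w(t)\,
    \bigl(\hat{\epsilon}_{\phi}(x_t;\, y,\, t) - \epsilon\bigr)\,
    \frac{\partial x}{\partial \theta} \,\right]
```

Here x is the image rendered from the 3D parameters θ, x_t its noised version at timestep t, y the text prompt, ε̂_φ the frozen diffusion model's noise prediction, and w(t) a timestep weighting; the mode-seeking behavior of this gradient is a common explanation for the over-smoothed details and slow convergence mentioned above.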
arXiv Detail & Related papers (2023-11-28T18:59:58Z)
- Animatable 3D Gaussians for High-fidelity Synthesis of Human Motions [37.50707388577952]
We present a novel animatable 3D Gaussian model for rendering high-fidelity free-view human motions in real time.
Compared to existing NeRF-based methods, the model better captures high-frequency details without the jittering problem across video frames.
arXiv Detail & Related papers (2023-11-22T14:00:23Z)
- SplatArmor: Articulated Gaussian splatting for animatable humans from monocular RGB videos [15.74530749823217]
We propose SplatArmor, a novel approach for recovering detailed and animatable human models by 'armoring' a parameterized body model with 3D Gaussians.
Our approach represents the human as a set of 3D Gaussians within a canonical space, whose articulation is defined by extending the skinning of the underlying SMPL geometry.
We show compelling results on the ZJU-MoCap and People-Snapshot datasets, which underscore the effectiveness of our method for controllable human synthesis.
arXiv Detail & Related papers (2023-11-17T18:47:07Z)
- Drivable 3D Gaussian Avatars [26.346626608626057]
Current drivable avatars require either accurate 3D registrations during training, dense input images during testing, or both.
This work uses the recently presented 3D Gaussian Splatting (3DGS) technique to render realistic humans at real-time framerates.
Given their smaller size, we drive these deformations with joint angles and keypoints, which are more suitable for communication applications.
arXiv Detail & Related papers (2023-11-14T22:54:29Z)
- AvatarGen: A 3D Generative Model for Animatable Human Avatars [108.11137221845352]
AvatarGen is an unsupervised method for generating 3D-aware clothed humans with various appearances and controllable geometries.
Our method can generate animatable 3D human avatars with high-quality appearance and geometry modeling.
It is competent for many applications, e.g., single-view reconstruction, re-animation, and text-guided synthesis/editing.
arXiv Detail & Related papers (2022-11-26T15:15:45Z)
- UltraPose: Synthesizing Dense Pose with 1 Billion Points by Human-body Decoupling 3D Model [58.70130563417079]
We introduce a new 3D human-body model with a series of decoupled parameters that could freely control the generation of the body.
Compared to the existing manually annotated DensePose-COCO dataset, the synthetic UltraPose has ultra dense image-to-surface correspondences without annotation cost and error.
arXiv Detail & Related papers (2021-10-28T16:24:55Z)
- Self-Supervised 3D Human Pose Estimation via Part Guided Novel Image Synthesis [72.34794624243281]
We propose a self-supervised learning framework to disentangle variations from unlabeled video frames.
Our differentiable formalization, bridging the representation gap between the 3D pose and spatial part maps, allows us to operate on videos with diverse camera movements.
arXiv Detail & Related papers (2020-04-09T07:55:01Z)