Controlling Avatar Diffusion with Learnable Gaussian Embedding
- URL: http://arxiv.org/abs/2503.15809v1
- Date: Thu, 20 Mar 2025 02:52:01 GMT
- Title: Controlling Avatar Diffusion with Learnable Gaussian Embedding
- Authors: Xuan Gao, Jingtao Zhou, Dongyu Liu, Yuqi Zhou, Juyong Zhang
- Abstract summary: We introduce a novel control signal representation that is optimizable, dense, expressive, and 3D consistent. We synthesize a large-scale dataset with multiple poses and identities. Our model outperforms existing methods in terms of realism, expressiveness, and 3D consistency.
- Score: 27.651478116386354
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in diffusion models have made significant progress in digital human generation. However, most existing models still struggle to maintain 3D consistency, temporal coherence, and motion accuracy. A key reason for these shortcomings is the limited representation ability of commonly used control signals (e.g., landmarks, depth maps, etc.). In addition, the lack of diversity in identity and pose variations in public datasets further hinders progress in this area. In this paper, we analyze the shortcomings of current control signals and introduce a novel control signal representation that is optimizable, dense, expressive, and 3D consistent. Our method embeds a learnable neural Gaussian onto a parametric head surface, which greatly enhances the consistency and expressiveness of diffusion-based head models. Regarding the dataset, we synthesize a large-scale dataset with multiple poses and identities. In addition, we use real/synthetic labels to effectively distinguish real and synthetic data, minimizing the impact of imperfections in synthetic data on the generated head images. Extensive experiments show that our model outperforms existing methods in terms of realism, expressiveness, and 3D consistency. Our code, synthetic datasets, and pre-trained models will be released on our project page: https://ustc3dv.github.io/Learn2Control/
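The paper's control signal anchors learnable neural Gaussians on a parametric head surface and renders them into a dense, 3D-consistent conditioning map. The sketch below is a rough, hypothetical illustration of that idea, not the authors' implementation: learnable per-point feature Gaussians are projected with a pinhole camera and soft-splatted into a 2D control map that could feed a ControlNet-style conditioning branch of a diffusion U-Net. All class, argument, and parameter names are invented.

```python
# Hypothetical sketch (not the paper's code): surface-anchored learnable
# Gaussians rendered into a dense 2D control map for a diffusion model.
import torch
import torch.nn as nn

class GaussianControlSignal(nn.Module):
    def __init__(self, num_points: int, feat_dim: int = 16, image_size: int = 64):
        super().__init__()
        # Learnable feature vector and log pixel-scale per surface Gaussian.
        self.features = nn.Parameter(torch.randn(num_points, feat_dim) * 0.01)
        self.log_scale = nn.Parameter(torch.zeros(num_points))
        self.image_size = image_size

    def forward(self, points_3d: torch.Tensor, K: torch.Tensor) -> torch.Tensor:
        """points_3d: (N, 3) head-surface points in camera space; K: (3, 3)
        intrinsics. Returns a (feat_dim, H, W) dense control map."""
        H = W = self.image_size
        # Pinhole projection to pixel coordinates.
        uvw = (K @ points_3d.T).T                       # (N, 3)
        uv = uvw[:, :2] / uvw[:, 2:3].clamp(min=1e-6)   # (N, 2)

        # Soft splat: weight every pixel by a Gaussian of its distance to uv.
        ys, xs = torch.meshgrid(
            torch.arange(H, dtype=torch.float32, device=points_3d.device),
            torch.arange(W, dtype=torch.float32, device=points_3d.device),
            indexing="ij",
        )
        grid = torch.stack([xs, ys], dim=-1).reshape(-1, 2)   # (H*W, 2)
        d2 = ((grid[None] - uv[:, None]) ** 2).sum(-1)        # (N, H*W)
        sigma2 = self.log_scale.exp()[:, None] ** 2
        w = torch.exp(-d2 / (2.0 * sigma2 + 1e-6))
        w = w / (w.sum(dim=0, keepdim=True) + 1e-6)           # normalize per pixel
        ctrl = (w.T @ self.features).T.reshape(-1, H, W)      # (feat_dim, H, W)
        return ctrl
```

In such a setup the control map would be optimized jointly with the diffusion backbone, and the real/synthetic label mentioned in the abstract could enter as an additional conditioning embedding; both choices are assumptions here, not details taken from the paper.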
Related papers
- Synthetic Prior for Few-Shot Drivable Head Avatar Inversion [61.51887011274453]
We present SynShot, a novel method for the few-shot inversion of a drivable head avatar based on a synthetic prior. Inspired by machine learning models trained solely on synthetic data, we propose a method that learns a prior model from a large dataset of synthetic heads.
arXiv Detail & Related papers (2025-01-12T19:01:05Z) - Drive-1-to-3: Enriching Diffusion Priors for Novel View Synthesis of Real Vehicles [81.29018359825872]
This paper consolidates a set of good practices to finetune large pretrained models for a real-world task. Specifically, we develop several strategies to account for discrepancies between the synthetic data and real driving data. Our insights lead to effective finetuning that results in a 68.8% reduction in FID for novel view synthesis over prior arts.
arXiv Detail & Related papers (2024-12-19T03:39:13Z) - SynthForge: Synthesizing High-Quality Face Dataset with Controllable 3D Generative Models [8.004597666699036]
Recent advancements in generative models have unlocked the capabilities to render photo-realistic data in a controllable fashion.
However, using the data generated by such models for training downstream tasks remains under-explored, mainly due to the lack of 3D-consistent annotations.
Our experiments show competitive performance against state-of-the-art models using only generated synthetic data, demonstrating potential for solving downstream tasks.
arXiv Detail & Related papers (2024-06-12T03:15:15Z) - IT3D: Improved Text-to-3D Generation with Explicit View Synthesis [71.68595192524843]
This study presents a novel strategy that leverages explicitly synthesized multi-view images to address these issues.
Our approach uses image-to-image pipelines, empowered by LDMs, to generate high-quality posed images.
For the incorporated discriminator, the synthesized multi-view images are considered real data, while the renderings of the optimized 3D models function as fake data.
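That real/fake assignment maps onto a standard GAN objective. The following is a hedged sketch of one way it could look (illustrative names and a plain non-saturating loss; not the IT3D implementation):

```python
# Hedged sketch: diffusion-synthesized multi-view images are treated as "real"
# and renderings of the optimized 3D model as "fake". `disc` is any image
# discriminator returning per-sample logits.
import torch
import torch.nn.functional as F

def discriminator_loss(disc, synthesized_views, rendered_views):
    real_logits = disc(synthesized_views)          # images from the LDM pipeline
    fake_logits = disc(rendered_views.detach())    # renders of the current 3D model
    loss_real = F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
    loss_fake = F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits))
    return loss_real + loss_fake

def generator_loss(disc, rendered_views):
    # Pushes the 3D model's renderings toward the synthesized-view distribution.
    fake_logits = disc(rendered_views)
    return F.binary_cross_entropy_with_logits(fake_logits, torch.ones_like(fake_logits))
```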
arXiv Detail & Related papers (2023-08-22T14:39:17Z) - RGB-D Mapping and Tracking in a Plenoxel Radiance Field [5.239559610798646]
We present the vital differences between view synthesis models and 3D reconstruction models.
We also comment on why a depth sensor is essential for modeling accurate geometry in general outward-facing scenes.
Our method achieves state-of-the-art results in both mapping and tracking tasks, while also being faster than competing neural network-based approaches.
arXiv Detail & Related papers (2023-07-07T06:05:32Z) - A New Benchmark: On the Utility of Synthetic Data with Blender for Bare Supervised Learning and Downstream Domain Adaptation [42.2398858786125]
Deep learning in computer vision has achieved great success at the price of large-scale labeled training data.
The uncontrollable data collection process produces non-IID training and test data, where undesired duplication may exist.
To circumvent them, an alternative is to generate synthetic data via 3D rendering with domain randomization.
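For concreteness, domain randomization amounts to sampling the nuisance factors of each rendered scene so that none of them correlates with the label. A small, hypothetical parameter sampler is sketched below; the `render` call and all parameter names are assumptions, not details from the paper:

```python
# Hedged illustration of domain randomization for synthetic rendering.
import random

def sample_randomized_scene(object_id: str) -> dict:
    return {
        "object": object_id,                                # label-bearing content
        "camera_azimuth_deg": random.uniform(0.0, 360.0),   # random viewpoint
        "camera_elevation_deg": random.uniform(-10.0, 60.0),
        "light_intensity": random.uniform(0.3, 3.0),        # random illumination
        "light_color": [random.uniform(0.8, 1.0) for _ in range(3)],
        "background_texture": random.choice(["noise", "indoor", "outdoor"]),
        "material_roughness": random.uniform(0.0, 1.0),
    }

# Usage (assumed renderer interface):
# images = [render(sample_randomized_scene("car")) for _ in range(10_000)]
```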
arXiv Detail & Related papers (2023-03-16T09:03:52Z) - Person Image Synthesis via Denoising Diffusion Model [116.34633988927429]
We show how denoising diffusion models can be applied for high-fidelity person image synthesis.
Our results on two large-scale benchmarks and a user study demonstrate the photorealism of our proposed approach under challenging scenarios.
arXiv Detail & Related papers (2022-11-22T18:59:50Z) - DreamFusion: Text-to-3D using 2D Diffusion [52.52529213936283]
Recent breakthroughs in text-to-image synthesis have been driven by diffusion models trained on billions of image-text pairs.
In this work, we circumvent these limitations by using a pretrained 2D text-to-image diffusion model to perform text-to-3D synthesis.
Our approach requires no 3D training data and no modifications to the image diffusion model, demonstrating the effectiveness of pretrained image diffusion models as priors.
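The mechanism behind this line of work is score distillation: the pretrained 2D model denoises noisy renderings of the 3D scene, and its denoising error is pushed back into the 3D parameters. A minimal sketch under assumed interfaces (`denoiser`, `alphas_cumprod`, a differentiable `rendering`); this is the general SDS recipe, not DreamFusion's actual code:

```python
# Hedged sketch of a score-distillation-style (SDS) loss.
import torch
import torch.nn.functional as F

def sds_loss(denoiser, alphas_cumprod, rendering, text_embedding):
    """rendering: (B, C, H, W) differentiable render of the 3D representation."""
    B = rendering.shape[0]
    T = alphas_cumprod.shape[0]
    t = torch.randint(0, T, (B,), device=rendering.device)
    a = alphas_cumprod[t].view(B, 1, 1, 1)
    noise = torch.randn_like(rendering)
    noisy = a.sqrt() * rendering + (1.0 - a).sqrt() * noise

    with torch.no_grad():
        noise_pred = denoiser(noisy, t, text_embedding)  # frozen 2D diffusion model

    # The SDS gradient is w(t) * (noise_pred - noise); realize it through a
    # detached target so autograd yields exactly that gradient w.r.t. `rendering`.
    w = 1.0 - a
    grad = w * (noise_pred - noise)
    target = (rendering - grad).detach()
    return 0.5 * F.mse_loss(rendering, target, reduction="sum") / B
```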
arXiv Detail & Related papers (2022-09-29T17:50:40Z) - Secrets of 3D Implicit Object Shape Reconstruction in the Wild [92.5554695397653]
Reconstructing high-fidelity 3D objects from sparse, partial observation is crucial for various applications in computer vision, robotics, and graphics.
Recent neural implicit modeling methods show promising results on synthetic or dense datasets.
However, they perform poorly on real-world data, which is sparse and noisy.
This paper analyzes the root cause of such deficient performance of a popular neural implicit model.
arXiv Detail & Related papers (2021-01-18T03:24:48Z)