ScaleDreamer: Scalable Text-to-3D Synthesis with Asynchronous Score Distillation
- URL: http://arxiv.org/abs/2407.02040v1
- Date: Tue, 2 Jul 2024 08:12:14 GMT
- Title: ScaleDreamer: Scalable Text-to-3D Synthesis with Asynchronous Score Distillation
- Authors: Zhiyuan Ma, Yuxiang Wei, Yabin Zhang, Xiangyu Zhu, Zhen Lei, Lei Zhang
- Abstract summary: By leveraging text-to-image diffusion priors, score distillation can synthesize 3D content without paired text-3D training data.
Current score distillation methods are hard to scale up to a large number of text prompts.
We propose Asynchronous Score Distillation (ASD), which minimizes the noise prediction error by shifting the diffusion timestep to earlier ones.
- Score: 41.88337159350505
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: By leveraging text-to-image diffusion priors, score distillation can synthesize 3D content without paired text-3D training data. Instead of spending hours of online optimization per text prompt, recent studies have focused on learning a text-to-3D generative network that amortizes multiple text-3D relations and can synthesize 3D content in seconds. However, existing score distillation methods are hard to scale up to a large number of text prompts, due to the difficulty of aligning the pretrained diffusion prior with the distribution of images rendered from various text prompts. Current state-of-the-art methods such as Variational Score Distillation (VSD) finetune the pretrained diffusion model to minimize the noise prediction error and thereby align the distributions; this finetuning, however, is unstable to train and impairs the model's comprehension of numerous text prompts. Based on the observation that diffusion models tend to have lower noise prediction errors at earlier timesteps, we propose Asynchronous Score Distillation (ASD), which minimizes the noise prediction error by shifting the diffusion timestep to earlier ones. ASD is stable to train and can scale up to 100k prompts. It reduces the noise prediction error without changing the weights of the pretrained diffusion model, thus preserving its strong comprehension of prompts. We conduct extensive experiments across different 2D diffusion models, including Stable Diffusion and MVDream, and text-to-3D generators, including Hyper-iNGP, 3DConv-Net and Triplane-Transformer. The results demonstrate ASD's effectiveness in stable 3D generator training, high-quality 3D content synthesis, and superior prompt consistency, especially on large prompt corpora.
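The mechanism described above can be made concrete. Where VSD finetunes the diffusion model to shrink the noise prediction error on rendered images, ASD keeps the model frozen and instead evaluates the second score term at a shifted timestep where that error is naturally lower. The following is a minimal PyTorch-style sketch of one ASD training loss, not the authors' implementation: `eps_model` (a frozen noise-prediction UNet with signature `eps_model(x_t, t, cond)`), the weighting `w`, and the scalar shift `delta_t` are assumptions, and the shift schedule and sign convention are simplified relative to the paper.

```python
import torch
import torch.nn.functional as F

def asd_loss(eps_model, x, text_emb, alphas_cumprod, t, delta_t, w=1.0):
    """Asynchronous Score Distillation loss for one batch of renderings (sketch).

    eps_model       -- frozen, pretrained noise predictor eps(x_t, t, cond)
    x               -- images rendered from the 3D generator (requires grad)
    alphas_cumprod  -- DDPM cumulative-alpha schedule, shape [T]
    t, delta_t      -- sampled timesteps [B] and the ASD timestep shift
    """
    noise = torch.randn_like(x)

    def diffuse(step):
        # Forward-diffuse the rendering to the given timestep with shared noise.
        a = alphas_cumprod[step].view(-1, 1, 1, 1)
        return a.sqrt() * x + (1.0 - a).sqrt() * noise

    with torch.no_grad():
        eps_t = eps_model(diffuse(t), t, text_emb)                  # score term at t
        t_shift = (t + delta_t).clamp(0, len(alphas_cumprod) - 1)   # shifted timestep
        eps_shift = eps_model(diffuse(t_shift), t_shift, text_emb)  # lower-error term

    # Gradient w(t) * (eps_t - eps_shift) flows into the generator through x.
    grad = w * (eps_t - eps_shift)
    target = (x - grad).detach()
    return 0.5 * F.mse_loss(x, target, reduction="sum")
```

A generator step would render images `x` with gradients enabled, compute this loss, and backpropagate. Since both UNet evaluations sit under `no_grad`, the pretrained weights, and hence the model's prompt comprehension, are untouched.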
Related papers
- FlowDreamer: Exploring High Fidelity Text-to-3D Generation via Rectified Flow [17.919092916953183]
We propose a novel framework, named FlowDreamer, which yields high-fidelity results with richer textual details and faster convergence.
The key insight is to leverage the coupling and reversible properties of the rectified flow model to search for the corresponding noise (a sketch of this inversion follows this entry).
We introduce a novel Unique Couple Matching (UCM) loss, which guides the 3D model to optimize along the same trajectory.
arXiv Detail & Related papers (2024-08-09T11:40:20Z)
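FlowDreamer's "search for the corresponding noise" relies on rectified flow defining a reversible ODE between images and noise. Below is a minimal sketch of that idea, assuming a velocity predictor `v_model(x, t, cond)` trained under the convention x_t = (1 - t) * x0 + t * noise; the function name, API, and step count are illustrative, not FlowDreamer's actual procedure.

```python
import torch

@torch.no_grad()
def invert_to_noise(v_model, x0, cond, num_steps=20):
    """Recover the noise coupled to x0 under a rectified-flow model.

    Rectified flow defines x_t = (1 - t) * x0 + t * noise, with the model
    predicting the velocity v = noise - x0.  Because the flow ODE
    dx/dt = v(x, t) is reversible, integrating from t=0 (image) to t=1
    (noise) recovers the matching noise sample.
    """
    x = x0.clone()
    ts = torch.linspace(0.0, 1.0, num_steps + 1, device=x0.device)
    for i in range(num_steps):
        t = ts[i].expand(x0.shape[0])       # per-sample timestep
        dt = ts[i + 1] - ts[i]
        x = x + dt * v_model(x, t, cond)    # forward Euler step toward noise
    return x
```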
- 3DTopia: Large Text-to-3D Generation Model with Hybrid Diffusion Priors [85.11117452560882]
We present a two-stage text-to-3D generation system, named 3DTopia, which generates high-quality general 3D assets within 5 minutes using hybrid diffusion priors.
The first stage samples from a 3D diffusion prior learned directly from 3D data. Specifically, it is powered by a text-conditioned tri-plane latent diffusion model, which quickly generates coarse 3D samples for fast prototyping.
The second stage uses 2D diffusion priors to further refine the texture of the coarse 3D models from the first stage; the refinement optimizes in both latent and pixel space for high-quality texture generation (see the pipeline sketch after this entry).
arXiv Detail & Related papers (2024-03-04T17:26:28Z)
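The two-stage design above maps onto a simple coarse-then-refine pipeline. The sketch below is purely illustrative: `triplane_diffusion`, `decode_triplane`, and `refine_texture_2d` are hypothetical stand-ins for 3DTopia's stages, not its real API.

```python
def generate_3d(prompt, triplane_diffusion, decode_triplane, refine_texture_2d):
    """Two-stage text-to-3D in the style of 3DTopia (hypothetical API)."""
    # Stage 1: fast coarse sampling from a text-conditioned tri-plane
    # latent diffusion prior trained directly on 3D data.
    triplane_latent = triplane_diffusion.sample(prompt)
    coarse_asset = decode_triplane(triplane_latent)

    # Stage 2: texture refinement with 2D diffusion priors, in both
    # latent and pixel space.
    refined_asset = refine_texture_2d(coarse_asset, prompt, spaces=("latent", "pixel"))
    return refined_asset
```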
- Text-Image Conditioned Diffusion for Consistent Text-to-3D Generation [28.079441901818296]
We propose a text-to-3D method for Neural Radiance Fields (NeRFs) that explicitly enforces fine-grained view consistency.
Our method achieves state-of-the-art performance over existing text-to-3D methods.
arXiv Detail & Related papers (2023-12-19T01:09:49Z)
- SwiftBrush: One-Step Text-to-Image Diffusion Model with Variational Score Distillation [1.5892730797514436]
Text-to-image diffusion models often suffer from slow iterative sampling processes.
We present a novel image-free distillation scheme named SwiftBrush.
SwiftBrush achieves an FID score of 16.67 and a CLIP score of 0.29 on the COCO-30K benchmark.
arXiv Detail & Related papers (2023-12-08T18:44:09Z)
- Instant3D: Fast Text-to-3D with Sparse-View Generation and Large Reconstruction Model [68.98311213582949]
We propose Instant3D, a novel method that generates high-quality and diverse 3D assets from text prompts in a feed-forward manner.
Our method can generate diverse 3D assets of high visual quality within 20 seconds, two orders of magnitude faster than previous optimization-based methods.
arXiv Detail & Related papers (2023-11-10T18:03:44Z)
- ProlificDreamer: High-Fidelity and Diverse Text-to-3D Generation with Variational Score Distillation [48.59711140119368]
We present variational score distillation (VSD) to explain and address issues in text-to-3D generation (a sketch of the VSD update follows this entry).
Our overall approach, dubbed ProlificDreamer, can generate NeRFs at high rendering resolution (i.e., $512 \times 512$) with high fidelity, rich structure, and complex effects.
arXiv Detail & Related papers (2023-05-25T16:19:18Z)
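VSD is the baseline that the main paper's ASD contrasts against: it trains a second, typically LoRA-augmented, copy of the noise predictor on the rendered images and uses the difference between the pretrained and finetuned predictions as the generator gradient. Below is a minimal PyTorch-style sketch under the same assumed interfaces as the ASD sketch above; camera conditioning and the alternating optimization loop are omitted.

```python
import torch
import torch.nn.functional as F

def vsd_losses(eps_pretrained, eps_lora, x, text_emb, alphas_cumprod, t, w=1.0):
    """Variational Score Distillation (sketch): returns (generator loss, LoRA loss)."""
    noise = torch.randn_like(x)
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    x_t = a.sqrt() * x + (1.0 - a).sqrt() * noise   # forward-diffused rendering

    with torch.no_grad():
        eps_prior = eps_pretrained(x_t, t, text_emb)   # frozen text-to-image score
        eps_render = eps_lora(x_t, t, text_emb)        # score of rendered distribution

    # Generator update: pull renderings toward the pretrained prior.
    grad = w * (eps_prior - eps_render)
    gen_loss = 0.5 * F.mse_loss(x, (x - grad).detach(), reduction="sum")

    # LoRA update: standard denoising loss on the (detached) noisy renderings.
    lora_loss = F.mse_loss(eps_lora(x_t.detach(), t, text_emb), noise)
    return gen_loss, lora_loss
```

The `lora_loss` term is the finetuning step that the main paper identifies as unstable when scaling to many prompts; ASD removes it by shifting timesteps instead.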
- 3D-CLFusion: Fast Text-to-3D Rendering with Contrastive Latent Diffusion [55.71215821923401]
We tackle the task of text-to-3D creation with pre-trained latent-based NeRFs (NeRFs that generate 3D objects given an input latent code).
We propose a novel method named 3D-CLFusion, which leverages pre-trained latent-based NeRFs and performs fast 3D content creation in less than a minute.
arXiv Detail & Related papers (2023-03-21T15:38:26Z)
- DreamFusion: Text-to-3D using 2D Diffusion [52.52529213936283]
Recent breakthroughs in text-to-image synthesis have been driven by diffusion models trained on billions of image-text pairs. Adapting this approach to 3D synthesis would require large-scale datasets of labeled 3D data and efficient architectures for denoising 3D data, neither of which currently exists.
In this work, we circumvent these limitations by using a pretrained 2D text-to-image diffusion model to perform text-to-3D synthesis.
Our approach requires no 3D training data and no modifications to the image diffusion model, demonstrating the effectiveness of pretrained image diffusion models as priors (see the SDS sketch after this entry).
arXiv Detail & Related papers (2022-09-29T17:50:40Z)
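DreamFusion's mechanism is Score Distillation Sampling (SDS): noise a rendering, ask the frozen 2D model to predict that noise, and push the renderer's parameters along the prediction residual. Below is a minimal PyTorch-style sketch under the same assumed `eps_model` interface as the ASD sketch above.

```python
import torch
import torch.nn.functional as F

def sds_loss(eps_model, x, text_emb, alphas_cumprod, t, w=1.0):
    """Score Distillation Sampling: gradient w(t) * (eps_pred - noise)."""
    noise = torch.randn_like(x)
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    x_t = a.sqrt() * x + (1.0 - a).sqrt() * noise   # forward-diffused rendering

    with torch.no_grad():
        eps_pred = eps_model(x_t, t, text_emb)      # frozen 2D prior's prediction

    grad = w * (eps_pred - noise)
    # Surrogate MSE whose gradient w.r.t. x equals `grad`.
    return 0.5 * F.mse_loss(x, (x - grad).detach(), reduction="sum")
```

ASD and VSD, sketched earlier, can both be read as replacing the `noise` baseline in this residual with a lower-error estimate of the rendered distribution's score.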
This list is automatically generated from the titles and abstracts of the papers on this site.