Instant3D: Instant Text-to-3D Generation
- URL: http://arxiv.org/abs/2311.08403v2
- Date: Mon, 29 Apr 2024 04:47:25 GMT
- Title: Instant3D: Instant Text-to-3D Generation
- Authors: Ming Li, Pan Zhou, Jia-Wei Liu, Jussi Keppo, Min Lin, Shuicheng Yan, Xiangyu Xu
- Abstract summary: We propose a novel framework for fast text-to-3D generation, dubbed Instant3D.
Instant3D is able to create a 3D object for an unseen text prompt in less than one second with a single run of a feedforward network.
- Score: 101.25562463919795
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-3D generation has attracted much attention from the computer vision community. Existing methods mainly optimize a neural field from scratch for each text prompt, incurring a heavy, repetitive training cost that impedes their practical deployment. In this paper, we propose a novel framework for fast text-to-3D generation, dubbed Instant3D. Once trained, Instant3D can create a 3D object for an unseen text prompt in less than one second with a single run of a feedforward network. We achieve this remarkable speed by devising a new network that directly constructs a 3D triplane from a text prompt. The core innovation of Instant3D lies in our exploration of strategies to effectively inject text conditions into the network. In particular, we propose to combine three key mechanisms: cross-attention, style injection, and token-to-plane transformation, which collectively ensure precise alignment of the output with the input text. Furthermore, we propose a simple yet effective activation function, the scaled-sigmoid, to replace the original sigmoid, which speeds up training convergence by more than ten times. Finally, to address the Janus (multi-head) problem in 3D generation, we propose an adaptive Perp-Neg algorithm that dynamically adjusts its concept-negation scales according to the severity of the Janus problem during training, effectively reducing the multi-head effect. Extensive experiments on a wide variety of benchmark datasets demonstrate that the proposed algorithm performs favorably against state-of-the-art methods both qualitatively and quantitatively, while achieving significantly better efficiency. The code, data, and models are available at https://github.com/ming1993li/Instant3DCodes.
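The abstract names two mechanisms concrete enough to sketch: the scaled-sigmoid activation and the adaptive Perp-Neg negation schedule. The abstract does not give their exact functional forms, so the slope factor `alpha` and the linear severity-to-scale mapping below are illustrative assumptions, not the paper's definitions.

```python
import torch

def scaled_sigmoid(x: torch.Tensor, alpha: float = 10.0) -> torch.Tensor:
    # Sigmoid with a rescaled slope. The abstract only states that a
    # "scaled-sigmoid" replaces the standard sigmoid to accelerate
    # convergence; this particular form and the value of alpha are
    # assumptions for illustration.
    return torch.sigmoid(alpha * x)

def adaptive_negation_scale(base_scale: float, janus_severity: float) -> float:
    # Hypothetical adaptive Perp-Neg schedule: the abstract says the
    # concept-negation scale is adjusted dynamically with the severity
    # of the Janus (multi-head) problem during training. A clamped
    # linear ramp, with janus_severity assumed normalized to [0, 1],
    # is one simple realization; the paper's actual schedule may differ.
    severity = max(0.0, min(1.0, janus_severity))
    return base_scale * severity
```

Both sketches are deliberately minimal; the point is only to show where a tunable activation slope and a severity-driven negation weight would enter a training loop.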
Related papers
- VividDreamer: Towards High-Fidelity and Efficient Text-to-3D Generation [69.68568248073747]
We propose Pose-dependent Consistency Distillation Sampling (PCDS), a novel and efficient objective for diffusion-based 3D generation tasks.
PCDS builds a pose-dependent consistency function within diffusion trajectories, allowing true gradients to be approximated with minimal sampling steps.
For efficient generation, we propose a coarse-to-fine optimization strategy, which first utilizes 1-step PCDS to create the basic structure of 3D objects, and then gradually increases PCDS steps to generate fine-grained details.
arXiv Detail & Related papers (2024-06-21T08:21:52Z)
- DIRECT-3D: Learning Direct Text-to-3D Generation on Massive Noisy 3D Data [50.164670363633704]
We present DIRECT-3D, a diffusion-based 3D generative model for creating high-quality 3D assets from text prompts.
Our model is trained directly on extensive, noisy, and unaligned 'in-the-wild' 3D assets.
We achieve state-of-the-art performance in both single-class generation and text-to-3D generation.
arXiv Detail & Related papers (2024-06-06T17:58:15Z)
- BrightDreamer: Generic 3D Gaussian Generative Framework for Fast Text-to-3D Synthesis [10.151307760539071]
This paper presents BrightDreamer, an end-to-end feed-forward approach that can achieve generalizable and fast (77 ms) text-to-3D generation.
We first propose a Text-guided Shape Deformation (TSD) network to predict the deformed shape and its new positions.
We then design a novel Text-guided Triplane Generator (TTG) to generate a triplane representation for a 3D object.
arXiv Detail & Related papers (2024-03-17T17:04:45Z)
- TPA3D: Triplane Attention for Fast Text-to-3D Generation [28.33270078863519]
We propose Triplane Attention for text-guided 3D generation (TPA3D).
TPA3D is an end-to-end trainable GAN-based deep learning model for fast text-to-3D generation.
We show that TPA3D generates high-quality 3D textured shapes aligned with fine-grained descriptions.
arXiv Detail & Related papers (2023-12-05T10:39:37Z)
- Spice-E: Structural Priors in 3D Diffusion using Cross-Entity Attention [9.52027244702166]
Spice-E is a neural network that adds structural guidance to 3D diffusion models.
We show that our approach supports a variety of applications, including 3D stylization, semantic shape editing and text-conditional abstraction-to-3D.
arXiv Detail & Related papers (2023-11-29T17:36:49Z)
- ET3D: Efficient Text-to-3D Generation via Multi-View Distillation [11.520777124553195]
We present an efficient text-to-3D generation method that requires only around 8 ms to generate a 3D asset from a text prompt on a consumer graphics card.
Our method requires no 3D training data and provides an alternative approach for efficient text-to-3D generation by distilling pre-trained image diffusion models.
arXiv Detail & Related papers (2023-11-27T06:14:23Z)
- Instant3D: Fast Text-to-3D with Sparse-View Generation and Large Reconstruction Model [68.98311213582949]
We propose Instant3D, a novel method that generates high-quality and diverse 3D assets from text prompts in a feed-forward manner.
Our method can generate diverse 3D assets of high visual quality within 20 seconds, two orders of magnitude faster than previous optimization-based methods.
arXiv Detail & Related papers (2023-11-10T18:03:44Z)
- Progressive Text-to-3D Generation for Automatic 3D Prototyping [83.33407603057618]
We propose a Multi-Scale Triplane Network (MTN) and a new progressive learning strategy.
Our experiment verifies that the proposed method performs favorably against existing methods.
We hope our work paves the way for automatic 3D prototyping via natural language descriptions.
arXiv Detail & Related papers (2023-09-26T01:08:35Z)
- ATT3D: Amortized Text-to-3D Object Synthesis [78.96673650638365]
We amortize optimization over text prompts by training a unified model on many prompts simultaneously, rather than training on each prompt separately.
Our framework, Amortized Text-to-3D (ATT3D), enables knowledge sharing between prompts, generalizing to unseen setups and interpolating smoothly between text prompts for novel assets and simple animations.
arXiv Detail & Related papers (2023-06-06T17:59:10Z)
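Several of the entries above (Instant3D itself, BrightDreamer's Text-guided Triplane Generator, and TPA3D) build on the triplane representation, in which a 3D point is featurized by projecting it onto three orthogonal feature planes and aggregating the bilinearly sampled features. The sketch below shows that standard lookup; the plane resolution, channel count, and aggregation by summation are conventional choices rather than details taken from any of the papers listed.

```python
import torch
import torch.nn.functional as F

def sample_triplane(planes: torch.Tensor, points: torch.Tensor) -> torch.Tensor:
    # planes: (3, C, H, W) feature maps for the XY, XZ, and YZ planes.
    # points: (N, 3) coordinates, assumed normalized to [-1, 1].
    # Returns (N, C) per-point features.
    coords = torch.stack([
        points[:, [0, 1]],  # projection onto the XY plane
        points[:, [0, 2]],  # projection onto the XZ plane
        points[:, [1, 2]],  # projection onto the YZ plane
    ])                                                        # (3, N, 2)
    grid = coords.unsqueeze(1)                                # (3, 1, N, 2)
    feats = F.grid_sample(planes, grid, align_corners=True)  # (3, C, 1, N)
    return feats.squeeze(2).sum(dim=0).transpose(0, 1)       # (N, C)

# Example: query 1024 random points against 32-channel, 128x128 planes.
planes = torch.randn(3, 32, 128, 128)
points = torch.rand(1024, 3) * 2 - 1
features = sample_triplane(planes, points)  # (1024, 32)
```

A small decoder (typically an MLP) then maps each feature vector to density and color. This is what makes the triplane attractive for fast generation: the expensive step is producing three 2D feature maps, after which per-point queries reduce to cheap bilinear lookups.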