AToM: Amortized Text-to-Mesh using 2D Diffusion
- URL: http://arxiv.org/abs/2402.00867v1
- Date: Thu, 1 Feb 2024 18:59:56 GMT
- Title: AToM: Amortized Text-to-Mesh using 2D Diffusion
- Authors: Guocheng Qian, Junli Cao, Aliaksandr Siarohin, Yash Kant, Chaoyang
Wang, Michael Vasilkovsky, Hsin-Ying Lee, Yuwei Fang, Ivan Skorokhodov, Peiye
Zhuang, Igor Gilitschenski, Jian Ren, Bernard Ghanem, Kfir Aberman, Sergey
Tulyakov
- Abstract summary: Amortized Text-to-Mesh (AToM) is a feed-forward framework optimized across multiple text prompts simultaneously.
AToM directly generates high-quality textured meshes in less than 1 second, with around a 10-fold reduction in training cost.
AToM significantly outperforms state-of-the-art amortized approaches with over 4 times higher accuracy.
- Score: 107.02696990299032
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We introduce Amortized Text-to-Mesh (AToM), a feed-forward text-to-mesh
framework optimized across multiple text prompts simultaneously. In contrast to
existing text-to-3D methods that often entail time-consuming per-prompt
optimization and commonly output representations other than polygonal meshes,
AToM directly generates high-quality textured meshes in less than 1 second, with
around a 10-fold reduction in training cost, and generalizes to unseen
prompts. Our key idea is a novel triplane-based text-to-mesh architecture with
a two-stage amortized optimization strategy that ensures stable training and
enables scalability. Through extensive experiments on various prompt
benchmarks, AToM significantly outperforms state-of-the-art amortized
approaches, with over 4 times higher accuracy on the DF415 dataset, and produces
more distinguishable and higher-quality 3D outputs. AToM demonstrates strong
generalizability, offering fine-grained 3D assets for unseen interpolated
prompts without further optimization during inference, unlike per-prompt
solutions.
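To make the key idea concrete, below is a minimal, hypothetical sketch of the two ingredients the abstract names: a text-conditioned triplane representation and amortized optimization over many prompts in a two-stage schedule. All names here (TriplaneGenerator, query_triplane, placeholder_loss, prompt_embs) are illustrative assumptions, not AToM's released code; a faithful pipeline would render the triplane volumetrically in stage one, extract and rasterize a textured mesh in stage two, and supervise renders with a score-distillation loss from a frozen 2D diffusion model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TriplaneGenerator(nn.Module):
    """Maps a text embedding to three axis-aligned feature planes."""
    def __init__(self, text_dim=512, plane_ch=32, res=64):
        super().__init__()
        self.ch, self.res = plane_ch, res
        self.net = nn.Sequential(
            nn.Linear(text_dim, 1024), nn.SiLU(),
            nn.Linear(1024, 3 * plane_ch * res * res),
        )

    def forward(self, text_emb):                      # (B, text_dim)
        p = self.net(text_emb)
        return p.view(-1, 3, self.ch, self.res, self.res)

def query_triplane(planes, xyz):
    """Project 3D points onto the XY/XZ/YZ planes and sum the sampled features."""
    feats = 0.0
    for i, dims in enumerate([(0, 1), (0, 2), (1, 2)]):
        uv = xyz[..., dims].unsqueeze(1)              # (B, 1, N, 2) in [-1, 1]
        f = F.grid_sample(planes[:, i], uv, align_corners=True)
        feats = feats + f.squeeze(2).transpose(1, 2)  # (B, N, C)
    return feats

def placeholder_loss(feats):
    # Stand-in for rendering plus score distillation from a 2D diffusion prior.
    return feats.square().mean()

gen = TriplaneGenerator()
opt = torch.optim.Adam(gen.parameters(), lr=1e-4)
prompt_embs = torch.randn(100, 512)                   # stand-in text embeddings

for stage in ("volumetric", "mesh_refinement"):       # two-stage schedule
    for _ in range(10):
        batch = prompt_embs[torch.randint(0, 100, (4,))]
        planes = gen(batch)                           # one shared network for
        pts = torch.rand(4, 1024, 3) * 2 - 1          # all prompts (amortized)
        loss = placeholder_loss(query_triplane(planes, pts))
        opt.zero_grad(); loss.backward(); opt.step()
```

The amortized property is the single generator shared by all prompts: at inference, an unseen prompt costs one forward pass through the shared network rather than a fresh per-prompt optimization, which is what makes sub-second generation possible.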
Related papers
- OrientDream: Streamlining Text-to-3D Generation with Explicit Orientation Control [66.03885917320189]
OrientDream is a camera orientation conditioned framework for efficient and multi-view consistent 3D generation from textual prompts.
Our strategy centers on conditioning the pre-training of a 2D text-to-image diffusion module on an explicit camera orientation feature.
Our experiments reveal that our method not only produces high-quality NeRF models with consistent multi-view properties but also optimizes significantly faster than existing methods.
arXiv Detail & Related papers (2024-06-14T13:16:18Z)
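The OrientDream entry above hinges on conditioning a 2D diffusion denoiser on explicit camera orientation. Below is a generic, hypothetical sketch of one common way to inject such a signal, adding a camera-azimuth embedding alongside the timestep embedding; it illustrates the idea only and is not OrientDream's published architecture.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_embedding(x, dim=128):
    """Encode scalars (timesteps or angles) with sin/cos frequencies."""
    freqs = torch.exp(torch.linspace(0, math.log(1000), dim // 2))
    angles = x[:, None] * freqs[None, :]
    return torch.cat([angles.sin(), angles.cos()], dim=-1)

class OrientationConditionedDenoiser(nn.Module):
    def __init__(self, ch=64, emb_dim=128):
        super().__init__()
        self.inp = nn.Conv2d(3, ch, 3, padding=1)
        self.emb = nn.Linear(emb_dim * 2, ch)   # fuse time + orientation
        self.out = nn.Conv2d(ch, 3, 3, padding=1)

    def forward(self, x, t, azimuth):
        # Concatenate timestep and camera-azimuth embeddings, then add the
        # fused vector to every spatial location of the feature map.
        cond = torch.cat([sinusoidal_embedding(t),
                          sinusoidal_embedding(azimuth)], dim=-1)
        h = self.inp(x) + self.emb(cond)[:, :, None, None]
        return self.out(torch.relu(h))

model = OrientationConditionedDenoiser()
x = torch.randn(2, 3, 32, 32)
t = torch.tensor([10.0, 500.0])
az = torch.tensor([0.0, math.pi / 2])   # camera azimuth in radians
print(model(x, t, az).shape)            # torch.Size([2, 3, 32, 32])
```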
- Grounded Compositional and Diverse Text-to-3D with Pretrained Multi-View Diffusion Model [65.58911408026748]
We propose Grounded-Dreamer to generate 3D assets that can accurately follow complex, compositional text prompts.
We first advocate leveraging text-guided 4-view images as the bottleneck in the text-to-3D pipeline.
We then introduce an attention refocusing mechanism to encourage text-aligned 4-view image generation.
arXiv Detail & Related papers (2024-04-28T04:05:10Z)
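The attention refocusing mechanism mentioned in the Grounded-Dreamer entry above is, in the general recipe popularized by methods like Attend-and-Excite, a small optimization on the latent that boosts cross-attention mass on neglected text tokens. The sketch below implements that generic recipe under toy shapes; attn_fn, token_ids, and the exact loss form are assumptions, not Grounded-Dreamer's published procedure.

```python
import torch

def refocus_loss(attn, token_ids):
    # attn: (heads, num_pixels, num_tokens) cross-attention probabilities.
    # Penalize the weakest-attended target token so it gains attention.
    per_token_max = attn.mean(0).max(dim=0).values   # (num_tokens,)
    target = per_token_max[token_ids]
    return (1.0 - target).max()

def refocus_step(latent, attn_fn, token_ids, lr=0.1):
    """One gradient step on the latent to rebalance cross-attention."""
    latent = latent.detach().requires_grad_(True)
    loss = refocus_loss(attn_fn(latent), token_ids)
    loss.backward()
    return (latent - lr * latent.grad).detach()

# Toy usage: a stand-in attn_fn mapping a latent to attention maps.
proj = torch.nn.Linear(16, 8 * 64 * 4)
attn_fn = lambda z: proj(z).view(8, 64, 4).softmax(-1)
z = torch.randn(16)
z = refocus_step(z, attn_fn, token_ids=[1, 3])
```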
- LATTE3D: Large-scale Amortized Text-To-Enhanced3D Synthesis [76.43669909525488]
We introduce LATTE3D to achieve fast, high-quality text-to-3D generation on a significantly larger prompt set.
LATTE3D generates 3D objects in 400ms, and can be further enhanced with fast test-time optimization.
arXiv Detail & Related papers (2024-03-22T17:59:37Z)
- Instant3D: Fast Text-to-3D with Sparse-View Generation and Large Reconstruction Model [68.98311213582949]
We propose Instant3D, a novel method that generates high-quality and diverse 3D assets from text prompts in a feed-forward manner.
Our method can generate diverse 3D assets of high visual quality within 20 seconds, two orders of magnitude faster than previous optimization-based methods.
arXiv Detail & Related papers (2023-11-10T18:03:44Z)
- ATT3D: Amortized Text-to-3D Object Synthesis [78.96673650638365]
We amortize optimization over text prompts by training on many prompts simultaneously with a unified model, instead of separately.
Our framework - Amortized text-to-3D (ATT3D) - enables knowledge-sharing between prompts to generalize to unseen setups and to smooth interpolations between text for novel assets and simple animations.
arXiv Detail & Related papers (2023-06-06T17:59:10Z)
- DWRSeg: Rethinking Efficient Acquisition of Multi-scale Contextual Information for Real-time Semantic Segmentation [10.379708894083217]
We propose a highly efficient multi-scale feature extraction method, which decomposes the original single-step method into two steps: Region Residualization and Semantic Residualization.
We achieve an mIoU of 72.7% on the Cityscapes test set at a speed of 319.5 FPS on one NVIDIA GeForce GTX 1080 Ti card, exceeding the latest methods by 69.5 FPS and 0.8% mIoU.
arXiv Detail & Related papers (2022-12-02T13:55:41Z)
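The DWRSeg entry describes a concrete structural idea: split one-shot multi-scale feature extraction into two cheaper steps. The block below is a rough sketch of that decomposition, with a plain convolution standing in for region residualization and parallel dilated depthwise convolutions for semantic residualization; the channel widths and dilation rates are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class TwoStepMultiScaleBlock(nn.Module):
    def __init__(self, channels=64, dilations=(1, 3, 5)):
        super().__init__()
        # Step 1: region residualization -- one plain 3x3 conv that produces
        # concise regional features.
        self.region = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
        )
        # Step 2: semantic residualization -- parallel depthwise convs, each
        # with a different dilation, so each branch only performs a simple
        # filtering of the regional features at its own receptive field.
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d,
                      groups=channels, bias=False)
            for d in dilations
        ])
        self.fuse = nn.Sequential(
            nn.Conv2d(channels * len(dilations), channels, 1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        r = self.region(x)
        multi = torch.cat([b(r) for b in self.branches], dim=1)
        return torch.relu(x + self.fuse(multi))  # residual connection

x = torch.randn(1, 64, 128, 128)
print(TwoStepMultiScaleBlock()(x).shape)  # torch.Size([1, 64, 128, 128])
```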
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.