Large-Vocabulary 3D Diffusion Model with Transformer
- URL: http://arxiv.org/abs/2309.07920v2
- Date: Fri, 15 Sep 2023 07:56:34 GMT
- Title: Large-Vocabulary 3D Diffusion Model with Transformer
- Authors: Ziang Cao, Fangzhou Hong, Tong Wu, Liang Pan, Ziwei Liu
- Abstract summary: We introduce a diffusion-based feed-forward framework for synthesizing massive categories of real-world 3D objects with a single generative model.
We propose DiffTF, a novel triplane-based 3D-aware Diffusion model with TransFormer, which addresses these challenges from three aspects.
Experiments on ShapeNet and OmniObject3D convincingly demonstrate that a single DiffTF model achieves state-of-the-art large-vocabulary 3D object generation performance.
- Score: 57.076986347047
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Creating diverse and high-quality 3D assets with an automatic generative
model is highly desirable. Despite extensive efforts on 3D generation, most
existing works focus on the generation of a single category or a few
categories. In this paper, we introduce a diffusion-based feed-forward
framework for synthesizing massive categories of real-world 3D objects with a
single generative model. Notably, there are three major challenges for this
large-vocabulary 3D generation: a) the need for expressive yet efficient 3D
representation; b) large diversity in geometry and texture across categories;
c) complexity in the appearances of real-world objects. To this end, we propose
DiffTF, a novel triplane-based 3D-aware Diffusion model with TransFormer, which
addresses these challenges from three aspects. 1) Considering efficiency and
robustness, we adopt a revised triplane representation and improve the fitting
speed and accuracy. 2) To handle the drastic variations in geometry and
texture, we regard the features of all 3D objects as a combination of
generalized 3D knowledge and specialized 3D features. To extract generalized 3D
knowledge from diverse categories, we propose a novel 3D-aware transformer with
shared cross-plane attention. It learns the cross-plane relations across
different planes and aggregates the generalized 3D knowledge with specialized
3D features. 3) In addition, we devise the 3D-aware encoder/decoder to enhance
the generalized 3D knowledge in the encoded triplanes for handling categories
with complex appearances. Extensive experiments on ShapeNet and OmniObject3D
(over 200 diverse real-world categories) convincingly demonstrate that a single
DiffTF model achieves state-of-the-art large-vocabulary 3D object generation
performance with large diversity, rich semantics, and high quality.
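To make the abstract's two central mechanisms more concrete, the following is a minimal PyTorch sketch, not the authors' released implementation: (1) querying a triplane representation at 3D points, and (2) a transformer block whose attention module is shared across the three planes, so the cross-plane relations it learns act as generalized 3D knowledge on top of plane-specific features. All module names, tensor shapes, and hyperparameters are illustrative assumptions.

```python
# Minimal sketch (not the paper's code) of (1) triplane feature querying and
# (2) shared cross-plane attention. Shapes and sizes are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F


def query_triplane(planes: torch.Tensor, xyz: torch.Tensor) -> torch.Tensor:
    """Sample per-point features from three axis-aligned feature planes.

    planes: (B, 3, C, H, W)  -- XY, XZ, YZ feature planes
    xyz:    (B, N, 3)        -- query points in [-1, 1]^3
    returns (B, N, C)        -- aggregated (summed) point features
    """
    coords = [xyz[..., [0, 1]], xyz[..., [0, 2]], xyz[..., [1, 2]]]
    feats = []
    for i, uv in enumerate(coords):
        grid = uv.unsqueeze(2)  # (B, N, 1, 2) grid for grid_sample
        f = F.grid_sample(planes[:, i], grid,
                          mode="bilinear", align_corners=True)   # (B, C, N, 1)
        feats.append(f.squeeze(-1).transpose(1, 2))               # (B, N, C)
    return sum(feats)


class SharedCrossPlaneAttention(nn.Module):
    """One attention module reused for all three planes: each plane's tokens
    attend to the tokens of all planes, so the learned cross-plane relations
    are category-agnostic while per-plane tokens keep specialized features."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, plane_tokens: torch.Tensor) -> torch.Tensor:
        # plane_tokens: (B, 3, L, C) -- L tokens per plane
        B, P, L, C = plane_tokens.shape
        out = []
        for p in range(P):
            q = self.norm(plane_tokens[:, p])                    # queries from one plane
            kv = self.norm(plane_tokens.reshape(B, P * L, C))    # keys/values from all planes
            a, _ = self.attn(q, kv, kv)                          # same weights for every plane
            out.append(plane_tokens[:, p] + a)                   # residual connection
        return torch.stack(out, dim=1)


if __name__ == "__main__":
    planes = torch.randn(2, 3, 32, 64, 64)
    points = torch.rand(2, 1024, 3) * 2 - 1
    print(query_triplane(planes, points).shape)        # torch.Size([2, 1024, 32])

    tokens = torch.randn(2, 3, 16 * 16, 256)
    print(SharedCrossPlaneAttention()(tokens).shape)   # torch.Size([2, 3, 256, 256])
```

Sharing a single attention module across the planes is what allows the relations it captures to generalize across categories, while each plane's tokens retain the specialized geometry and texture features of the object being modeled.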
Related papers
- 3D-Adapter: Geometry-Consistent Multi-View Diffusion for High-Quality 3D Generation [45.218605449572586]
3D-Adapter is a plug-in module designed to infuse 3D geometry awareness into pretrained image diffusion models.
We show that 3D-Adapter greatly enhances the geometry quality of text-to-multi-view models such as Instant3D and Zero123++.
We also showcase the broad application potential of 3D-Adapter by presenting high quality results in text-to-3D, image-to-3D, text-to-texture, and text-to-avatar tasks.
arXiv Detail & Related papers (2024-10-24T17:59:30Z) - DIRECT-3D: Learning Direct Text-to-3D Generation on Massive Noisy 3D Data [50.164670363633704]
We present DIRECT-3D, a diffusion-based 3D generative model for creating high-quality 3D assets from text prompts.
Our model is trained directly on extensive noisy and unaligned 'in-the-wild' 3D assets.
We achieve state-of-the-art performance in both single-class generation and text-to-3D generation.
arXiv Detail & Related papers (2024-06-06T17:58:15Z) - DiffTF++: 3D-aware Diffusion Transformer for Large-Vocabulary 3D Generation [53.20147419879056]
We introduce a diffusion-based feed-forward framework to address the challenges of large-vocabulary 3D generation with a single model.
Building upon our 3D-aware Diffusion model with TransFormer, we propose a stronger version for 3D generation, i.e., DiffTF++.
Experiments on ShapeNet and OmniObject3D convincingly demonstrate the effectiveness of our proposed modules.
arXiv Detail & Related papers (2024-05-13T17:59:51Z) - Sherpa3D: Boosting High-Fidelity Text-to-3D Generation via Coarse 3D
Prior [52.44678180286886]
2D diffusion models offer a distillation approach that achieves excellent generalization and rich details without any 3D data.
We propose Sherpa3D, a new text-to-3D framework that achieves high-fidelity, generalizability, and geometric consistency simultaneously.
arXiv Detail & Related papers (2023-12-11T18:59:18Z) - Pushing the Limits of 3D Shape Generation at Scale [65.24420181727615]
We present a significant breakthrough in 3D shape generation by scaling it to unprecedented dimensions.
We have developed Argus-3D, a model with an astounding 3.6 billion trainable parameters, making it the largest 3D shape generation model to date.
arXiv Detail & Related papers (2023-06-20T13:01:19Z) - OmniObject3D: Large-Vocabulary 3D Object Dataset for Realistic
Perception, Reconstruction and Generation [107.71752592196138]
We propose OmniObject3D, a large vocabulary 3D object dataset with massive high-quality real-scanned 3D objects.
It comprises 6,000 scanned objects in 190 daily categories, sharing common classes with popular 2D datasets.
Each 3D object is captured with both 2D and 3D sensors, providing textured meshes, point clouds, multiview rendered images, and multiple real-captured videos.
arXiv Detail & Related papers (2023-01-18T18:14:18Z)