3D-TOGO: Towards Text-Guided Cross-Category 3D Object Generation
- URL: http://arxiv.org/abs/2212.01103v2
- Date: Wed, 16 Aug 2023 07:12:41 GMT
- Title: 3D-TOGO: Towards Text-Guided Cross-Category 3D Object Generation
- Authors: Zutao Jiang, Guansong Lu, Xiaodan Liang, Jihua Zhu, Wei Zhang, Xiaojun
Chang, Hang Xu
- Abstract summary: The 3D-TOGO model generates 3D objects in the form of a neural radiance field with good texture.
Experiments on the largest 3D object dataset (i.e., ABO) are conducted to verify that 3D-TOGO can better generate high-quality 3D objects.
- Score: 107.46972849241168
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-guided 3D object generation aims to generate 3D objects described by
user-defined captions, which offers a flexible way to visualize what we
imagine. Although some works have been devoted to solving this challenging
task, these works either utilize some explicit 3D representations (e.g., mesh),
which lack texture and require post-processing for rendering photo-realistic
views; or require individual time-consuming optimization for every single case.
Here, we make the first attempt to achieve generic text-guided cross-category
3D object generation via a new 3D-TOGO model, which integrates a text-to-views
generation module and a views-to-3D generation module. The text-to-views
generation module is designed to generate different views of the target 3D
object given an input caption. Prior-guidance, caption-guidance and view
contrastive learning are proposed to achieve better view-consistency and
caption similarity. Meanwhile, a pixelNeRF model is adopted for the views-to-3D
generation module to obtain the implicit 3D neural representation from the
previously-generated views. Our 3D-TOGO model generates 3D objects in the form
of a neural radiance field with good texture and requires no time-consuming
optimization for each individual caption. Besides, 3D-TOGO can control the
category, color and shape of generated 3D objects with the input caption.
Extensive experiments on the largest 3D object dataset (i.e., ABO) are
conducted to verify that 3D-TOGO can better generate high-quality 3D objects
according to the input captions across 98 different categories, in terms of
PSNR, SSIM, LPIPS and CLIP-score, compared with text-NeRF and Dreamfields.
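
The abstract describes a two-stage pipeline: a text-to-views module that turns a caption into several views of the target object, followed by a pixelNeRF-style views-to-3D module that lifts those views into an implicit radiance field. The snippet below is a minimal sketch of that data flow only; every module name, layer size, and tensor shape (TextToViews, ViewsTo3D, the 512-dimensional caption embedding, etc.) is an illustrative assumption and not the authors' implementation.

```python
# Hedged sketch of the two-stage pipeline outlined in the abstract:
# (1) caption embedding -> multiple views of the object,
# (2) views + 3D query points -> RGB/density samples of a radiance field.
import torch
import torch.nn as nn


class TextToViews(nn.Module):
    """Stand-in for the text-to-views generator (caption embedding -> N views)."""

    def __init__(self, text_dim: int = 512, num_views: int = 8, image_size: int = 64):
        super().__init__()
        self.num_views = num_views
        self.image_size = image_size
        # A single linear head is a placeholder for the real image generator.
        self.decode = nn.Linear(text_dim, num_views * 3 * image_size * image_size)

    def forward(self, text_embedding: torch.Tensor) -> torch.Tensor:
        views = self.decode(text_embedding)
        return views.view(-1, self.num_views, 3, self.image_size, self.image_size)


class ViewsTo3D(nn.Module):
    """Stand-in for a pixelNeRF-style module: views + query points -> RGB + density."""

    def __init__(self, feat_dim: int = 64):
        super().__init__()
        self.encode_views = nn.Conv2d(3, feat_dim, kernel_size=3, padding=1)
        self.field = nn.Sequential(
            nn.Linear(3 + feat_dim, 128), nn.ReLU(), nn.Linear(128, 4)  # RGB + sigma
        )

    def forward(self, views: torch.Tensor, query_points: torch.Tensor) -> torch.Tensor:
        b, n, c, h, w = views.shape
        feats = self.encode_views(views.flatten(0, 1))            # (b*n, feat, h, w)
        pooled = feats.mean(dim=(0, 2, 3)).expand(query_points.shape[0], -1)
        return self.field(torch.cat([query_points, pooled], dim=-1))


if __name__ == "__main__":
    text_embedding = torch.randn(1, 512)         # e.g. a CLIP-style caption embedding
    query_points = torch.randn(1024, 3)          # 3D points sampled along camera rays
    views = TextToViews()(text_embedding)        # stage 1: caption -> multi-view images
    rgb_sigma = ViewsTo3D()(views, query_points) # stage 2: views -> radiance field samples
    print(views.shape, rgb_sigma.shape)
```

In the paper's actual setup, the first stage would be a full view generator trained with prior-guidance, caption-guidance and view contrastive learning, and the second stage a pixelNeRF conditioned on per-view image features along camera rays; the stubs above only mirror the interfaces.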
Related papers
- View Selection for 3D Captioning via Diffusion Ranking [54.78058803763221]
The Cap3D method renders 3D objects into 2D views for captioning with pre-trained models.
Some rendered views of 3D objects are atypical, deviating from the training data of standard image captioning models and causing hallucinations.
We present DiffuRank, a method that leverages a pre-trained text-to-3D model to assess the alignment between 3D objects and their 2D rendered views.
arXiv Detail & Related papers (2024-04-11T17:58:11Z) - Weakly-Supervised 3D Scene Graph Generation via Visual-Linguistic Assisted Pseudo-labeling [9.440800948514449]
We propose a weakly-supervised 3D scene graph generation method via Visual-Linguistic Assisted Pseudo-labeling.
Our 3D-VLAP exploits the superior ability of current large-scale visual-linguistic models to align the semantics between texts and 2D images.
We design an edge self-attention based graph neural network to generate scene graphs of 3D point cloud scenes.
arXiv Detail & Related papers (2024-04-03T07:30:09Z) - ViewDiff: 3D-Consistent Image Generation with Text-to-Image Models [65.22994156658918]
We present a method that learns to generate multi-view images in a single denoising process from real-world data.
We design an autoregressive generation scheme that renders more 3D-consistent images at any viewpoint.
arXiv Detail & Related papers (2024-03-04T07:57:05Z) - PI3D: Efficient Text-to-3D Generation with Pseudo-Image Diffusion [18.82883336156591]
We present PI3D, a framework that fully leverages the pre-trained text-to-image diffusion models' ability to generate high-quality 3D shapes from text prompts in minutes.
PI3D generates a single 3D shape from text in only 3 minutes, and its quality is validated to outperform existing 3D generative models by a large margin.
arXiv Detail & Related papers (2023-12-14T16:04:34Z) - SceneWiz3D: Towards Text-guided 3D Scene Composition [134.71933134180782]
Existing approaches either leverage large text-to-image models to optimize a 3D representation or train 3D generators on object-centric datasets.
We introduce SceneWiz3D, a novel approach to synthesize high-fidelity 3D scenes from text.
arXiv Detail & Related papers (2023-12-13T18:59:30Z) - Guide3D: Create 3D Avatars from Text and Image Guidance [55.71306021041785]
Guide3D is a text-and-image-guided generative model for 3D avatar generation based on diffusion models.
Our framework produces topologically and structurally correct geometry and high-resolution textures.
arXiv Detail & Related papers (2023-08-18T17:55:47Z) - TAPS3D: Text-Guided 3D Textured Shape Generation from Pseudo Supervision [114.56048848216254]
We present a novel framework, TAPS3D, to train a text-guided 3D shape generator with pseudo captions.
Based on rendered 2D images, we retrieve relevant words from the CLIP vocabulary and construct pseudo captions using templates.
Our constructed captions provide high-level semantic supervision for generated 3D shapes.
arXiv Detail & Related papers (2023-03-23T13:53:16Z)