TextField3D: Towards Enhancing Open-Vocabulary 3D Generation with Noisy Text Fields
- URL: http://arxiv.org/abs/2309.17175v2
- Date: Thu, 14 Mar 2024 07:36:29 GMT
- Title: TextField3D: Towards Enhancing Open-Vocabulary 3D Generation with Noisy Text Fields
- Authors: Tianyu Huang, Yihan Zeng, Bowen Dong, Hang Xu, Songcen Xu, Rynson W. H. Lau, Wangmeng Zuo
- Abstract summary: We introduce a conditional 3D generative model, namely TextField3D.
Rather than using the text prompts as input directly, we suggest injecting dynamic noise into the latent space of given text prompts.
To guide the conditional generation in both geometry and texture, multi-modal discrimination is constructed with a text-3D discriminator and a text-2.5D discriminator.
- Score: 98.62319447738332
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent works learn 3D representation explicitly under text-3D guidance. However, limited text-3D data restricts the vocabulary scale and text control of generations. Generators may easily fall into a stereotype concept for certain text prompts, thus losing open-vocabulary generation ability. To tackle this issue, we introduce a conditional 3D generative model, namely TextField3D. Specifically, rather than using the text prompts as input directly, we suggest injecting dynamic noise into the latent space of given text prompts, i.e., Noisy Text Fields (NTFs). In this way, limited 3D data can be mapped to the appropriate range of textual latent space that is expanded by NTFs. To this end, an NTFGen module is proposed to model general text latent code in noisy fields. Meanwhile, an NTFBind module is proposed to align view-invariant image latent code to noisy fields, further supporting image-conditional 3D generation. To guide the conditional generation in both geometry and texture, multi-modal discrimination is constructed with a text-3D discriminator and a text-2.5D discriminator. Compared to previous methods, TextField3D includes three merits: 1) large vocabulary, 2) text consistency, and 3) low latency. Extensive experiments demonstrate that our method achieves a potential open-vocabulary 3D generation capability.
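To make the core idea concrete, the following is a minimal, hypothetical sketch of the noisy-text-field conditioning described in the abstract: a text latent code is perturbed with dynamically sampled Gaussian noise before it conditions the 3D generator, so a single prompt maps to a neighborhood of latent codes rather than one point. The function name `noisy_text_field`, the noise-scale range, and the use of PyTorch with CLIP-style embeddings are illustrative assumptions, not the paper's actual NTFGen implementation.

```python
# Sketch of the "noisy text field" idea: perturb a text embedding with
# dynamically sampled Gaussian noise so that limited 3D data covers a wider
# region of the textual latent space. All names and values here are
# illustrative placeholders, not the authors' NTFGen module.
import torch

def noisy_text_field(text_latent: torch.Tensor,
                     sigma_min: float = 0.0,
                     sigma_max: float = 0.5) -> torch.Tensor:
    """Return the text latent code perturbed by a randomly scaled noise field.

    text_latent: (batch, dim) embedding from a text encoder, e.g. CLIP.
    The noise scale is resampled on every call ("dynamic"), so the same
    prompt yields a neighborhood of conditioning codes rather than one point.
    """
    # Per-example noise magnitude drawn uniformly from [sigma_min, sigma_max].
    sigma = torch.empty(text_latent.shape[0], 1,
                        device=text_latent.device).uniform_(sigma_min, sigma_max)
    noise = torch.randn_like(text_latent) * sigma
    return text_latent + noise

# Usage: the perturbed code would condition the 3D generator in place of the
# raw text embedding.
text_latent = torch.randn(4, 512)          # stand-in for CLIP text features
conditioning = noisy_text_field(text_latent)
```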
Related papers
- GaussianAnything: Interactive Point Cloud Latent Diffusion for 3D Generation [75.39457097832113]
This paper introduces a novel 3D generation framework, offering scalable, high-quality 3D generation with an interactive Point Cloud-structured Latent space.
Our framework employs a Variational Autoencoder with multi-view posed RGB-D(epth)-N(ormal) renderings as input, using a unique latent space design that preserves 3D shape information.
The proposed method, GaussianAnything, supports multi-modal conditional 3D generation, allowing for point cloud, caption, and single/multi-view image inputs.
arXiv Detail & Related papers (2024-11-12T18:59:32Z)
- SeMv-3D: Towards Semantic and Mutil-view Consistency simultaneously for General Text-to-3D Generation with Triplane Priors [115.66850201977887]
We propose SeMv-3D, a novel framework for general text-to-3D generation.
We propose a Triplane Prior Learner that learns triplane priors with 3D spatial features to maintain consistency among different views at the 3D level.
We also design a Semantic-aligned View Synthesizer that preserves the alignment between 3D spatial features and textual semantics in latent space.
arXiv Detail & Related papers (2024-10-10T07:02:06Z)
- WordRobe: Text-Guided Generation of Textured 3D Garments [30.614451083408266]
"WordRobe" is a novel framework for the generation of unposed & textured 3D garment meshes from user-friendly text prompts.
We demonstrate superior performance over current SOTAs for learning 3D garment latent space, garment synthesis, and text-driven texture synthesis.
arXiv Detail & Related papers (2024-03-26T09:44:34Z)
- Text-to-3D Shape Generation [18.76771062964711]
Computational systems that can perform text-to-3D shape generation have captivated the popular imagination.
We provide a survey of the underlying technology and methods enabling text-to-3D shape generation to summarize the background literature.
We then derive a systematic categorization of recent work on text-to-3D shape generation based on the type of supervision data required.
arXiv Detail & Related papers (2024-03-20T04:03:44Z)
- HyperSDFusion: Bridging Hierarchical Structures in Language and Geometry for Enhanced 3D Text2Shape Generation [55.95329424826433]
We propose HyperSDFusion, a dual-branch diffusion model that generates 3D shapes from a given text.
We learn the hierarchical representations of text and 3D shapes in hyperbolic space.
Our method is the first to explore the hyperbolic hierarchical representation for text-to-shape generation.
arXiv Detail & Related papers (2024-03-01T08:57:28Z)
- Learning Continuous 3D Words for Text-to-Image Generation [44.210565557606465]
We present an approach for allowing users of text-to-image models to have fine-grained control of several attributes in an image.
Our method is capable of conditioning image creation with multiple Continuous 3D Words and text descriptions simultaneously.
arXiv Detail & Related papers (2024-02-13T18:34:10Z)
- Towards High-Fidelity Text-Guided 3D Face Generation and Manipulation Using only Images [105.92311979305065]
TG-3DFace creates more realistic and aesthetically pleasing 3D faces, improving multi-view consistency (MVIC) by 9% over Latent3D.
The rendered face images generated by TG-3DFace achieve higher FID and CLIP score than text-to-2D face/image generation models.
arXiv Detail & Related papers (2023-08-31T14:26:33Z)
- Text2NeRF: Text-Driven 3D Scene Generation with Neural Radiance Fields [29.907615852310204]
We present Text2NeRF, which is able to generate a wide range of 3D scenes purely from a text prompt.
Our method requires no additional training data but only a natural language description of the scene as the input.
arXiv Detail & Related papers (2023-05-19T10:58:04Z)
- 3D-TOGO: Towards Text-Guided Cross-Category 3D Object Generation [107.46972849241168]
The 3D-TOGO model generates 3D objects in the form of neural radiance fields with good texture.
Experiments on the largest 3D object dataset (i.e., ABO) are conducted to verify that 3D-TOGO generates higher-quality 3D objects.
arXiv Detail & Related papers (2022-12-02T11:31:49Z)