TeMO: Towards Text-Driven 3D Stylization for Multi-Object Meshes
- URL: http://arxiv.org/abs/2312.04248v1
- Date: Thu, 7 Dec 2023 12:10:05 GMT
- Title: TeMO: Towards Text-Driven 3D Stylization for Multi-Object Meshes
- Authors: Xuying Zhang and Bo-Wen Yin and Yuming Chen and Zheng Lin and Yunheng Li and Qibin Hou and Ming-Ming Cheng
- Abstract summary: We present a novel framework, dubbed TeMO, to parse multi-object 3D scenes and edit their styles.
Our method can synthesize high-quality stylized content and outperform existing methods across a wide range of multi-object 3D meshes.
- Score: 67.5351491691866
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent progress in text-driven 3D stylization of a single object has been
considerably promoted by CLIP-based methods. However, the stylization of
multi-object 3D scenes remains impeded because the image-text pairs used to
pre-train CLIP mostly describe a single object. Meanwhile, the local details of
multiple objects are prone to being omitted, since the existing supervision
relies primarily on coarse-grained contrast between image-text pairs. To
overcome these challenges, we present a novel framework, dubbed TeMO, to parse
multi-object 3D scenes and edit their styles under contrast supervision at
multiple levels. We first propose a Decoupled Graph Attention (DGA) module to
distinguishably reinforce the features of 3D surface points. In particular, a
cross-modal graph is constructed to accurately align the object points and the
noun phrases decoupled from the 3D mesh and the textual description. Then, we
develop a Cross-Grained Contrast (CGC) supervision system, where a fine-grained
loss between the words in the textual description and the randomly rendered
images is constructed to complement the coarse-grained loss. Extensive
experiments show that our method can synthesize high-quality stylized content
and outperform existing methods across a wide range of multi-object 3D meshes.
Our code and results will be made publicly available.
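As a rough illustration of the point-to-phrase alignment idea behind the DGA module, the sketch below lets each 3D surface point attend over noun-phrase embeddings parsed from the prompt. The class name, the single-head attention, and the residual update are simplifying assumptions for illustration; the paper's actual graph construction and decoupling scheme are not reproduced here.

```python
# Minimal sketch (not TeMO's released code) of cross-modal attention between
# mesh point features and noun-phrase features parsed from the text prompt.
import torch
import torch.nn as nn

class CrossModalPointAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)   # queries from 3D surface points
        self.k = nn.Linear(dim, dim)   # keys from noun-phrase embeddings
        self.v = nn.Linear(dim, dim)   # values from noun-phrase embeddings

    def forward(self, point_feats, phrase_feats):
        # point_feats:  (N, D) features of sampled surface points
        # phrase_feats: (M, D) embeddings of noun phrases from the prompt
        attn = torch.softmax(
            self.q(point_feats) @ self.k(phrase_feats).t()
            / point_feats.size(-1) ** 0.5,
            dim=-1,
        )                                             # (N, M) point-to-phrase weights
        # Each point aggregates the phrases it is most associated with, which is
        # what allows different objects in one scene to pick up different styles.
        return point_feats + attn @ self.v(phrase_feats)
```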
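The cross-grained supervision can likewise be sketched as a coarse view-to-sentence term plus a fine word-to-patch term. The function names, the cosine-distance form of the coarse loss, and the word-to-best-patch pairing below are assumptions for illustration, not TeMO's actual implementation.

```python
# Minimal sketch of coarse- plus fine-grained CLIP-style supervision, assuming
# standard CLIP image/text encoders provide the embeddings passed in.
import torch
import torch.nn.functional as F

def coarse_grained_loss(view_emb, sentence_emb):
    # view_emb:     (V, D) CLIP embeddings of randomly rendered views
    # sentence_emb: (D,)   CLIP embedding of the whole textual description
    view_emb = F.normalize(view_emb, dim=-1)
    sentence_emb = F.normalize(sentence_emb, dim=-1)
    # Push every rendered view toward the full prompt (cosine distance).
    return (1.0 - view_emb @ sentence_emb).mean()

def fine_grained_loss(patch_emb, word_emb):
    # patch_emb: (P, D) patch-level features of one rendered view
    # word_emb:  (W, D) word-level features of the textual description
    patch_emb = F.normalize(patch_emb, dim=-1)
    word_emb = F.normalize(word_emb, dim=-1)
    sim = word_emb @ patch_emb.t()                  # (W, P) word-to-patch similarity
    best_patch_per_word = sim.max(dim=1).values     # each word grounded somewhere
    return (1.0 - best_patch_per_word).mean()

def cross_grained_loss(view_emb, sentence_emb, patch_emb, word_emb, weight=0.5):
    # Fine-grained term complements the coarse-grained term, as described above.
    return coarse_grained_loss(view_emb, sentence_emb) + weight * fine_grained_loss(patch_emb, word_emb)
```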
Related papers
- SeMv-3D: Towards Semantic and Mutil-view Consistency simultaneously for General Text-to-3D Generation with Triplane Priors [115.66850201977887]
We propose SeMv-3D, a novel framework for general text-to-3D generation.
We propose a Triplane Prior Learner that learns triplane priors with 3D spatial features to maintain consistency among different views at the 3D level.
We also design a Semantic-aligned View Synthesizer that preserves the alignment between 3D spatial features and textual semantics in latent space.
arXiv Detail & Related papers (2024-10-10T07:02:06Z)
- SceneWiz3D: Towards Text-guided 3D Scene Composition [134.71933134180782]
Existing approaches either leverage large text-to-image models to optimize a 3D representation or train 3D generators on object-centric datasets.
We introduce SceneWiz3D, a novel approach to synthesize high-fidelity 3D scenes from text.
arXiv Detail & Related papers (2023-12-13T18:59:30Z)
- GraphDreamer: Compositional 3D Scene Synthesis from Scene Graphs [74.98581417902201]
We propose a novel framework to generate compositional 3D scenes from scene graphs.
By exploiting node and edge information in scene graphs, our method makes better use of the pretrained text-to-image diffusion model.
We conduct both qualitative and quantitative experiments to validate the effectiveness of GraphDreamer.
arXiv Detail & Related papers (2023-11-30T18:59:58Z)
- Sculpting Holistic 3D Representation in Contrastive Language-Image-3D Pre-training [51.632418297156605]
We introduce MixCon3D, a method aiming to sculpt holistic 3D representation in contrastive language-image-3D pre-training.
We develop the 3D object-level representation from complementary perspectives, e.g., multi-view rendered images with the point cloud.
Then, MixCon3D performs language-3D contrastive learning, comprehensively depicting real-world 3D objects and bolstering text alignment.
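A minimal sketch of the object-level language-3D contrast summarized above, assuming a simple additive fusion of multi-view image features with a point-cloud feature and a symmetric InfoNCE loss; the actual MixCon3D encoders, fusion, and loss details may differ.

```python
# Illustrative sketch (under stated assumptions): fuse multi-view image features
# with a point-cloud feature per object, then contrast the result against text.
import torch
import torch.nn.functional as F

def fuse_3d_representation(view_feats, point_feat):
    # view_feats: (B, V, D) features of V rendered views per object
    # point_feat: (B, D)    feature from a point-cloud encoder
    fused = view_feats.mean(dim=1) + point_feat      # simple additive fusion
    return F.normalize(fused, dim=-1)

def language_3d_contrast(fused_3d, text_feat, temperature=0.07):
    # Symmetric InfoNCE between fused 3D object features and their captions.
    text_feat = F.normalize(text_feat, dim=-1)       # (B, D)
    logits = fused_3d @ text_feat.t() / temperature  # (B, B) object-caption scores
    labels = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
```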
arXiv Detail & Related papers (2023-11-03T06:05:36Z)
- Lowis3D: Language-Driven Open-World Instance-Level 3D Scene Understanding [57.47315482494805]
Open-world instance-level scene understanding aims to locate and recognize unseen object categories that are not present in the annotated dataset.
This task is challenging because the model needs to both localize novel 3D objects and infer their semantic categories.
We propose to harness pre-trained vision-language (VL) foundation models that encode extensive knowledge from image-text pairs to generate captions for 3D scenes.
arXiv Detail & Related papers (2023-08-01T07:50:14Z)
- CompoNeRF: Text-guided Multi-object Compositional NeRF with Editable 3D Scene Layout [13.364394556439992]
Text-to-3D generation plays a crucial role in creating editable 3D scenes for AR/VR.
Recent advances have shown promise in merging neural radiance fields (NeRFs) with pre-trained diffusion models for text-to-3D object generation.
We propose a novel framework, dubbed CompoNeRF, by integrating an editable 3D scene layout with object-specific and scene-wide guidance mechanisms.
Our framework achieves up to a 54% improvement on the multi-view CLIP score metric.
arXiv Detail & Related papers (2023-03-24T07:37:09Z)
- Text2Room: Extracting Textured 3D Meshes from 2D Text-to-Image Models [21.622420436349245]
We present Text2Room, a method for generating room-scale textured 3D meshes from a given text prompt as input.
We leverage pre-trained 2D text-to-image models to synthesize a sequence of images from different poses.
In order to lift these outputs into a consistent 3D scene representation, we combine monocular depth estimation with a text-conditioned inpainting model.
arXiv Detail & Related papers (2023-03-21T16:21:02Z)
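The lifting loop described in the Text2Room summary can be sketched at a high level as follows; every callable here (render, inpaint, estimate_depth, backproject) is a hypothetical placeholder standing in for a model or geometry utility, not Text2Room's actual API.

```python
# High-level sketch of the lift-to-3D loop: render a new pose, fill missing
# regions with a text-conditioned inpainting model, estimate monocular depth,
# and back-project the new pixels into the growing scene mesh.
def grow_room_mesh(prompt, poses, render, inpaint, estimate_depth, backproject, mesh):
    for pose in poses:
        color, mask = render(mesh, pose)                    # known content + holes
        color = inpaint(color, mask, prompt)                # 2D model fills the holes
        depth = estimate_depth(color)                       # depth for the new pixels
        mesh = backproject(mesh, color, depth, mask, pose)  # fuse into the scene mesh
    return mesh
```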
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.