Chasing Consistency in Text-to-3D Generation from a Single Image
- URL: http://arxiv.org/abs/2309.03599v1
- Date: Thu, 7 Sep 2023 09:50:48 GMT
- Title: Chasing Consistency in Text-to-3D Generation from a Single Image
- Authors: Yichen Ouyang, Wenhao Chai, Jiayi Ye, Dapeng Tao, Yibing Zhan, Gaoang Wang
- Abstract summary: We present Consist3D, a three-stage framework for semantic-, geometric-, and saturation-consistent text-to-3D generation from a single image.
Specifically, the semantic encoding stage learns a token independent of views and estimations, promoting semantic consistency and robustness.
The geometric encoding stage learns another token with comprehensive geometry and reconstruction constraints under novel-view estimations, reducing overfitting and encouraging geometric consistency.
- Score: 35.60887743544786
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-3D generation from a single-view image is a popular but challenging
task in 3D vision. Although numerous methods have been proposed, existing works
still suffer from inconsistency issues, including 1) semantic
inconsistency, 2) geometric inconsistency, and 3) saturation inconsistency,
resulting in distorted, overfitted, and over-saturated generations. In light of
the above issues, we present Consist3D, a three-stage framework for
semantic-, geometric-, and saturation-consistent text-to-3D generation from a
single image, in which the first two stages aim to learn parameterized
consistency tokens, and the last stage is for optimization. Specifically, the
semantic encoding stage learns a token independent of views and estimations,
promoting semantic consistency and robustness. Meanwhile, the geometric
encoding stage learns another token with comprehensive geometry and
reconstruction constraints under novel-view estimations, reducing overfitting
and encouraging geometric consistency. Finally, the optimization stage benefits
from the semantic and geometric tokens, allowing a low classifier-free guidance
scale and therefore preventing oversaturation. Experimental results demonstrate
that Consist3D produces more consistent, faithful, and photo-realistic 3D
assets compared to previous state-of-the-art methods. Furthermore, Consist3D
also allows background and object editing through text prompts.
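The oversaturation claim turns on how classifier-free guidance works. Below is a minimal sketch of the standard CFG update used by diffusion samplers, assuming a generic noise-prediction interface; the function and argument names are illustrative, not Consist3D's actual API.

```python
import torch

def cfg_eps(unet, z_t, t, cond_emb, uncond_emb, scale: float):
    """Classifier-free guidance: extrapolate from the unconditional noise
    prediction toward the conditional one by a factor `scale`.
    `unet` is any frozen noise predictor eps(z_t, t, emb); all names here
    are placeholders, not the paper's interface."""
    eps_uncond = unet(z_t, t, uncond_emb)  # prediction without the text prompt
    eps_cond = unet(z_t, t, cond_emb)      # prediction with the text prompt
    # scale = 1 recovers the plain conditional estimate. Text-to-3D
    # optimization often needs scale ~ 100, which drives renders toward
    # over-saturated colors; the abstract's claim is that its learned
    # tokens make a low scale (closer to the ~7.5 common in 2D sampling)
    # sufficient.
    return eps_uncond + scale * (eps_cond - eps_uncond)
```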
Related papers
- SeMv-3D: Towards Semantic and Multi-view Consistency simultaneously for General Text-to-3D Generation with Triplane Priors [115.66850201977887]
We propose SeMv-3D, a novel framework for general text-to-3D generation.
We propose a Triplane Prior Learner that learns triplane priors with 3D spatial features to maintain consistency among different views at the 3D level.
We also design a Semantic-aligned View Synthesizer that preserves the alignment between 3D spatial features and textual semantics in latent space.
arXiv Detail & Related papers (2024-10-10T07:02:06Z)
- Geometry-Aware Score Distillation via 3D Consistent Noising and Gradient Consistency Modeling [31.945761751215134]
We introduce 3D consistent noising, geometry-based gradient warping, and a novel gradient consistency loss (a generic sketch of the score-distillation loop these terms modify follows this entry).
We successfully address the geometric inconsistency problems in the text-to-3D generation task at minimal cost while remaining compatible with existing score distillation-based models.
arXiv Detail & Related papers (2024-06-24T14:58:17Z)
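As a reference point for the entry above, here is a minimal, generic sketch of one DreamFusion-style score-distillation (SDS) step, the loop that 3D-consistent noising and gradient-consistency terms modify. `renderer` is a hypothetical differentiable renderer, `unet` a frozen 2D diffusion noise predictor, and the usual w(t) weighting is omitted; this is not any listed paper's actual implementation.

```python
import torch

def sds_step(unet, renderer, params, camera, text_emb, uncond_emb,
             alphas_cumprod, guidance_scale: float = 100.0):
    """One generic SDS step: render, diffuse, denoise with guidance, and
    route the residual back into the 3D parameters (a sketch, not a
    specific paper's code)."""
    x = renderer(params, camera)                   # rendered view, (1, 3, H, W)
    t = torch.randint(20, 981, (1,))               # random diffusion timestep
    noise = torch.randn_like(x)                    # i.i.d. per view; making this
                                                   # 3D-consistent across cameras
                                                   # is the entry's first idea
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    x_t = a.sqrt() * x + (1.0 - a).sqrt() * noise  # forward-diffuse the render

    with torch.no_grad():                          # the 2D prior stays frozen
        eps_u = unet(x_t, t, uncond_emb)
        eps_c = unet(x_t, t, text_emb)
        eps = eps_u + guidance_scale * (eps_c - eps_u)  # classifier-free guidance

    # Detaching the residual skips the U-Net Jacobian, so backward() on this
    # loss yields the SDS gradient (eps - noise) * dx/dparams.
    return ((eps - noise).detach() * x).sum()
```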
- DreamPolisher: Towards High-Quality Text-to-3D Generation via Geometric Diffusion [25.392909885188676]
We present DreamPolisher, a novel Gaussian Splatting based method with geometric guidance.
We learn cross-view consistency and intricate detail from textual descriptions.
arXiv Detail & Related papers (2024-03-25T22:34:05Z)
- TeMO: Towards Text-Driven 3D Stylization for Multi-Object Meshes [67.5351491691866]
We present a novel framework, dubbed TeMO, to parse multi-object 3D scenes and edit their styles.
Our method synthesizes high-quality stylized content and outperforms existing methods across a wide range of multi-object 3D meshes.
arXiv Detail & Related papers (2023-12-07T12:10:05Z)
- Sculpting Holistic 3D Representation in Contrastive Language-Image-3D Pre-training [51.632418297156605]
We introduce MixCon3D, a method aiming to sculpt holistic 3D representation in contrastive language-image-3D pre-training.
We develop the 3D object-level representation from complementary perspectives, e.g., pairing multi-view rendered images with the point cloud.
Then, MixCon3D performs language-3D contrastive learning, comprehensively depicting real-world 3D objects and bolstering text alignment (a generic sketch of such an objective follows this entry).
arXiv Detail & Related papers (2023-11-03T06:05:36Z)
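The language-3D contrastive step described above follows the CLIP recipe at a high level. Below is a generic symmetric InfoNCE sketch, where `obj_emb` stands in, as an assumption, for a fused point-cloud/multi-view object embedding; it is not MixCon3D's exact objective.

```python
import torch
import torch.nn.functional as F

def language_3d_contrastive_loss(obj_emb, text_emb, temperature: float = 0.07):
    """Symmetric InfoNCE between 3D-object and text embeddings (a generic
    CLIP-style sketch). Both inputs are (B, D) with matched rows forming
    the positive pairs."""
    obj_emb = F.normalize(obj_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = obj_emb @ text_emb.t() / temperature   # (B, B) cosine similarities
    targets = torch.arange(obj_emb.size(0))         # diagonal = positive pairs
    # average the object-to-text and text-to-object directions
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```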
- Self-Supervised Geometry-Aware Encoder for Style-Based 3D GAN Inversion [115.82306502822412]
StyleGAN has achieved great progress in 2D face reconstruction and semantic editing via image inversion and latent editing.
A corresponding generic 3D GAN inversion framework is still missing, limiting the applications of 3D face reconstruction and semantic editing.
We study the challenging problem of 3D GAN inversion where a latent code is predicted given a single face image to faithfully recover its 3D shapes and detailed textures.
arXiv Detail & Related papers (2022-12-14T18:49:50Z)
- High-fidelity 3D GAN Inversion by Pseudo-multi-view Optimization [51.878078860524795]
We present a high-fidelity 3D generative adversarial network (GAN) inversion framework that can synthesize photo-realistic novel views.
Our approach enables high-fidelity 3D rendering from a single image, which is promising for various applications of AI-generated 3D content.
arXiv Detail & Related papers (2022-11-28T18:59:52Z)
- Self-Supervised Image Representation Learning with Geometric Set Consistency [50.12720780102395]
We propose a method for self-supervised image representation learning under the guidance of 3D geometric consistency.
Specifically, we introduce 3D geometric consistency into a contrastive learning framework to enforce feature consistency within image views.
arXiv Detail & Related papers (2022-03-29T08:57:33Z)
- Self-Supervised Monocular 3D Face Reconstruction by Occlusion-Aware Multi-view Geometry Consistency [40.56510679634943]
We propose a self-supervised training architecture that leverages multi-view geometry consistency.
We design three novel loss functions for multi-view consistency: a pixel consistency loss, a depth consistency loss, and a facial landmark-based epipolar loss (a generic sketch of such a photometric consistency term follows this list).
Our method is accurate and robust, especially under large variations of expressions, poses, and illumination conditions.
arXiv Detail & Related papers (2020-07-24T12:36:09Z)
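To make the multi-view consistency losses in the last entry concrete, here is a generic photometric (pixel) consistency sketch: back-project view A's pixels with its predicted depth, reproject them into view B, sample B there, and penalize the difference. All names, shapes, and conventions are illustrative assumptions, and occlusion handling, which that paper treats explicitly, is omitted.

```python
import torch
import torch.nn.functional as F

def pixel_consistency_loss(img_a, img_b, depth_a, K, T_ab):
    """Generic photometric multi-view consistency (a sketch, not the
    paper's exact loss). img_a, img_b: (1, 3, H, W); depth_a: (1, 1, H, W);
    K: (3, 3) intrinsics; T_ab: (4, 4) pose mapping A-frame points to B."""
    _, _, H, W = img_a.shape
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).reshape(3, -1)

    cam_a = (torch.linalg.inv(K) @ pix) * depth_a.reshape(1, -1)  # back-project
    cam_a = torch.cat([cam_a, torch.ones(1, H * W)], dim=0)       # homogeneous
    cam_b = (T_ab @ cam_a)[:3]                                    # into B's frame
    proj = K @ cam_b
    uv = proj[:2] / proj[2:].clamp(min=1e-6)                      # perspective divide

    # normalize pixel coordinates to [-1, 1] for grid_sample
    u = uv[0].reshape(H, W) / (W - 1) * 2 - 1
    v = uv[1].reshape(H, W) / (H - 1) * 2 - 1
    grid = torch.stack([u, v], dim=-1).unsqueeze(0)               # (1, H, W, 2)

    img_b_warped = F.grid_sample(img_b, grid, align_corners=True)
    return F.l1_loss(img_b_warped, img_a)  # photometric difference in view A
```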
This list is automatically generated from the titles and abstracts of the papers on this site.