ConsDreamer: Advancing Multi-View Consistency for Zero-Shot Text-to-3D Generation
- URL: http://arxiv.org/abs/2504.02316v1
- Date: Thu, 03 Apr 2025 06:43:23 GMT
- Title: ConsDreamer: Advancing Multi-View Consistency for Zero-Shot Text-to-3D Generation
- Authors: Yuan Zhou, Shilong Jin, Litao Hua, Wanjun Lv, Haoran Duan, Jungong Han
- Abstract summary: We propose ConsDreamer, a novel framework that mitigates view bias by refining both the conditional and unconditional terms in the score distillation process. We show that ConsDreamer effectively mitigates the multi-face Janus problem in text-to-3D generation, outperforming existing methods in both visual quality and consistency.
- Score: 46.64928459085584
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in zero-shot text-to-3D generation have revolutionized 3D content creation by enabling direct synthesis from textual descriptions. While state-of-the-art methods leverage 3D Gaussian Splatting with score distillation to enhance multi-view rendering through pre-trained text-to-image (T2I) models, they suffer from inherent view biases in T2I priors. These biases lead to inconsistent 3D generation, particularly manifesting as the multi-face Janus problem, where objects exhibit conflicting features across views. To address this fundamental challenge, we propose ConsDreamer, a novel framework that mitigates view bias by refining both the conditional and unconditional terms in the score distillation process: (1) a View Disentanglement Module (VDM) that eliminates viewpoint biases in conditional prompts by decoupling irrelevant view components and injecting precise camera parameters; and (2) a similarity-based partial order loss that enforces geometric consistency in the unconditional term by aligning cosine similarities with azimuth relationships. Extensive experiments demonstrate that ConsDreamer effectively mitigates the multi-face Janus problem in text-to-3D generation, outperforming existing methods in both visual quality and consistency.
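The similarity-based partial order loss admits a compact sketch. Below is a minimal, hypothetical PyTorch rendering of the idea, assuming per-view feature embeddings and azimuth angles for a batch of rendered views; the function name, the hinge formulation, and the margin parameter are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F


def partial_order_loss(feats: torch.Tensor,
                       azimuths: torch.Tensor,
                       margin: float = 0.0) -> torch.Tensor:
    """Hypothetical hinge-style partial-order loss.

    Encourages view pairs with a smaller azimuth gap to have a higher
    cosine similarity than pairs with a larger gap, so that the ordering
    of similarities mirrors the ordering of azimuth distances.

    feats:    (V, D) per-view feature embeddings.
    azimuths: (V,) azimuth angles in degrees.
    """
    # Pairwise cosine similarities between view features: (V, V).
    f = F.normalize(feats, dim=-1)
    sim = f @ f.t()

    # Pairwise angular distance on the circle, folded into [0, 180].
    diff = (azimuths.unsqueeze(1) - azimuths.unsqueeze(0)).abs() % 360.0
    ang = torch.minimum(diff, 360.0 - diff)

    # Keep each unordered view pair (i < j) once.
    i, j = torch.triu_indices(feats.size(0), feats.size(0), offset=1)
    sims, angs = sim[i, j], ang[i, j]  # both (P,)

    # For pairs-of-pairs (p, q) with angs[p] < angs[q], require
    # sims[p] >= sims[q] + margin; hinge-penalize violations.
    closer = (angs.unsqueeze(1) < angs.unsqueeze(0)).float()  # (P, P)
    violation = F.relu(sims.unsqueeze(0) - sims.unsqueeze(1) + margin)
    return (violation * closer).sum() / closer.sum().clamp(min=1.0)
```

In a score-distillation training loop, a term of this kind would presumably be added with a small weight to the distillation objective, steering the unconditional term toward geometrically ordered views.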
Related papers
- SeMv-3D: Towards Semantic and Mutil-view Consistency simultaneously for General Text-to-3D Generation with Triplane Priors [115.66850201977887]
We propose SeMv-3D, a novel framework for general text-to-3D generation.
We propose a Triplane Prior Learner that learns triplane priors with 3D spatial features to maintain consistency among different views at the 3D level.
We also design a Semantic-aligned View Synthesizer that preserves the alignment between 3D spatial features and textual semantics in latent space.
arXiv Detail & Related papers (2024-10-10T07:02:06Z)
- JointDreamer: Ensuring Geometry Consistency and Text Congruence in Text-to-3D Generation via Joint Score Distillation [38.32887919831611]
We propose Joint Score Distillation (JSD), a new paradigm that ensures coherent 3D generations.
JSD significantly mitigates the 3D inconsistency problem in Score Distillation Sampling.
Our framework, JointDreamer, establishes a new benchmark in text-to-3D generation.
arXiv Detail & Related papers (2024-07-17T03:23:47Z)
- Geometry-Aware Score Distillation via 3D Consistent Noising and Gradient Consistency Modeling [31.945761751215134]
We introduce 3D consistent noising, geometry-based gradient warping, and a novel gradient consistency loss.
We successfully address the geometric inconsistency problems in the text-to-3D generation task with minimal cost, while remaining compatible with existing score distillation-based models.
arXiv Detail & Related papers (2024-06-24T14:58:17Z)
- 3D Face Modeling via Weakly-supervised Disentanglement Network joint Identity-consistency Prior [62.80458034704989]
Generative 3D face models featuring disentangled controlling factors hold immense potential for diverse applications in computer vision and computer graphics.
Previous 3D face modeling methods face a challenge as they demand specific labels to effectively disentangle these factors.
This paper introduces a Weakly-Supervised Disentanglement Framework, denoted as WSDF, to facilitate the training of controllable 3D face models without an overly stringent labeling requirement.
arXiv Detail & Related papers (2024-04-25T11:50:47Z)
- GeoGS3D: Single-view 3D Reconstruction via Geometric-aware Diffusion Model and Gaussian Splatting [81.03553265684184]
We introduce GeoGS3D, a framework for reconstructing detailed 3D objects from single-view images.
We propose a novel metric, Gaussian Divergence Significance (GDS), to prune unnecessary operations during optimization.
Experiments demonstrate that GeoGS3D generates images with high consistency across views and reconstructs high-quality 3D objects.
arXiv Detail & Related papers (2024-03-15T12:24:36Z)
- Sculpt3D: Multi-View Consistent Text-to-3D Generation with Sparse 3D Prior [57.986512832738704]
We present a new framework Sculpt3D that equips the current pipeline with explicit injection of 3D priors from retrieved reference objects without re-training the 2D diffusion model.
Specifically, we demonstrate that high-quality and diverse 3D geometry can be guaranteed by keypoints supervision through a sparse ray sampling approach.
These two decoupled designs effectively harness 3D information from reference objects to generate 3D objects while preserving the generation quality of the 2D diffusion model.
arXiv Detail & Related papers (2024-03-14T07:39:59Z)
- Consistent-1-to-3: Consistent Image to 3D View Synthesis via Geometry-aware Diffusion Models [16.326276673056334]
Consistent-1-to-3 is a generative framework that significantly mitigates the cross-view inconsistency issue in novel view synthesis (NVS).
We decompose the NVS task into two stages: (i) transforming observed regions to a novel view, and (ii) hallucinating unseen regions.
We propose to employ epipolar-guided attention to incorporate geometry constraints, and multi-view attention to better aggregate multi-view information.
arXiv Detail & Related papers (2023-10-04T17:58:57Z)
- Chasing Consistency in Text-to-3D Generation from a Single Image [35.60887743544786]
We present Consist3D, a three-stage framework for semantic-, geometric-, and saturation-consistent text-to-3D generation from a single image.
Specifically, the semantic encoding stage learns a token independent of views and estimations, promoting semantic consistency and robustness.
The geometric encoding stage learns another token with comprehensive geometry and reconstruction constraints under novel-view estimations, reducing overfitting and encouraging geometric consistency.
arXiv Detail & Related papers (2023-09-07T09:50:48Z)
- Noisy-Correspondence Learning for Text-to-Image Person Re-identification [50.07634676709067]
We propose a novel Robust Dual Embedding method (RDE) to learn robust visual-semantic associations even with noisy correspondences.
Our method achieves state-of-the-art results both with and without synthetic noisy correspondences on three datasets.
arXiv Detail & Related papers (2023-08-19T05:34:13Z)
- Debiasing Scores and Prompts of 2D Diffusion for View-consistent Text-to-3D Generation [38.032010026146146]
We propose two approaches to debias the score-distillation frameworks for view-consistent text-to-3D generation.
One of the most notable issues is the Janus problem, where the most canonical view of an object appears in other views.
Our methods improve the realism of the generated 3D objects by significantly reducing artifacts and achieve a good trade-off between faithfulness to the 2D diffusion models and 3D consistency with little overhead.
arXiv Detail & Related papers (2023-03-27T17:31:13Z)
- Towards Realistic 3D Embedding via View Alignment [53.89445873577063]
This paper presents an innovative View Alignment GAN (VA-GAN) that composes new images by embedding 3D models into 2D background images realistically and automatically.
VA-GAN consists of a texture generator and a differential discriminator that are inter-connected and end-to-end trainable.
arXiv Detail & Related papers (2020-07-14T14:45:00Z)