Dive3D: Diverse Distillation-based Text-to-3D Generation via Score Implicit Matching
- URL: http://arxiv.org/abs/2506.13594v1
- Date: Mon, 16 Jun 2025 15:21:30 GMT
- Title: Dive3D: Diverse Distillation-based Text-to-3D Generation via Score Implicit Matching
- Authors: Weimin Bai, Yubo Li, Wenzheng Chen, Weijian Luo, He Sun
- Abstract summary: We introduce Dive3D, a novel text-to-3D generation framework that replaces KL-based objectives with Score Implicit Matching (SIM) loss. We validate Dive3D across various 2D-to-3D prompts and find that it consistently outperforms prior methods in qualitative assessments. Dive3D also achieves strong results on quantitative metrics, including text-asset alignment, 3D plausibility, text-geometry consistency, texture quality, and geometric detail.
- Score: 14.267619174518106
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Distilling pre-trained 2D diffusion models into 3D assets has driven remarkable advances in text-to-3D synthesis. However, existing methods typically rely on Score Distillation Sampling (SDS) loss, which involves asymmetric KL divergence--a formulation that inherently favors mode-seeking behavior and limits generation diversity. In this paper, we introduce Dive3D, a novel text-to-3D generation framework that replaces KL-based objectives with Score Implicit Matching (SIM) loss, a score-based objective that effectively mitigates mode collapse. Furthermore, Dive3D integrates both diffusion distillation and reward-guided optimization under a unified divergence perspective. Such reformulation, together with SIM loss, yields significantly more diverse 3D outputs while improving text alignment, human preference, and overall visual fidelity. We validate Dive3D across various 2D-to-3D prompts and find that it consistently outperforms prior methods in qualitative assessments, including diversity, photorealism, and aesthetic appeal. We further evaluate its performance on the GPTEval3D benchmark, comparing against nine state-of-the-art baselines. Dive3D also achieves strong results on quantitative metrics, including text-asset alignment, 3D plausibility, text-geometry consistency, texture quality, and geometric detail.
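For context on the KL-based objective that Dive3D replaces, the sketch below shows a standard SDS update in PyTorch style, with a comment on what SIM changes conceptually. This is a minimal illustration under assumed names (`render`, `pretrained_eps`, `text_emb`, `alphas`, `sigmas`); it is not Dive3D's actual implementation, whose SIM gradient is defined in the paper.

```python
# Minimal sketch of a Score Distillation Sampling (SDS) step, the KL-based
# baseline that Dive3D's SIM loss replaces. All names here (render,
# pretrained_eps, text_emb, alphas, sigmas) are illustrative assumptions,
# not Dive3D's API.
import torch

def sds_step(scene_params, render, pretrained_eps, text_emb, alphas, sigmas):
    x = render(scene_params)                 # differentiable render of the 3D scene
    t = int(torch.randint(20, 980, (1,)))    # random diffusion timestep
    eps = torch.randn_like(x)
    x_t = alphas[t] * x + sigmas[t] * eps    # forward-noise the rendered view
    with torch.no_grad():                    # the 2D diffusion prior stays frozen
        eps_hat = pretrained_eps(x_t, t, text_emb)
    w = sigmas[t] ** 2                       # one common weighting choice
    # SDS injects w * (eps_hat - eps) directly as the gradient on x: a
    # mode-seeking (asymmetric-KL) update that drags every view toward
    # high-density modes of the prior. SIM instead matches the score of the
    # distribution of rendered views against the prior's score, which the
    # abstract credits with mitigating this mode collapse.
    x.backward(gradient=w * (eps_hat - eps))
```

The design point the abstract makes is that the residual `(eps_hat - eps)` stems from an asymmetric KL divergence; swapping it for a score-matching discrepancy shifts the optimization target from "find one high-density mode" to "match the prior's score field", which is where the claimed diversity gain comes from.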
Related papers
- Bridging Geometry-Coherent Text-to-3D Generation with Multi-View Diffusion Priors and Gaussian Splatting [51.08718483081347]
We propose a framework that couples multi-view joint distribution priors to ensure geometrically consistent 3D generation. We derive an optimization rule that effectively couples multi-view priors to guide optimization across different viewpoints. We employ a deformable tetrahedral grid, initialized from 3D-GS and refined through CSD, to produce high-quality meshes.
arXiv Detail & Related papers (2025-05-07T09:12:45Z) - Semantic Score Distillation Sampling for Compositional Text-to-3D Generation [28.88237230872795]
Generating high-quality 3D assets from textual descriptions remains a pivotal challenge in computer graphics and vision research.
We introduce a novel SDS approach, designed to improve the expressiveness and accuracy of compositional text-to-3D generation.
Our approach integrates new semantic embeddings that maintain consistency across different rendering views.
By leveraging explicit semantic guidance, our method unlocks the compositional capabilities of existing pre-trained diffusion models.
arXiv Detail & Related papers (2024-10-11T17:26:00Z) - JointDreamer: Ensuring Geometry Consistency and Text Congruence in Text-to-3D Generation via Joint Score Distillation [38.32887919831611]
We propose Joint Score Distillation (JSD), a new paradigm that ensures coherent 3D generations.
JSD significantly mitigates the 3D inconsistency problem in Score Distillation Sampling.
Our framework, JointDreamer, establishes a new benchmark in text-to-3D generation.
arXiv Detail & Related papers (2024-07-17T03:23:47Z) - Geometry-Aware Score Distillation via 3D Consistent Noising and Gradient Consistency Modeling [31.945761751215134]
We introduce 3D-consistent noising, geometry-based gradient warping, and a novel gradient consistency loss.
We successfully address geometric inconsistency problems in the text-to-3D generation task with minimal cost while remaining compatible with existing score-distillation-based models.
arXiv Detail & Related papers (2024-06-24T14:58:17Z) - Retrieval-Augmented Score Distillation for Text-to-3D Generation [30.57225047257049]
We introduce a novel framework for retrieval-based quality enhancement in text-to-3D generation.
We conduct extensive experiments to demonstrate that ReDream exhibits superior quality with increased geometric consistency.
arXiv Detail & Related papers (2024-02-05T12:50:30Z) - Sherpa3D: Boosting High-Fidelity Text-to-3D Generation via Coarse 3D Prior [52.44678180286886]
Distilling 2D diffusion models achieves excellent generalization and rich details without any 3D data.
We propose Sherpa3D, a new text-to-3D framework that achieves high-fidelity, generalizability, and geometric consistency simultaneously.
arXiv Detail & Related papers (2023-12-11T18:59:18Z) - LucidDreamer: Towards High-Fidelity Text-to-3D Generation via Interval Score Matching [33.696757740830506]
Recent advancements in text-to-3D generation have shown promise.
Many methods build on Score Distillation Sampling (SDS).
We propose Interval Score Matching (ISM) to counteract over-smoothing.
arXiv Detail & Related papers (2023-11-19T09:59:09Z) - T$^3$Bench: Benchmarking Current Progress in Text-to-3D Generation [52.029698642883226]
Methods in text-to-3D leverage powerful pretrained diffusion models to optimize NeRFs.
Most studies evaluate their results with subjective case studies and user experiments.
We introduce T$^3$Bench, the first comprehensive text-to-3D benchmark.
arXiv Detail & Related papers (2023-10-04T17:12:18Z) - Guide3D: Create 3D Avatars from Text and Image Guidance [55.71306021041785]
Guide3D is a text-and-image-guided generative model for 3D avatar generation based on diffusion models.
Our framework produces topologically and structurally correct geometry and high-resolution textures.
arXiv Detail & Related papers (2023-08-18T17:55:47Z) - JOTR: 3D Joint Contrastive Learning with Transformers for Occluded Human Mesh Recovery [84.67823511418334]
This paper presents a 3D JOint contrastive learning with TRansformers (JOTR) framework for handling occluded 3D human mesh recovery.
Our method includes an encoder-decoder transformer architecture to fuse 2D and 3D representations, achieving 2D & 3D aligned results.
arXiv Detail & Related papers (2023-07-31T02:58:58Z) - Homography Loss for Monocular 3D Object Detection [54.04870007473932]
A differentiable loss function, termed Homography Loss, is proposed that exploits both 2D and 3D information.
Our method outperforms other state-of-the-art methods by a large margin on the KITTI 3D dataset.
arXiv Detail & Related papers (2022-04-02T03:48:03Z)