TV-3DG: Mastering Text-to-3D Customized Generation with Visual Prompt
- URL: http://arxiv.org/abs/2410.21299v2
- Date: Thu, 31 Oct 2024 02:13:44 GMT
- Title: TV-3DG: Mastering Text-to-3D Customized Generation with Visual Prompt
- Authors: Jiahui Yang, Donglin Di, Baorui Ma, Xun Yang, Yongjia Ma, Wenzhang Sun, Wei Chen, Jianxun Cui, Zhou Xue, Meng Wang, Yebin Liu
- Abstract summary: We propose a novel algorithm, Classifier Score Matching (CSM), which removes the difference term in Score Distillation Sampling (SDS).
We integrate visual prompt information with an attention fusion mechanism and sampling guidance techniques, forming the Visual Prompt CSM algorithm.
We present our approach as TV-3DG, with extensive experiments demonstrating its capability to achieve stable, high-quality, customized 3D generation.
- Score: 41.880416357543616
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, advancements in generative models have significantly expanded the capabilities of text-to-3D generation. Many approaches rely on Score Distillation Sampling (SDS) technology. However, SDS struggles to accommodate multi-condition inputs, such as text and visual prompts, in customized generation tasks. To explore the core reasons, we decompose SDS into a difference term and a classifier-free guidance term. Our analysis identifies the core issue as arising from the difference term and the random noise addition during the optimization process, both contributing to deviations from the target mode during distillation. To address this, we propose a novel algorithm, Classifier Score Matching (CSM), which removes the difference term in SDS and uses a deterministic noise addition process to reduce noise during optimization, effectively overcoming the low-quality limitations of SDS in our customized generation framework. Based on CSM, we integrate visual prompt information with an attention fusion mechanism and sampling guidance techniques, forming the Visual Prompt CSM (VPCSM) algorithm. Furthermore, we introduce a Semantic-Geometry Calibration (SGC) module to enhance quality through improved textual information integration. We present our approach as TV-3DG, with extensive experiments demonstrating its capability to achieve stable, high-quality, customized 3D generation. Project page: https://yjhboy.github.io/TV-3DG
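The decomposition mentioned in the abstract can be sketched as follows. This is a minimal sketch in the standard SDS notation from the literature, not the paper's own formulation. All symbols are assumptions introduced here: \theta are the 3D parameters, x the rendered image, x_t its noised version at timestep t, \epsilon the injected Gaussian noise, \epsilon_\phi the pretrained noise predictor, y the text prompt, \varnothing the empty prompt, \omega the guidance scale, and w(t) a timestep weight. The exact form of each term depends on the classifier-free guidance convention, and the paper's may differ.

```latex
% SDS gradient with a classifier-free-guided noise prediction (assumed notation)
\begin{align}
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}
  &= \mathbb{E}_{t,\epsilon}\!\left[ w(t)\,
     \bigl(\hat{\epsilon}_\phi(x_t; y, t) - \epsilon\bigr)\,
     \frac{\partial x}{\partial \theta} \right], \\
% One common guidance convention: extrapolate past the conditional prediction
\hat{\epsilon}_\phi(x_t; y, t)
  &= \epsilon_\phi(x_t; y, t)
   + \omega\,\bigl(\epsilon_\phi(x_t; y, t) - \epsilon_\phi(x_t; \varnothing, t)\bigr), \\
% Substituting the second line into the first splits the residual into two parts
\hat{\epsilon}_\phi(x_t; y, t) - \epsilon
  &= \underbrace{\epsilon_\phi(x_t; y, t) - \epsilon}_{\text{difference term}}
   + \omega\,\underbrace{\bigl(\epsilon_\phi(x_t; y, t) - \epsilon_\phi(x_t; \varnothing, t)\bigr)}_{\text{classifier-free guidance term}} .
\end{align}
```

Under this convention, dropping the difference term leaves only the guidance term in the update, which corresponds to the change the abstract attributes to CSM. The deterministic noise addition process is not shown here.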
Related papers
- CoherenDream: Boosting Holistic Text Coherence in 3D Generation via Multimodal Large Language Models Feedback [40.163073128022944]
Textual Coherent Score Distillation (TCSD) integrates alignment feedback from multimodal large language models (MLLMs).
3DLLaVA-CRITIC is a fine-tuned MLLM specialized for evaluating multiview text alignment in 3D generations.
CoherenDream establishes state-of-the-art performance in text-aligned 3D generation across multiple benchmarks.
arXiv Detail & Related papers (2025-04-28T14:50:45Z) - RewardSDS: Aligning Score Distillation via Reward-Weighted Sampling [14.725841457150414]
RewardSDS weights noise samples based on alignment scores from a reward model, producing a weighted SDS loss.
This loss prioritizes gradients from noise samples that yield aligned high-reward output.
We evaluate RewardSDS and RewardVSD on text-to-image, 2D editing, and text-to-3D generation tasks.
arXiv Detail & Related papers (2025-03-12T17:59:47Z) - Semantic Score Distillation Sampling for Compositional Text-to-3D Generation [28.88237230872795]
Generating high-quality 3D assets from textual descriptions remains a pivotal challenge in computer graphics and vision research.
We introduce a novel SDS approach, designed to improve the expressiveness and accuracy of compositional text-to-3D generation.
Our approach integrates new semantic embeddings that maintain consistency across different rendering views.
By leveraging explicit semantic guidance, our method unlocks the compositional capabilities of existing pre-trained diffusion models.
arXiv Detail & Related papers (2024-10-11T17:26:00Z) - MVGaussian: High-Fidelity text-to-3D Content Generation with Multi-View Guidance and Surface Densification [13.872254142378772]
This paper introduces a unified framework for text-to-3D content generation.
Our approach utilizes multi-view guidance to iteratively form the structure of the 3D model.
We also introduce a novel densification algorithm that aligns gaussians close to the surface.
arXiv Detail & Related papers (2024-09-10T16:16:34Z) - VividDreamer: Towards High-Fidelity and Efficient Text-to-3D Generation [69.68568248073747]
We propose Pose-dependent Consistency Distillation Sampling (PCDS), a novel yet efficient objective for diffusion-based 3D generation tasks.
PCDS builds the pose-dependent consistency function within diffusion trajectories, allowing true gradients to be approximated with minimal sampling steps.
For efficient generation, we propose a coarse-to-fine optimization strategy, which first utilizes 1-step PCDS to create the basic structure of 3D objects, and then gradually increases PCDS steps to generate fine-grained details.
arXiv Detail & Related papers (2024-06-21T08:21:52Z) - ExactDreamer: High-Fidelity Text-to-3D Content Creation via Exact Score Matching [10.362259643427526]
Current approaches often adapt pre-trained 2D diffusion models for 3D synthesis.
Over-smoothing poses a significant limitation on the high-fidelity generation of 3D models.
LucidDreamer replaces the Denoising Diffusion Probabilistic Model (DDPM) in SDS with the Denoising Diffusion Implicit Model (DDIM).
arXiv Detail & Related papers (2024-05-24T20:19:45Z) - Flow Score Distillation for Diverse Text-to-3D Generation [23.38418695449777]
Validation experiments across various text-to-image diffusion models demonstrate that Flow Score Distillation (FSD) substantially enhances generation diversity without compromising quality.
arXiv Detail & Related papers (2024-05-16T06:05:16Z) - S^2Former-OR: Single-Stage Bi-Modal Transformer for Scene Graph Generation in OR [50.435592120607815]
Scene graph generation (SGG) of surgical procedures is crucial in enhancing holistically cognitive intelligence in the operating room (OR).
Previous works have primarily relied on multi-stage learning, where the generated semantic scene graphs depend on intermediate processes with pose estimation and object detection.
In this study, we introduce a novel single-stage bi-modal transformer framework for SGG in the OR, termed S^2Former-OR.
arXiv Detail & Related papers (2024-02-22T11:40:49Z) - Learn to Optimize Denoising Scores for 3D Generation: A Unified and Improved Diffusion Prior on NeRF and 3D Gaussian Splatting [60.393072253444934]
We propose a unified framework aimed at enhancing the diffusion priors for 3D generation tasks.
We identify a divergence between the diffusion priors and the training procedures of diffusion models that substantially impairs the quality of 3D generation.
arXiv Detail & Related papers (2023-12-08T03:55:34Z) - StableDreamer: Taming Noisy Score Distillation Sampling for Text-to-3D [88.66678730537777]
We present StableDreamer, a methodology incorporating three advances.
First, we formalize the equivalence of the SDS generative prior and a simple supervised L2 reconstruction loss.
Second, our analysis shows that while image-space diffusion contributes to geometric precision, latent-space diffusion is crucial for vivid color rendition.
arXiv Detail & Related papers (2023-12-02T02:27:58Z) - LucidDreamer: Towards High-Fidelity Text-to-3D Generation via Interval Score Matching [33.696757740830506]
Recent advancements in text-to-3D generation have shown promise.
Many methods are based on Score Distillation Sampling (SDS).
We propose Interval Score Matching (ISM) to counteract over-smoothing.
arXiv Detail & Related papers (2023-11-19T09:59:09Z) - Coarse-to-Fine Sparse Transformer for Hyperspectral Image Reconstruction [138.04956118993934]
We propose a novel Transformer-based method, the coarse-to-fine sparse Transformer (CST).
CST embeds HSI sparsity into deep learning for HSI reconstruction.
In particular, CST uses our proposed spectra-aware screening mechanism (SASM) for coarse patch selecting. Then the selected patches are fed into our customized spectra-aggregation hashing multi-head self-attention (SAH-MSA) for fine pixel clustering and self-similarity capturing.
arXiv Detail & Related papers (2022-03-09T16:17:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.