CoT-lized Diffusion: Let's Reinforce T2I Generation Step-by-step
- URL: http://arxiv.org/abs/2507.04451v1
- Date: Sun, 06 Jul 2025 16:17:32 GMT
- Title: CoT-lized Diffusion: Let's Reinforce T2I Generation Step-by-step
- Authors: Zheyuan Liu, Munan Ning, Qihui Zhang, Shuo Yang, Zhongrui Wang, Yiwei Yang, Xianzhe Xu, Yibing Song, Weihua Chen, Fan Wang, Li Yuan
- Abstract summary: CoT-Diff is a framework that brings step-by-step CoT-style reasoning into T2I generation. CoT-Diff tightly integrates Multimodal Large Language Model (MLLM)-driven 3D layout planning with the diffusion process. Experiments on 3D Scene benchmarks show that CoT-Diff significantly improves spatial alignment and compositional fidelity.
- Score: 37.449561703903505
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current text-to-image (T2I) generation models struggle to align spatial composition with the input text, especially in complex scenes. Even layout-based approaches yield suboptimal spatial control, as their generation process is decoupled from layout planning, making it difficult to refine the layout during synthesis. We present CoT-Diff, a framework that brings step-by-step CoT-style reasoning into T2I generation by tightly integrating Multimodal Large Language Model (MLLM)-driven 3D layout planning with the diffusion process. CoT-Diff enables layout-aware reasoning inline within a single diffusion round: at each denoising step, the MLLM evaluates intermediate predictions, dynamically updates the 3D scene layout, and continuously guides the generation process. The updated layout is converted into semantic conditions and depth maps, which are fused into the diffusion model via a condition-aware attention mechanism, enabling precise spatial control and semantic injection. Experiments on 3D Scene benchmarks show that CoT-Diff significantly improves spatial alignment and compositional fidelity, and outperforms the state-of-the-art method by 34.7% in complex scene spatial accuracy, thereby validating the effectiveness of this entangled generation paradigm.
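The loop the abstract describes (plan a 3D layout, denoise one step, let the MLLM revise the layout, repeat) can be made concrete with a toy sketch. Everything below is a runnable stub under assumed names; none of it is the authors' code or API.

```python
import numpy as np

# Toy sketch of the inline CoT-Diff loop described in the abstract. Every
# component is a stub: the real system uses an MLLM for 3D layout planning
# and a diffusion model with condition-aware attention.

def plan_initial_layout(prompt):
    # Stub for MLLM-driven 3D layout planning: one random 3D box per word.
    return {tok: np.random.rand(6) for tok in prompt.split()}  # (x, y, z, w, h, d)

def layout_to_conditions(layout):
    # Stub for rendering the layout into semantic conditions and a depth map.
    semantics = np.stack(list(layout.values()))
    depth = np.full((64, 64), semantics[:, 2].mean())
    return semantics, depth

def denoise_step(x_t, t, semantics, depth):
    # Stub for one denoising step; in the real model the conditions are
    # fused via condition-aware attention.
    x0_pred = 0.9 * x_t
    return x_t - 0.02 * x_t, x0_pred

def mllm_update_layout(x0_pred, layout):
    # Stub: the MLLM inspects the intermediate prediction and revises boxes.
    return {k: v + 0.01 * np.random.randn(6) for k, v in layout.items()}

def cot_diff_sample(prompt, num_steps=50):
    layout = plan_initial_layout(prompt)              # initial 3D plan
    x_t = np.random.randn(64, 64, 3)                  # initial noise
    for t in range(num_steps):
        semantics, depth = layout_to_conditions(layout)
        x_t, x0_pred = denoise_step(x_t, t, semantics, depth)
        layout = mllm_update_layout(x0_pred, layout)  # inline layout update
    return x_t

image = cot_diff_sample("a cat to the left of a red cube")
```

The key structural point, which distinguishes this from decoupled layout-based approaches, is that the layout update sits inside the denoising loop rather than preceding it.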
Related papers
- PartDiffuser: Part-wise 3D Mesh Generation via Discrete Diffusion [14.879669869466072]
PartDiffuser is a novel semi-autoregressive diffusion framework for point-cloud-to-mesh generation. Built on the DiT architecture, it introduces a part-aware cross-attention mechanism. Experiments demonstrate that it significantly outperforms state-of-the-art (SOTA) models in generating 3D meshes with rich detail.
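The summary names a part-aware cross-attention mechanism without detail. One plausible reading, sketched purely as an assumption, is cross-attention masked so that mesh-token queries attend only to point-cloud tokens of the same part:

```python
import torch
import torch.nn.functional as F

# Hypothetical "part-aware" cross-attention: queries attend only to
# key/value tokens sharing their part label. The paper's exact mechanism
# is not specified in the summary; this masking scheme is an assumption.

def part_aware_cross_attention(q, k, v, q_parts, kv_parts):
    # q: (B, Nq, D) mesh-token queries; k, v: (B, Nk, D) point-cloud tokens
    # q_parts: (B, Nq) and kv_parts: (B, Nk) hold integer part labels
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5   # (B, Nq, Nk)
    same_part = q_parts.unsqueeze(-1) == kv_parts.unsqueeze(-2)
    scores = scores.masked_fill(~same_part, float("-inf"))  # block cross-part attention
    return F.softmax(scores, dim=-1) @ v                    # (B, Nq, D)

B, Nq, Nk, D = 2, 8, 15, 32
q, k, v = torch.randn(B, Nq, D), torch.randn(B, Nk, D), torch.randn(B, Nk, D)
kv_parts = (torch.arange(Nk) % 3).expand(B, Nk)   # every part id is present
q_parts = torch.randint(0, 3, (B, Nq))
out = part_aware_cross_attention(q, k, v, q_parts, kv_parts)
```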
arXiv Detail & Related papers (2025-11-24T06:11:21Z)
- Edit2Perceive: Image Editing Diffusion Models Are Strong Dense Perceivers [55.15722080205737]
Edit2Perceive is a unified diffusion framework that adapts editing models for depth, normal, and matting estimation. Our single-step deterministic inference yields faster runtime while training on relatively small datasets.
arXiv Detail & Related papers (2025-11-24T01:13:51Z)
- AnchorDS: Anchoring Dynamic Sources for Semantically Consistent Text-to-3D Generation [56.399153019429605]
This work shows that ignoring source dynamics yields inconsistent trajectories that suppress or merge semantic cues. We reformulate text-to-3D optimization as mapping a dynamically evolving source distribution to a fixed target distribution. We introduce AnchorDS, an improved score distillation mechanism that provides state-anchored guidance with image conditions.
arXiv Detail & Related papers (2025-11-12T09:51:23Z)
- SHaDe: Compact and Consistent Dynamic 3D Reconstruction via Tri-Plane Deformation and Latent Diffusion [0.0]
We present a novel framework for dynamic 3D scene reconstruction that integrates three key components: an explicit tri-plane deformation field, a view-conditioned canonical field with spherical harmonics (SH) attention, and a temporally-aware latent diffusion prior. Our method encodes 4D scenes using three 2D feature planes that evolve over time, enabling an efficient, compact representation.
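The tri-plane encoding mentioned here is a known representation: three axis-aligned 2D feature planes queried by projecting 3D points onto each plane and summing the sampled features. A generic, static-scene lookup is sketched below, leaving out SHaDe's temporal evolution of the planes.

```python
import torch
import torch.nn.functional as F

# Generic tri-plane feature lookup (EG3D-style), shown to illustrate the
# representation the summary refers to; the dynamic, time-evolving variant
# used by SHaDe is omitted.

def query_triplane(planes, xyz):
    # planes: (3, C, H, W) feature planes for the xy, xz, yz projections
    # xyz: (N, 3) query points in [-1, 1]^3
    proj = [xyz[:, [0, 1]], xyz[:, [0, 2]], xyz[:, [1, 2]]]
    feats = 0
    for plane, uv in zip(planes, proj):
        grid = uv.view(1, -1, 1, 2)                  # (1, N, 1, 2) sample grid
        sampled = F.grid_sample(plane.unsqueeze(0), grid,
                                mode="bilinear", align_corners=True)
        feats = feats + sampled.view(plane.shape[0], -1).t()  # (N, C) per plane
    return feats

planes = torch.randn(3, 16, 64, 64)
points = torch.rand(100, 3) * 2 - 1
features = query_triplane(planes, points)  # (100, 16), one feature per point
```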
arXiv Detail & Related papers (2025-05-22T11:25:38Z)
- ESPLoRA: Enhanced Spatial Precision with Low-Rank Adaption in Text-to-Image Diffusion Models for High-Definition Synthesis [45.625062335269355]
Diffusion models have revolutionized text-to-image (T2I) synthesis, producing high-quality, photorealistic images. However, they still struggle to properly render the spatial relationships described in text prompts. Our approach builds upon a curated dataset of spatially explicit prompts, meticulously extracted and synthesized from LAION-400M. We present ESPLoRA, a flexible fine-tuning framework based on Low-Rank Adaptation, to enhance spatial consistency in generative models.
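Low-Rank Adaptation itself is standard and easy to pin down. A minimal LoRA linear layer of the kind ESPLoRA builds on looks like this; the rank and scaling are illustrative defaults, not the paper's settings.

```python
import torch
import torch.nn as nn

# Minimal LoRA layer: the frozen pretrained weight is augmented with a
# trainable low-rank update B @ A, so fine-tuning touches few parameters.

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                  # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init
        self.scale = alpha / rank                    # standard LoRA scaling

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

layer = LoRALinear(nn.Linear(64, 64))
y = layer(torch.randn(4, 64))  # only A and B receive gradients
```

Because B is zero-initialized, the adapted layer starts out identical to the pretrained one, which keeps early fine-tuning stable.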
arXiv Detail & Related papers (2025-04-18T15:21:37Z)
- DisentTalk: Cross-lingual Talking Face Generation via Semantic Disentangled Diffusion Model [7.165879904419689]
DisentTalk presents a data-driven semantic disentanglement framework that decomposes 3DMM expression parameters into meaningful subspaces for fine-grained facial control. To address the scarcity of high-quality Chinese training data, we introduce CHDTF, a Chinese high-definition talking face dataset.
arXiv Detail & Related papers (2025-03-24T11:46:34Z)
- Learning to Align and Refine: A Foundation-to-Diffusion Framework for Occlusion-Robust Two-Hand Reconstruction [50.952228546326516]
Two-hand reconstruction from monocular images faces persistent challenges due to complex and dynamic hand postures. Existing approaches struggle with such alignment issues, often resulting in misalignment and penetration artifacts. We propose a dual-stage Foundation-to-Diffusion framework that precisely aligns 2D prior guidance from vision foundation models.
arXiv Detail & Related papers (2025-03-22T14:42:27Z)
- BIFRÖST: 3D-Aware Image Compositing with Language Instructions [27.484947109237964]
Bifr"ost is a novel 3D-aware framework that is built upon diffusion models to perform instruction-based image composition.
Bifr"ost addresses issues by training MLLM as a 2.5D location predictor and integrating depth maps as an extra condition during the generation process.
arXiv Detail & Related papers (2024-10-24T18:35:12Z)
- TS-HTFA: Advancing Time Series Forecasting via Hierarchical Text-Free Alignment with Large Language Models [14.411646409316624]
We introduce Hierarchical Text-Free Alignment (TS-HTFA), a novel method for time-series forecasting. We replace paired text data with adaptive virtual text based on QR decomposition of word embeddings and learnable prompts. Experiments on multiple time-series benchmarks demonstrate that TS-HTFA achieves state-of-the-art performance.
arXiv Detail & Related papers (2024-09-23T12:57:24Z)
- SyncNoise: Geometrically Consistent Noise Prediction for Text-based 3D Scene Editing [58.22339174221563]
We propose SyncNoise, a novel geometry-guided multi-view consistent noise editing approach for high-fidelity 3D scene editing.
SyncNoise synchronously edits multiple views with 2D diffusion models while enforcing multi-view noise predictions to be geometrically consistent.
Our method achieves high-quality 3D editing results that respect the textual instructions, especially in scenes with complex textures.
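As a rough illustration of the core idea, the sketch below averages per-view noise predictions over pixels that see the same 3D point. The synthetic correspondence map is a stand-in assumption for the geometry-derived correspondences the method would actually use.

```python
import torch

# Toy illustration of multi-view noise synchronization: predictions from
# several views are averaged over geometrically corresponding pixels.

def synchronize_noise(preds, corr_ids):
    # preds: (V, C, H, W) per-view noise predictions
    # corr_ids: (V, H, W) integer id of the 3D surface point each pixel sees
    V, C, H, W = preds.shape
    flat = preds.permute(0, 2, 3, 1).reshape(-1, C)      # (V*H*W, C)
    ids = corr_ids.reshape(-1)
    n = int(ids.max().item()) + 1
    sums = torch.zeros(n, C).index_add_(0, ids, flat)    # accumulate per 3D point
    counts = torch.zeros(n).index_add_(0, ids, torch.ones_like(ids, dtype=torch.float))
    means = sums / counts.clamp(min=1).unsqueeze(1)      # average over views
    return means[ids].view(V, H, W, C).permute(0, 3, 1, 2)

preds = torch.randn(4, 4, 32, 32)                        # 4 views, 4 latent channels
corr_ids = torch.randint(0, 500, (4, 32, 32))            # synthetic correspondences
consistent = synchronize_noise(preds, corr_ids)          # geometrically aligned noise
```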
arXiv Detail & Related papers (2024-06-25T09:17:35Z)
- SC-Diff: 3D Shape Completion with Latent Diffusion Models [4.261508855254493]
We present a novel 3D shape completion framework that unifies multimodal conditioning. Shapes are represented as Truncated Signed Distance Functions (TSDFs) and encoded into a discrete latent space jointly supervised by 2D and 3D cues. Our approach guides the generation process with flexible multimodal conditioning, ensuring consistent integration of 2D and 3D information.
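The TSDF representation mentioned here is standard: signed distances are clipped to a truncation band around the surface and normalized to [-1, 1]. A minimal construction follows, with illustrative grid size and truncation distance.

```python
import numpy as np

# Standard TSDF: clip signed distances to a truncation band and normalize.
# Grid resolution and truncation distance are illustrative choices.

def tsdf_from_sdf(sdf, trunc=0.1):
    return np.clip(sdf / trunc, -1.0, 1.0)

# Example: SDF of a sphere of radius 0.5 on a 32^3 grid over [-1, 1]^3.
axis = np.linspace(-1, 1, 32)
x, y, z = np.meshgrid(axis, axis, axis, indexing="ij")
sdf = np.sqrt(x**2 + y**2 + z**2) - 0.5
tsdf = tsdf_from_sdf(sdf)        # values in [-1, 1], zero at the surface
```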
arXiv Detail & Related papers (2024-03-19T06:01:11Z)
- R&B: Region and Boundary Aware Zero-shot Grounded Text-to-image Generation [74.5598315066249]
We probe into zero-shot grounded T2I generation with diffusion models.
We propose a Region and Boundary (R&B) aware cross-attention guidance approach.
arXiv Detail & Related papers (2023-10-13T05:48:42Z)
- Compositional 3D Scene Generation using Locally Conditioned Diffusion [49.5784841881488]
We introduce locally conditioned diffusion as an approach to compositional scene diffusion.
We demonstrate a score distillation sampling (SDS)-based text-to-3D synthesis pipeline that enables compositional 3D scene generation at a higher fidelity than relevant baselines.
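For context, score distillation sampling optimizes 3D scene parameters through the standard gradient introduced by DreamFusion, which pipelines like this one build on:

```latex
% Standard SDS gradient (DreamFusion); x = g(\theta) is the rendered image.
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}(\theta)
  = \mathbb{E}_{t,\epsilon}\!\left[ w(t)\,
      \big(\hat\epsilon_\phi(x_t;\, y,\, t) - \epsilon\big)
      \frac{\partial x}{\partial \theta} \right]
```

Here $x = g(\theta)$ is the rendered image, $x_t$ its noised version at timestep $t$, $\hat\epsilon_\phi$ the frozen diffusion model's noise prediction conditioned on text $y$, and $w(t)$ a timestep weighting.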
arXiv Detail & Related papers (2023-03-21T22:37:16Z)
- LayoutDiffusion: Improving Graphic Layout Generation by Discrete Diffusion Probabilistic Models [50.73105631853759]
We present a novel generative model named LayoutDiffusion for automatic layout generation.
It learns to reverse a mild forward process in which layouts become increasingly chaotic as the number of forward steps grows.
It enables two conditional layout generation tasks in a plug-and-play manner without re-training and achieves better performance than existing methods.
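The forward corruption in discrete diffusion can be illustrated generically. The D3PM-style uniform-transition step below is an assumption for illustration only; LayoutDiffusion's actual mild forward process differs in detail.

```python
import torch

# Generic discrete-diffusion forward step: with probability beta_t, each
# discretized layout token is resampled uniformly, so layouts grow more
# chaotic as t increases.

def forward_step(tokens, beta_t, vocab_size):
    resample = torch.rand_like(tokens, dtype=torch.float) < beta_t
    random_tokens = torch.randint_like(tokens, vocab_size)
    return torch.where(resample, random_tokens, tokens)

layout = torch.randint(0, 128, (1, 20))          # 20 discretized layout tokens
x_t = layout
for t in range(10):
    beta_t = 0.02 * (t + 1)                      # corruption grows with t
    x_t = forward_step(x_t, beta_t, vocab_size=128)
```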
arXiv Detail & Related papers (2023-03-21T04:41:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.