Related papers: 3D Space as a Scratchpad for Editable Text-to-Image Generation

URL: http://arxiv.org/abs/2601.14602v1
Date: Wed, 21 Jan 2026 02:40:19 GMT
Title: 3D Space as a Scratchpad for Editable Text-to-Image Generation
Authors: Oindrila Saha, Vojtech Krs, Radomir Mech, Subhransu Maji, Matheus Gadelha, Kevin Blackburn-Matzen
Abstract summary: We introduce the concept of a spatial scratchpad -- a 3D reasoning substrate that bridges linguistic intent and image synthesis. Our framework parses subjects and background elements, instantiates them as editable 3D meshes, and employs agentic scene planning for placement, orientation, and viewpoint selection. Unlike prior 2D layout-based methods, our approach supports intuitive 3D edits that propagate reliably into final images.
Score: 23.03603120388675
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent progress in large language models (LLMs) has shown that reasoning improves when intermediate thoughts are externalized into explicit workspaces, such as chain-of-thought traces or tool-augmented reasoning. Yet, visual language models (VLMs) lack an analogous mechanism for spatial reasoning, limiting their ability to generate images that accurately reflect geometric relations, object identities, and compositional intent. We introduce the concept of a spatial scratchpad -- a 3D reasoning substrate that bridges linguistic intent and image synthesis. Given a text prompt, our framework parses subjects and background elements, instantiates them as editable 3D meshes, and employs agentic scene planning for placement, orientation, and viewpoint selection. The resulting 3D arrangement is rendered back into the image domain with identity-preserving cues, enabling the VLM to generate spatially consistent and visually coherent outputs. Unlike prior 2D layout-based methods, our approach supports intuitive 3D edits that propagate reliably into final images. Empirically, it achieves a 32% improvement in text alignment on GenAI-Bench, demonstrating the benefit of explicit 3D reasoning for precise, controllable image generation. Our results highlight a new paradigm for vision-language models that deliberate not only in language, but also in space. Code and visualizations at https://oindrilasaha.github.io/3DScratchpad/

Related papers

Articulate3D: Zero-Shot Text-Driven 3D Object Posing [38.75075284385844]
We propose a training-free method, Articulate3D, to pose a 3D asset through language control. We modify a powerful image-generator to create target images conditioned on the input image and a text instruction. We then align the mesh to the target images through a multi-view pose optimisation step.
arXiv Detail & Related papers (2025-08-26T17:59:17Z)
UniUGG: Unified 3D Understanding and Generation via Geometric-Semantic Encoding [65.60549881706959]
We introduce UniUGG, the first unified understanding and generation framework for 3D modalities. Our framework employs an LLM to comprehend and decode sentences and 3D representations. We propose a spatial decoder leveraging a latent diffusion model to generate high-quality 3D representations.
arXiv Detail & Related papers (2025-08-16T07:27:31Z)
GenSpace: Benchmarking Spatially-Aware Image Generation [76.98817635685278]
Humans intuitively compose and arrange scenes in the 3D space for photography. Can advanced AI image generators plan scenes with similar 3D spatial awareness when creating images from text or image prompts? We present GenSpace, a novel benchmark and evaluation pipeline to assess the spatial awareness of current image generation models.
arXiv Detail & Related papers (2025-05-30T17:59:26Z)
SeMv-3D: Towards Concurrency of Semantic and Multi-view Consistency in General Text-to-3D Generation [122.47961178994456]
SeMv-3D is a novel framework that jointly enhances semantic alignment and multi-view consistency in GT23D generation. At its core, we introduce Triplane Prior Learning (TPL), which effectively learns triplane priors. We also present Prior-based Semantic Aligning in Triplanes (SAT), which enables consistent any-view synthesis.
arXiv Detail & Related papers (2024-10-10T07:02:06Z)
Weakly-Supervised 3D Visual Grounding based on Visual Language Alignment [24.63428589906294]
We propose a weakly supervised approach for 3D visual grounding based on Visual Linguistic Alignment. Our 3D-VLA exploits the superior ability of current large-scale vision-language models on aligning the semantics between texts and 2D images. During the inference stage, the learned text-3D correspondence will help us ground the text queries to the 3D target objects even without 2D images.
arXiv Detail & Related papers (2023-12-15T09:08:14Z)
TeMO: Towards Text-Driven 3D Stylization for Multi-Object Meshes [67.5351491691866]
We present a novel framework, dubbed TeMO, to parse multi-object 3D scenes and edit their styles. Our method can synthesize high-quality stylized content and outperform the existing methods over a wide range of multi-object 3D meshes.
arXiv Detail & Related papers (2023-12-07T12:10:05Z)
Sculpting Holistic 3D Representation in Contrastive Language-Image-3D Pre-training [51.632418297156605]
We introduce MixCon3D, a method aiming to sculpt holistic 3D representation in contrastive language-image-3D pre-training. We develop the 3D object-level representation from complementary perspectives, e.g., multi-view rendered images with the point cloud. Then, MixCon3D performs language-3D contrastive learning, comprehensively depicting real-world 3D objects and bolstering text alignment.
arXiv Detail & Related papers (2023-11-03T06:05:36Z)
Fantasia3D: Disentangling Geometry and Appearance for High-quality Text-to-3D Content Creation [45.69270771487455]
We propose a new method of Fantasia3D for high-quality text-to-3D content creation. Key to Fantasia3D is the disentangled modeling and learning of geometry and appearance. Our framework is more compatible with popular graphics engines, supporting relighting, editing, and physical simulation of the generated 3D assets.
arXiv Detail & Related papers (2023-03-24T09:30:09Z)
CLIP$^2$: Contrastive Language-Image-Point Pretraining from Real-World Point Cloud Data [80.42480679542697]
We propose Contrastive Language-Image-Point Cloud Pretraining (CLIP$^2$) to learn the transferable 3D point cloud representation in realistic scenarios. Specifically, we exploit naturally-existed correspondences in 2D and 3D scenarios, and build well-aligned and instance-based text-image-point proxies from those complex scenarios.
arXiv Detail & Related papers (2023-03-22T09:32:45Z)
PLA: Language-Driven Open-Vocabulary 3D Scene Understanding [57.47315482494805]
Open-vocabulary scene understanding aims to localize and recognize unseen categories beyond the annotated label space. Recent breakthrough of 2D open-vocabulary perception is driven by Internet-scale paired image-text data with rich vocabulary concepts. We propose to distill knowledge encoded in pre-trained vision-language (VL) foundation models through captioning multi-view images from 3D.
arXiv Detail & Related papers (2022-11-29T15:52:22Z)

This list is automatically generated from the titles and abstracts of the papers on this site.