Related papers: Geometry-Aware Scene-Consistent Image Generation

URL: http://arxiv.org/abs/2512.12598v1
Date: Sun, 14 Dec 2025 08:35:04 GMT
Title: Geometry-Aware Scene-Consistent Image Generation
Authors: Cong Xie, Che Wang, Yan Zhang, Zheng Pan, Han Zou, Zhenpeng Zhan,
Abstract summary: We study geometry-aware scene-consistent image generation. The goal is to synthesize an output image that preserves the same physical environment as the reference scene. We introduce two key contributions: (i) a scene-consistent data construction pipeline that generates diverse, geometrically-grounded training pairs, and (ii) a novel geometry-guided attention loss.
Score: 14.644679152141904
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We study geometry-aware scene-consistent image generation: given a reference scene image and a text condition specifying an entity to be generated in the scene and its spatial relation to the scene, the goal is to synthesize an output image that preserves the same physical environment as the reference scene while correctly generating the entity according to the spatial relation described in the text. Existing methods struggle to balance scene preservation with prompt adherence: they either replicate the scene with high fidelity but poor responsiveness to the prompt, or prioritize prompt compliance at the expense of scene consistency. To resolve this trade-off, we introduce two key contributions: (i) a scene-consistent data construction pipeline that generates diverse, geometrically-grounded training pairs, and (ii) a novel geometry-guided attention loss that leverages cross-view cues to regularize the model's spatial reasoning. Experiments on our scene-consistent benchmark show that our approach achieves better scene alignment and text-image consistency than state-of-the-art baselines, according to both automatic metrics and human preference studies. Our method produces geometrically coherent images with diverse compositions that remain faithful to the textual instructions and the underlying scene structure.

Related papers

All-in-One Conditioning for Text-to-Image Synthesis [45.22434803596108]
We propose a novel approach that grounds text-to-image synthesis within the framework of scene graph structures. We introduce a zero-shot, scene graph-based conditioning mechanism that generates soft visual guidance during inference. This enables the model to maintain text-image alignment while supporting lightweight, coherent, and diverse image synthesis.
arXiv Detail & Related papers (2026-02-09T20:16:19Z)
Geometric Disentanglement of Text Embeddings for Subject-Consistent Text-to-Image Generation using A Single Prompt [14.734857939203811]
We propose a training-free approach that addresses semantic entanglement from a subject perspective. Our approach significantly improves both subject consistency and text alignment over existing baselines.
arXiv Detail & Related papers (2025-12-18T11:55:06Z)
LayoutAgent: A Vision-Language Agent Guided Compositional Diffusion for Spatial Layout Planning [18.207887244259897]
Designing realistic multi-object scenes requires planning spatial layouts that respect semantic relations and physical plausibility. We propose LayoutAgent, an agentic framework that unifies vision-language reasoning with compositional diffusion for layout generation. Our method first employs a vision-language model to preprocess the inputs through segmentation, object size estimation, scene graph construction, and prompt rewriting. Finally, a foreground-conditioned image generator composes the complete scene by rendering the objects into the planned layout, guided by designed prompts.
arXiv Detail & Related papers (2025-09-24T20:41:04Z)
Towards Spatially Consistent Image Generation: On Incorporating Intrinsic Scene Properties into Diffusion Models [20.508585767918916]
In this work, we leverage intrinsic scene properties that provide rich information about the underlying scene. Our approach aims to co-generate both images and their corresponding intrinsics, enabling the model to implicitly capture the underlying scene structure. Experimental results demonstrate that our method corrects spatial inconsistencies and produces a more natural layout of scenes.
arXiv Detail & Related papers (2025-08-14T06:26:36Z)
LoCo: Locally Constrained Training-Free Layout-to-Image Synthesis [24.925757148750684]
We propose a training-free approach for layout-to-image synthesis that excels in producing high-quality images aligned with both textual prompts and layout instructions. LoCo seamlessly integrates into existing text-to-image and layout-to-image models, enhancing their performance in spatial control and addressing semantic failures observed in prior methods.
arXiv Detail & Related papers (2023-11-21T04:28:12Z)
LAW-Diffusion: Complex Scene Generation by Diffusion with Layouts [107.11267074981905]
We propose a semantically controllable layout-AWare diffusion model, termed LAW-Diffusion. We show that LAW-Diffusion yields the state-of-the-art generative performance, especially with coherent object relations.
arXiv Detail & Related papers (2023-08-13T08:06:18Z)
LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image Generation [121.45667242282721]
We propose a coarse-to-fine paradigm to achieve layout planning and image generation. Our proposed method outperforms the state-of-the-art models in terms of photorealistic layout and image generation.
arXiv Detail & Related papers (2023-08-09T17:45:04Z)
Self-Supervised Image Representation Learning with Geometric Set Consistency [50.12720780102395]
We propose a method for self-supervised image representation learning under the guidance of 3D geometric consistency. Specifically, we introduce 3D geometric consistency into a contrastive learning framework to enforce the feature consistency within image views.
arXiv Detail & Related papers (2022-03-29T08:57:33Z)
Improving Generation and Evaluation of Visual Stories via Semantic Consistency [72.00815192668193]
Given a series of natural language captions, an agent must generate a sequence of images that correspond to the captions. Prior work has introduced recurrent generative models which outperform text-to-image synthesis models on this task. We present a number of improvements to prior modeling approaches, including the addition of a dual learning framework.
arXiv Detail & Related papers (2021-05-20T20:42:42Z)
Person-in-Context Synthesis with Compositional Structural Space [59.129960774988284]
We propose a new problem, Persons in Context Synthesis, which aims to synthesize diverse person instance(s) in consistent contexts. The context is specified by the bounding box object layout, which lacks shape information, while the pose of the person(s) is specified by keypoints, which are sparsely annotated. To handle the stark difference in input structures, we propose two separate neural branches to attentively composite the respective (context/person) inputs into a shared "compositional structural space". This structural space is then decoded to the image space using a multi-level feature modulation strategy, and learned in a self
arXiv Detail & Related papers (2020-08-28T14:33:28Z)
Guidance and Evaluation: Semantic-Aware Image Inpainting for Mixed Scenes [54.836331922449666]
We propose a Semantic Guidance and Evaluation Network (SGE-Net) to update the structural priors and the inpainted image. It uses a semantic segmentation map as guidance at each scale of inpainting, under which location-dependent inferences are re-evaluated. Experiments on real-world images of mixed scenes demonstrate the superiority of our proposed method over state-of-the-art approaches.
arXiv Detail & Related papers (2020-03-15T17:49:20Z)
Scene Text Synthesis for Efficient and Effective Deep Network Training [62.631176120557136]
We develop an innovative image synthesis technique that composes annotated training images by embedding foreground objects of interest into background images. The proposed technique consists of two key components that in principle boost the usefulness of the synthesized images in deep network training. Experiments over a number of public datasets demonstrate the effectiveness of our proposed image synthesis technique.
arXiv Detail & Related papers (2019-01-26T10:15:24Z)

This list is automatically generated from the titles and abstracts of the papers on this site.