Structured Information for Improving Spatial Relationships in Text-to-Image Generation
- URL: http://arxiv.org/abs/2509.15962v1
- Date: Fri, 19 Sep 2025 13:20:34 GMT
- Title: Structured Information for Improving Spatial Relationships in Text-to-Image Generation
- Authors: Sander Schildermans, Chang Tian, Ying Jiao, Marie-Francine Moens
- Abstract summary: This work introduces a lightweight approach that augments prompts with structured information, using a fine-tuned language model for automatic conversion and seamless integration into T2I pipelines. Experimental results demonstrate substantial improvements in spatial accuracy, without compromising image quality as measured by Inception Score. This structured information provides a practical and portable solution to enhance spatial relationships in T2I generation, addressing a key limitation of current generative systems.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Text-to-image (T2I) generation has advanced rapidly, yet faithfully capturing spatial relationships described in natural language prompts remains a major challenge. Prior efforts have addressed this issue through prompt optimization, spatially grounded generation, and semantic refinement. This work introduces a lightweight approach that augments prompts with tuple-based structured information, using a fine-tuned language model for automatic conversion and seamless integration into T2I pipelines. Experimental results demonstrate substantial improvements in spatial accuracy, without compromising overall image quality as measured by Inception Score. Furthermore, the automatically generated tuples exhibit quality comparable to human-crafted tuples. This structured information provides a practical and portable solution to enhance spatial relationships in T2I generation, addressing a key limitation of current large-scale generative systems.
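The abstract describes converting a prompt into structured spatial tuples and feeding both to the T2I pipeline. As a rough illustration, here is a minimal rule-based sketch of that augmentation step; the `(object, relation, object)` tuple format and the `extract_tuples` heuristic are assumptions for illustration only, since the paper performs this conversion with a fine-tuned language model.

```python
# Hypothetical sketch of tuple-based prompt augmentation for a T2I pipeline.
# The tuple format and the rule-based extraction are illustrative assumptions;
# the paper uses a fine-tuned language model for the actual conversion.

def extract_tuples(prompt: str) -> list[tuple[str, str, str]]:
    """Naive stand-in for the fine-tuned LM: split a prompt of the form
    "<object A> <spatial relation> <object B>" on a known relation phrase."""
    relations = ["to the left of", "to the right of", "above", "below"]
    for rel in relations:
        if rel in prompt:
            left, right = prompt.split(rel, 1)
            return [(left.strip().rstrip(","), rel, right.strip())]
    return []

def augment_prompt(prompt: str) -> str:
    """Append the structured tuples to the original prompt before generation."""
    tuples = extract_tuples(prompt)
    if not tuples:
        return prompt
    structured = "; ".join(f"({a}, {rel}, {b})" for a, rel, b in tuples)
    return f"{prompt} | relations: {structured}"

print(augment_prompt("a cat to the left of a dog"))
# -> a cat to the left of a dog | relations: (a cat, to the left of, a dog)
```

The augmented string would then be passed to the T2I model in place of the raw prompt, which is what makes the approach portable across pipelines.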
Related papers
- RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling [59.088798018184235]
RAPO++ is a cross-stage prompt optimization framework. It unifies training-data-aligned refinement, test-time iterative scaling, and large language model fine-tuning. RAPO++ achieves significant gains in semantic alignment, compositional reasoning, temporal stability, and physical plausibility.
arXiv Detail & Related papers (2025-10-23T04:45:09Z)
- IAR2: Improving Autoregressive Visual Generation with Semantic-Detail Associated Token Prediction [77.06211178777939]
IAR2 is an advanced autoregressive framework that enables a hierarchical semantic-detail synthesis process. We show that IAR2 sets a new state-of-the-art for autoregressive image generation, achieving a FID of 1.50 on ImageNet.
arXiv Detail & Related papers (2025-10-08T12:08:21Z)
- ESPLoRA: Enhanced Spatial Precision with Low-Rank Adaption in Text-to-Image Diffusion Models for High-Definition Synthesis [45.625062335269355]
Diffusion models have revolutionized text-to-image (T2I) synthesis, producing high-quality, photorealistic images. However, they still struggle to properly render the spatial relationships described in text prompts. Our approach builds upon a curated dataset of spatially explicit prompts, meticulously extracted and synthesized from LAION-400M. We present ESPLoRA, a flexible fine-tuning framework based on Low-Rank Adaptation, to enhance spatial consistency in generative models.
arXiv Detail & Related papers (2025-04-18T15:21:37Z)
- Spatial Transport Optimization by Repositioning Attention Map for Training-Free Text-to-Image Synthesis [5.869767284889891]
Diffusion-based text-to-image (T2I) models have excelled in high-quality image generation. We propose STORM, a novel training-free approach for spatially coherent T2I synthesis.
arXiv Detail & Related papers (2025-03-28T06:12:25Z)
- Enhancing RWKV-based Language Models for Long-Sequence Text Generation [0.0]
This paper introduces an enhanced RWKV architecture with adaptive temporal gating mechanisms for improved long-context language modeling. We propose two principal innovations: (1) a position-aware convolutional shift operator that captures local syntactic patterns while preserving global coherence, and (2) a neurally-gated information routing mechanism that dynamically regulates inter-token information flow.
arXiv Detail & Related papers (2025-02-21T14:18:18Z)
- CoMPaSS: Enhancing Spatial Understanding in Text-to-Image Diffusion Models [18.89863162308386]
CoMPaSS is a versatile framework that enhances spatial understanding in T2I models. It first addresses data ambiguity with the Spatial Constraints-Oriented Pairing (SCOP) data engine. To leverage these priors, CoMPaSS also introduces the Token ENcoding ORdering (TENOR) module.
arXiv Detail & Related papers (2024-12-17T18:59:50Z)
- Text2Data: Low-Resource Data Generation with Textual Control [100.5970757736845]
Text2Data is a novel approach that utilizes unlabeled data to understand the underlying data distribution. It undergoes finetuning via a novel constraint optimization-based learning objective that ensures controllability and effectively counteracts catastrophic forgetting.
arXiv Detail & Related papers (2024-02-08T03:41:39Z)
- Towards Natural Language-Guided Drones: GeoText-1652 Benchmark with Spatial Relation Matching [60.645802236700035]
Navigating drones through natural language commands remains challenging due to the dearth of accessible multi-modal datasets.
We introduce GeoText-1652, a new natural language-guided geo-localization benchmark.
This dataset is systematically constructed through an interactive human-computer process.
arXiv Detail & Related papers (2023-11-21T17:52:30Z)
- Contrastive Transformer Learning with Proximity Data Generation for Text-Based Person Search [60.626459715780605]
Given a descriptive text query, text-based person search aims to retrieve the best-matched target person from an image gallery.
Such a cross-modal retrieval task is quite challenging due to significant modality gap, fine-grained differences and insufficiency of annotated data.
In this paper, we propose a simple yet effective dual Transformer model for text-based person search.
arXiv Detail & Related papers (2023-11-15T16:26:49Z)
- Benchmarking Spatial Relationships in Text-to-Image Generation [102.62422723894232]
We investigate the ability of text-to-image models to generate correct spatial relationships among objects.
We present VISOR, an evaluation metric that captures how accurately the spatial relationship described in text is generated in the image.
Our experiments reveal a surprising finding that, although state-of-the-art T2I models exhibit high image quality, they are severely limited in their ability to generate multiple objects or the specified spatial relations between them.
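VISOR scores how faithfully a generated image realizes the spatial relationship stated in the prompt. Its exact formulation is not given in this summary, so the following is only a toy sketch under the assumption that such a metric compares detected object bounding boxes; the `relation_holds` helper and centroid comparison are illustrative, not VISOR's actual definition.

```python
# Toy sketch of a spatial-relationship check over detected bounding boxes
# (an assumption about how a VISOR-style metric might test one relation).
# Boxes are (x0, y0, x1, y1) in image coordinates: x grows right, y grows down.

def centroid(box):
    """Center point of an axis-aligned bounding box."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)

def relation_holds(box_a, box_b, relation: str) -> bool:
    """Check whether object A stands in `relation` to object B by
    comparing bounding-box centroids."""
    (ax, ay), (bx, by) = centroid(box_a), centroid(box_b)
    return {
        "left of": ax < bx,
        "right of": ax > bx,
        "above": ay < by,   # smaller y is higher in image coordinates
        "below": ay > by,
    }[relation]

# A box on the left of the image vs. one on the right.
print(relation_holds((0, 0, 10, 10), (20, 0, 30, 10), "left of"))  # True
```

A full metric would additionally require both objects to be detected at all, which is why the benchmark can separate object-presence failures from spatial-relation failures.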
arXiv Detail & Related papers (2022-12-20T06:03:51Z)
- SDA: Improving Text Generation with Self Data Augmentation [88.24594090105899]
We propose to improve the standard maximum likelihood estimation (MLE) paradigm by incorporating a self-imitation-learning phase for automatic data augmentation.
Unlike most existing sentence-level augmentation strategies, our method is more general and could be easily adapted to any MLE-based training procedure.
arXiv Detail & Related papers (2021-01-02T01:15:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.