Related papers: CoMPaSS: Enhancing Spatial Understanding in Text-to-Image Diffusion Models

CoMPaSS: Enhancing Spatial Understanding in Text-to-Image Diffusion Models

URL: http://arxiv.org/abs/2412.13195v2
Date: Mon, 25 Aug 2025 17:59:59 GMT
Title: CoMPaSS: Enhancing Spatial Understanding in Text-to-Image Diffusion Models
Authors: Gaoyang Zhang, Bingtao Fu, Qingnan Fan, Qi Zhang, Runxing Liu, Hong Gu, Huaqi Zhang, Xinguo Liu,
Abstract summary: CoMPaSS is a versatile framework that enhances spatial understanding in T2I models.<n>It first addresses data ambiguity with the Spatial Constraints-Oriented Pairing (SCOP) data engine.<n>To leverage these priors, CoMPaSS also introduces the Token ENcoding ORdering (TENOR) module.
Score: 18.89863162308386
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Text-to-image (T2I) diffusion models excel at generating photorealistic images but often fail to render accurate spatial relationships. We identify two core issues underlying this common failure: 1) the ambiguous nature of data concerning spatial relationships in existing datasets, and 2) the inability of current text encoders to accurately interpret the spatial semantics of input descriptions. We propose CoMPaSS, a versatile framework that enhances spatial understanding in T2I models. It first addresses data ambiguity with the Spatial Constraints-Oriented Pairing (SCOP) data engine, which curates spatially-accurate training data via principled constraints. To leverage these priors, CoMPaSS also introduces the Token ENcoding ORdering (TENOR) module, which preserves crucial token ordering information lost by text encoders, thereby reinforcing the prompt's linguistic structure. Extensive experiments on four popular T2I models (UNet and MMDiT-based) show CoMPaSS sets a new state of the art on key spatial benchmarks, with substantial relative gains on VISOR (+98%), T2I-CompBench Spatial (+67%), and GenEval Position (+131%). Code is available at https://github.com/blurgyy/CoMPaSS.

Related papers

Circuit Mechanisms for Spatial Relation Generation in Diffusion Transformers [10.129229578687083]
Diffusion Transformers (DiTs) have greatly advanced text-to-image generation, but models still struggle to generate correct spatial relations between objects as specified in the text prompt.<n>We train, from scratch, DiTs of different sizes with different text encoders to learn to generate images containing two objects whose attributes and spatial relations are specified in the text prompt.<n>Although all the models can learn this task to near-perfect accuracy, the underlying mechanisms differ drastically depending on the choice of text encoder.
arXiv Detail & Related papers (2026-01-09T22:17:46Z)
InfSplign: Inference-Time Spatial Alignment of Text-to-Image Diffusion Models [27.206678799411645]
InfSplign is a training-free inference-time method for text-to-image models.<n>It improves spatial alignment by adjusting the noise through a compound loss in every denoising step.<n>It achieves substantial performance gains over the strongest existing inference-time baselines.
arXiv Detail & Related papers (2025-12-19T17:52:43Z)
ESPLoRA: Enhanced Spatial Precision with Low-Rank Adaption in Text-to-Image Diffusion Models for High-Definition Synthesis [45.625062335269355]
Diffusion models have revolutionized text-to-image (T2I) synthesis, producing high-quality, photorealistic images. However, they still struggle to properly render the spatial relationships described in text prompts. Our approach builds upon a curated dataset of spatially explicit prompts, meticulously extracted and synthesized from LAION-400M. We present ESPLoRA, a flexible fine-tuning framework based on Low-Rank Adaptation, to enhance spatial consistency in generative models.
arXiv Detail & Related papers (2025-04-18T15:21:37Z)
Spatial Transport Optimization by Repositioning Attention Map for Training-Free Text-to-Image Synthesis [5.869767284889891]
Diffusion-based text-to-image (T2I) models have excelled in high-quality image generation. We propose STORM, a novel training-free approach for spatially coherent T2I synthesis.
arXiv Detail & Related papers (2025-03-28T06:12:25Z)
Parallel Sequence Modeling via Generalized Spatial Propagation Network [80.66202109995726]
Generalized Spatial Propagation Network (GSPN) is a new attention mechanism for optimized vision tasks that inherently captures 2D spatial structures. GSPN overcomes limitations by directly operating on spatially coherent image data and forming dense pairwise connections through a line-scan approach. GSPN achieves superior spatial fidelity and state-of-the-art performance in vision tasks, including ImageNet classification, class-guided image generation, and text-to-image generation.
arXiv Detail & Related papers (2025-01-21T18:56:19Z)
HSLiNets: Hyperspectral Image and LiDAR Data Fusion Using Efficient Dual Non-Linear Feature Learning Networks [7.06787067270941]
The integration of hyperspectral imaging (HSI) and LiDAR data within new linear feature spaces offers a promising solution to the challenges posed by the high-dimensionality and redundancy inherent in HSIs.<n>This study introduces a dual linear fused space framework that capitalizes on bidirectional reversed convolutional neural network (CNN) pathways, coupled with a specialized spatial analysis block.<n>The proposed method not only enhances data processing and classification accuracy, but also mitigates the computational burden typically associated with advanced models such as Transformers.
arXiv Detail & Related papers (2024-11-30T01:08:08Z)
Cross-Modal Bidirectional Interaction Model for Referring Remote Sensing Image Segmentation [50.433911327489554]
The goal of referring remote sensing image segmentation (RRSIS) is to generate a pixel-level mask of the target object identified by the referring expression.<n>To address the aforementioned challenges, a novel RRSIS framework is proposed, termed the cross-modal bidirectional interaction model (CroBIM)<n>To further forster the research of RRSIS, we also construct RISBench, a new large-scale benchmark dataset comprising 52,472 image-language-label triplets.
arXiv Detail & Related papers (2024-10-11T08:28:04Z)
Getting it Right: Improving Spatial Consistency in Text-to-Image Models [103.52640413616436]
One of the key shortcomings in current text-to-image (T2I) models is their inability to consistently generate images which faithfully follow the spatial relationships specified in the text prompt. We create SPRIGHT, the first spatially focused, large-scale dataset, by re-captioning 6 million images from 4 widely used vision datasets. We find that training on images containing a larger number of objects leads to substantial improvements in spatial consistency, including state-of-the-art results on T2I-CompBench with a spatial score of 0.2133, by fine-tuning on 500 images.
arXiv Detail & Related papers (2024-04-01T15:55:25Z)
Relation Rectification in Diffusion Model [64.84686527988809]
We introduce a novel task termed Relation Rectification, aiming to refine the model to accurately represent a given relationship it initially fails to generate. We propose an innovative solution utilizing a Heterogeneous Graph Convolutional Network (HGCN) The lightweight HGCN adjusts the text embeddings generated by the text encoder, ensuring the accurate reflection of the textual relation in the embedding space.
arXiv Detail & Related papers (2024-03-29T15:54:36Z)
Contrastive Transformer Learning with Proximity Data Generation for Text-Based Person Search [60.626459715780605]
Given a descriptive text query, text-based person search aims to retrieve the best-matched target person from an image gallery. Such a cross-modal retrieval task is quite challenging due to significant modality gap, fine-grained differences and insufficiency of annotated data. In this paper, we propose a simple yet effective dual Transformer model for text-based person search.
arXiv Detail & Related papers (2023-11-15T16:26:49Z)
SSMG: Spatial-Semantic Map Guided Diffusion Model for Free-form Layout-to-Image Generation [68.42476385214785]
We propose a novel Spatial-Semantic Map Guided (SSMG) diffusion model that adopts the feature map, derived from the layout, as guidance. SSMG achieves superior generation quality with sufficient spatial and semantic controllability compared to previous works. We also propose the Relation-Sensitive Attention (RSA) and Location-Sensitive Attention (LSA) mechanisms.
arXiv Detail & Related papers (2023-08-20T04:09:12Z)
Benchmarking Spatial Relationships in Text-to-Image Generation [102.62422723894232]
We investigate the ability of text-to-image models to generate correct spatial relationships among objects. We present VISOR, an evaluation metric that captures how accurately the spatial relationship described in text is generated in the image. Our experiments reveal a surprising finding that, although state-of-the-art T2I models exhibit high image quality, they are severely limited in their ability to generate multiple objects or the specified spatial relations between them.
arXiv Detail & Related papers (2022-12-20T06:03:51Z)
Maximum Spatial Perturbation Consistency for Unpaired Image-to-Image Translation [56.44946660061753]
This paper proposes a universal regularization technique called maximum spatial perturbation consistency (MSPC) MSPC enforces a spatial perturbation function (T ) and the translation operator (G) to be commutative (i.e., TG = GT ) Our method outperforms the state-of-the-art methods on most I2I benchmarks.
arXiv Detail & Related papers (2022-03-23T19:59:04Z)
IGAN: Inferent and Generative Adversarial Networks [0.0]
IGAN learns both a generative and an inference model on a complex high dimensional data distribution. It extends the traditional GAN framework with inference by rewriting the adversarial strategy in both the image and the latent space.
arXiv Detail & Related papers (2021-09-27T21:48:35Z)
Causal Contextual Prediction for Learned Image Compression [36.08393281509613]
We propose the concept of separate entropy coding to leverage a serial decoding process for causal contextual entropy prediction in the latent space. A causal context model is proposed that separates the latents across channels and makes use of cross-channel relationships to generate highly informative contexts. We also propose a causal global prediction model, which is able to find global reference points for accurate predictions of unknown points.
arXiv Detail & Related papers (2020-11-19T08:15:10Z)

This list is automatically generated from the titles and abstracts of the papers in this site.