Why Settle for Mid: A Probabilistic Viewpoint to Spatial Relationship Alignment in Text-to-image Models
- URL: http://arxiv.org/abs/2506.23418v1
- Date: Sun, 29 Jun 2025 22:41:27 GMT
- Title: Why Settle for Mid: A Probabilistic Viewpoint to Spatial Relationship Alignment in Text-to-image Models
- Authors: Parham Rezaei, Arash Marioriyad, Mahdieh Soleymani Baghshah, Mohammad Hossein Rohban
- Abstract summary: A prevalent issue in compositional generation is the misalignment of spatial relationships. We introduce a novel evaluation metric designed to assess the alignment of 2D and 3D spatial relationships between text and image. We also propose PoS-based Generation, an inference-time method that improves the alignment of 2D and 3D spatial relationships in T2I models without requiring fine-tuning.
- Score: 3.5999252362400993
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Despite the ability of text-to-image models to generate high-quality, realistic, and diverse images, they face challenges in compositional generation, often struggling to accurately represent details specified in the input prompt. A prevalent issue in compositional generation is the misalignment of spatial relationships, as models often fail to faithfully generate images that reflect the spatial configurations specified between objects in the input prompts. To address this challenge, we propose a novel probabilistic framework for modeling the relative spatial positioning of objects in a scene, leveraging the concept of Probability of Superiority (PoS). Building on this insight, we make two key contributions. First, we introduce a novel evaluation metric, PoS-based Evaluation (PSE), designed to assess the alignment of 2D and 3D spatial relationships between text and image, with improved adherence to human judgment. Second, we propose PoS-based Generation (PSG), an inference-time method that improves the alignment of 2D and 3D spatial relationships in T2I models without requiring fine-tuning. PSG employs a PoS-based reward function that can be utilized in two distinct ways: (1) as a gradient-based guidance mechanism applied to the cross-attention maps during the denoising steps, or (2) as a search-based strategy that evaluates a set of initial noise vectors to select the best one. Extensive experiments demonstrate that the PSE metric exhibits stronger alignment with human judgment compared to traditional center-based metrics, providing a more nuanced and reliable measure of complex spatial relationship accuracy in text-image alignment. Furthermore, PSG significantly enhances the ability of text-to-image models to generate images with specified spatial configurations, outperforming state-of-the-art methods across multiple evaluation metrics and benchmarks.
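The core quantity in the abstract, Probability of Superiority, is a standard statistic: for two random positions X_A and X_B, PoS = P(X_A > X_B). Below is a minimal sketch of how a PoS-style spatial-relation score could be computed from per-object spatial weight maps (e.g., segmentation masks or cross-attention maps). The function name `pos_right_of`, the map inputs, the one-axis marginalization, and the toy arrays are illustrative assumptions for this sketch, not the authors' released implementation.

```python
import numpy as np

def pos_right_of(map_a: np.ndarray, map_b: np.ndarray) -> float:
    """Estimate P(X_A > X_B): the probability that object A lies to the
    right of object B, given non-negative 2D spatial weight maps of shape
    (H, W) for each object (e.g., segmentation masks or attention maps)."""
    # Marginalize each map onto the horizontal axis and normalize,
    # treating the result as a distribution over x-coordinates.
    px_a = map_a.sum(axis=0).astype(float)
    px_b = map_b.sum(axis=0).astype(float)
    px_a /= px_a.sum()
    px_b /= px_b.sum()
    # P(X_A > X_B) = sum over x of p_A(x) * P(X_B < x).
    cdf_b_strict = np.concatenate(([0.0], np.cumsum(px_b)[:-1]))
    return float(np.sum(px_a * cdf_b_strict))

# Toy check: a blob near the right edge vs. a blob near the left edge.
a = np.zeros((8, 8)); a[:, 6] = 1.0   # object A on the right
b = np.zeros((8, 8)); b[:, 1] = 1.0   # object B on the left
print(pos_right_of(a, b))  # ~1.0: "A right of B" clearly holds
print(pos_right_of(b, a))  # ~0.0: the relation is clearly violated
```

Scores near 0.5 would mark the ambiguous "mid" placements that all-or-nothing center-based metrics cannot express; in the search-based variant described above, such a score could serve as the reward for ranking generations from different initial noise vectors.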
Related papers
- SPADE: Spatial-Aware Denoising Network for Open-vocabulary Panoptic Scene Graph Generation with Long- and Local-range Context Reasoning [23.984926906083473]
Panoptic Scene Graph Generation (PSG) integrates instance segmentation with relation understanding to capture pixel-level structural relationships in complex scenes. Recent approaches leveraging pre-trained vision-language models (VLMs) have significantly improved performance in the open-vocabulary setting. We propose the SPADE (SPatial-Aware Denoising-nEtwork) framework, a novel approach for open-vocabulary PSG.
arXiv Detail & Related papers (2025-07-08T09:03:24Z)
- ComposeAnything: Composite Object Priors for Text-to-Image Generation [72.98469853839246]
ComposeAnything is a novel framework for improving compositional image generation without retraining existing T2I models. Our approach first leverages the chain-of-thought reasoning abilities of LLMs to produce 2.5D semantic layouts from text. Our model generates high-quality images with compositions that faithfully reflect the text.
arXiv Detail & Related papers (2025-05-30T00:13:36Z)
- ESPLoRA: Enhanced Spatial Precision with Low-Rank Adaption in Text-to-Image Diffusion Models for High-Definition Synthesis [45.625062335269355]
Diffusion models have revolutionized text-to-image (T2I) synthesis, producing high-quality, photorealistic images. However, they still struggle to properly render the spatial relationships described in text prompts. Our approach builds upon a curated dataset of spatially explicit prompts, meticulously extracted and synthesized from LAION-400M. We present ESPLoRA, a flexible fine-tuning framework based on Low-Rank Adaptation, to enhance spatial consistency in generative models.
arXiv Detail & Related papers (2025-04-18T15:21:37Z)
- Hierarchical Context Alignment with Disentangled Geometric and Temporal Modeling for Semantic Occupancy Prediction [61.484280369655536]
Camera-based 3D Semantic Occupancy Prediction (SOP) is crucial for understanding complex 3D scenes from limited 2D image observations. Existing SOP methods typically aggregate contextual features to assist the occupancy representation learning. We introduce a new hierarchical context alignment paradigm for more accurate SOP (Hi-SOP).
arXiv Detail & Related papers (2024-12-11T09:53:10Z)
- BIFRÖST: 3D-Aware Image Compositing with Language Instructions [27.484947109237964]
Bifr"ost is a novel 3D-aware framework that is built upon diffusion models to perform instruction-based image composition.
Bifr"ost addresses issues by training MLLM as a 2.5D location predictor and integrating depth maps as an extra condition during the generation process.
arXiv Detail & Related papers (2024-10-24T18:35:12Z)
- REVISION: Rendering Tools Enable Spatial Fidelity in Vision-Language Models [67.55362046790512]
Vision-language models lack the ability to correctly reason over spatial relationships.
We develop the REVISION framework, which improves spatial fidelity in vision-language models.
Our results and findings indicate that utilizing rendering-based frameworks is an effective approach for developing spatially-aware models.
arXiv Detail & Related papers (2024-08-05T04:51:46Z)
- Benchmarking Spatial Relationships in Text-to-Image Generation [102.62422723894232]
We investigate the ability of text-to-image models to generate correct spatial relationships among objects.
We present VISOR, an evaluation metric that captures how accurately the spatial relationship described in text is generated in the image.
Our experiments reveal a surprising finding: although state-of-the-art T2I models exhibit high image quality, they are severely limited in their ability to generate multiple objects or the specified spatial relations between them.
arXiv Detail & Related papers (2022-12-20T06:03:51Z)
- Learned Spatial Representations for Few-shot Talking-Head Synthesis [68.3787368024951]
We propose a novel approach for few-shot talking-head synthesis.
We show that this disentangled representation leads to a significant improvement over previous methods.
arXiv Detail & Related papers (2021-04-29T17:59:42Z)
- Inter-class Discrepancy Alignment for Face Recognition [55.578063356210144]
We propose a unified framework called Inter-class Discrepancy Alignment (IDA).
IDA-DAO is used to align the similarity scores, considering the discrepancy between an image and its neighbors.
IDA-SSE can provide convincing inter-class neighbors by introducing virtual candidate images generated with a GAN.
arXiv Detail & Related papers (2021-03-02T08:20:08Z)