Related papers: Getting it Right: Improving Spatial Consistency in Text-to-Image Models

Getting it Right: Improving Spatial Consistency in Text-to-Image Models

URL: http://arxiv.org/abs/2404.01197v1
Date: Mon, 1 Apr 2024 15:55:25 GMT
Title: Getting it Right: Improving Spatial Consistency in Text-to-Image Models
Authors: Agneet Chatterjee, Gabriela Ben Melech Stan, Estelle Aflalo, Sayak Paul, Dhruba Ghosh, Tejas Gokhale, Ludwig Schmidt, Hannaneh Hajishirzi, Vasudev Lal, Chitta Baral, Yezhou Yang,
Abstract summary: One of the key shortcomings in current text-to-image (T2I) models is their inability to consistently generate images which faithfully follow the spatial relationships specified in the text prompt. We create SPRIGHT, the first spatially-focused, large scale dataset, by re-captioning 6 million images from 4 widely used vision datasets. We attain state-of-the-art on T2I-CompBench with a spatial score of 0.2133, by fine-tuning on 500 images.
Score: 103.52640413616436
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: One of the key shortcomings in current text-to-image (T2I) models is their inability to consistently generate images which faithfully follow the spatial relationships specified in the text prompt. In this paper, we offer a comprehensive investigation of this limitation, while also developing datasets and methods that achieve state-of-the-art performance. First, we find that current vision-language datasets do not represent spatial relationships well enough; to alleviate this bottleneck, we create SPRIGHT, the first spatially-focused, large scale dataset, by re-captioning 6 million images from 4 widely used vision datasets. Through a 3-fold evaluation and analysis pipeline, we find that SPRIGHT largely improves upon existing datasets in capturing spatial relationships. To demonstrate its efficacy, we leverage only ~0.25% of SPRIGHT and achieve a 22% improvement in generating spatially accurate images while also improving the FID and CMMD scores. Secondly, we find that training on images containing a large number of objects results in substantial improvements in spatial consistency. Notably, we attain state-of-the-art on T2I-CompBench with a spatial score of 0.2133, by fine-tuning on <500 images. Finally, through a set of controlled experiments and ablations, we document multiple findings that we believe will enhance the understanding of factors that affect spatial consistency in text-to-image models. We publicly release our dataset and model to foster further research in this area.

Related papers

Why Settle for Mid: A Probabilistic Viewpoint to Spatial Relationship Alignment in Text-to-image Models [3.5999252362400993]
A prevalent issue in compositional generation is the misalignment of spatial relationships.<n>We introduce a novel evaluation metric designed to assess the alignment of 2D and 3D spatial relationships between text and image.<n>We also propose PoS-based Generation, an inference-time method that improves the alignment of 2D and 3D spatial relationships in T2I models without requiring fine-tuning.
arXiv Detail & Related papers (2025-06-29T22:41:27Z)
ESPLoRA: Enhanced Spatial Precision with Low-Rank Adaption in Text-to-Image Diffusion Models for High-Definition Synthesis [45.625062335269355]
Diffusion models have revolutionized text-to-image (T2I) synthesis, producing high-quality, photorealistic images. However, they still struggle to properly render the spatial relationships described in text prompts. Our approach builds upon a curated dataset of spatially explicit prompts, meticulously extracted and synthesized from LAION-400M. We present ESPLoRA, a flexible fine-tuning framework based on Low-Rank Adaptation, to enhance spatial consistency in generative models.
arXiv Detail & Related papers (2025-04-18T15:21:37Z)
CoMPaSS: Enhancing Spatial Understanding in Text-to-Image Diffusion Models [13.992486106252716]
CoMPaSS is a versatile training framework that enhances spatial understanding of any T2I diffusion model. CoMPaSS solves the ambiguity of spatial-related data with the Spatial Constraints-Oriented Pairing (SCOP) data engine. To better exploit the curated high-quality spatial priors, CoMPaSS introduces a Token ENcoding ORdering (TENOR) module.
arXiv Detail & Related papers (2024-12-17T18:59:50Z)
REVISION: Rendering Tools Enable Spatial Fidelity in Vision-Language Models [67.55362046790512]
Vision-language models lack the ability to correctly reason over spatial relationships. We develop the REVISION framework which improves spatial fidelity in vision-language models. Our results and findings indicate that utilizing rendering-based frameworks is an effective approach for developing spatially-aware models.
arXiv Detail & Related papers (2024-08-05T04:51:46Z)
Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation [63.15257949821558]
Referring Remote Sensing Image (RRSIS) is a new challenge that combines computer vision and natural language processing. Traditional Referring Image (RIS) approaches have been impeded by the complex spatial scales and orientations found in aerial imagery. We introduce the Rotated Multi-Scale Interaction Network (RMSIN), an innovative approach designed for the unique demands of RRSIS.
arXiv Detail & Related papers (2023-12-19T08:14:14Z)
Spatial Latent Representations in Generative Adversarial Networks for Image Generation [0.0]
We define a family of spatial latent spaces for StyleGAN2. We show that our spaces are effective for image manipulation and encode semantic information well.
arXiv Detail & Related papers (2023-03-25T20:01:11Z)
Benchmarking Spatial Relationships in Text-to-Image Generation [102.62422723894232]
We investigate the ability of text-to-image models to generate correct spatial relationships among objects. We present VISOR, an evaluation metric that captures how accurately the spatial relationship described in text is generated in the image. Our experiments reveal a surprising finding that, although state-of-the-art T2I models exhibit high image quality, they are severely limited in their ability to generate multiple objects or the specified spatial relations between them.
arXiv Detail & Related papers (2022-12-20T06:03:51Z)
Paired Cross-Modal Data Augmentation for Fine-Grained Image-to-Text Retrieval [142.047662926209]
We propose a novel framework for paired data augmentation by uncovering the hidden semantic information of StyleGAN2 model. We generate augmented text through random token replacement, then pass the augmented text into the latent space alignment module. We evaluate the efficacy of our augmented data approach on two public cross-modal retrieval datasets.
arXiv Detail & Related papers (2022-07-29T01:21:54Z)
Salient Objects in Clutter [130.63976772770368]
This paper identifies and addresses a serious design bias of existing salient object detection (SOD) datasets. This design bias has led to a saturation in performance for state-of-the-art SOD models when evaluated on existing datasets. We propose a new high-quality dataset and update the previous saliency benchmark.
arXiv Detail & Related papers (2021-05-07T03:49:26Z)
Using Text to Teach Image Retrieval [47.72498265721957]
We build on the concept of image manifold to represent the feature space of images, learned via neural networks, as a graph. We augment the manifold samples with geometrically aligned text, thereby using a plethora of sentences to teach us about images. The experimental results show that the joint embedding manifold is a robust representation, allowing it to be a better basis to perform image retrieval.
arXiv Detail & Related papers (2020-11-19T16:09:14Z)

This list is automatically generated from the titles and abstracts of the papers in this site.