Benchmarking Spatial Relationships in Text-to-Image Generation
- URL: http://arxiv.org/abs/2212.10015v3
- Date: Fri, 27 Oct 2023 17:24:04 GMT
- Title: Benchmarking Spatial Relationships in Text-to-Image Generation
- Authors: Tejas Gokhale, Hamid Palangi, Besmira Nushi, Vibhav Vineet, Eric
Horvitz, Ece Kamar, Chitta Baral, Yezhou Yang
- Abstract summary: We investigate the ability of text-to-image models to generate correct spatial relationships among objects.
We present VISOR, an evaluation metric that captures how accurately the spatial relationship described in text is generated in the image.
Our experiments reveal a surprising finding that, although state-of-the-art T2I models exhibit high image quality, they are severely limited in their ability to generate multiple objects or the specified spatial relations between them.
- Score: 102.62422723894232
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Spatial understanding is a fundamental aspect of computer vision and integral
for human-level reasoning about images, making it an important component for
grounded language understanding. While recent text-to-image synthesis (T2I)
models have shown unprecedented improvements in photorealism, it is unclear
whether they have reliable spatial understanding capabilities. We investigate
the ability of T2I models to generate correct spatial relationships among
objects and present VISOR, an evaluation metric that captures how accurately
the spatial relationship described in text is generated in the image. To
benchmark existing models, we introduce a dataset, $\mathrm{SR}_{2D}$, that
contains sentences describing two or more objects and the spatial relationships
between them. We construct an automated evaluation pipeline to recognize
objects and their spatial relationships, and employ it in a large-scale
evaluation of T2I models. Our experiments reveal a surprising finding that,
although state-of-the-art T2I models exhibit high image quality, they are
severely limited in their ability to generate multiple objects or the specified
spatial relations between them. Our analyses demonstrate several biases and
artifacts of T2I models such as the difficulty with generating multiple
objects, a bias towards generating the first object mentioned, spatially
inconsistent outputs for equivalent relationships, and a correlation between
object co-occurrence and spatial understanding capabilities. We conduct a human
study that shows the alignment between VISOR and human judgement about spatial
understanding. We offer the $\mathrm{SR}_{2D}$ dataset and the VISOR metric to
the community in support of T2I reasoning research.
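As a concrete illustration, below is a minimal sketch of a VISOR-style check under simplifying assumptions: bounding boxes come from some off-the-shelf object detector (detection itself is omitted), boxes are (x1, y1, x2, y2) in image coordinates with y growing downward, and a relation is decided by comparing box centroids. The helper names and the exact scoring rule are illustrative, not the paper's released implementation.

```python
# Sketch of a VISOR-style spatial-relationship score.
# Assumes detections are supplied as (x1, y1, x2, y2) boxes from any
# object detector; boxes are None when the detector misses an object.

def centroid(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)

def relation_holds(box_a, box_b, relation):
    """Check whether box_a stands in `relation` to box_b.
    Image coordinates: x grows rightward, y grows downward."""
    (ax, ay), (bx, by) = centroid(box_a), centroid(box_b)
    if relation == "left of":
        return ax < bx
    if relation == "right of":
        return ax > bx
    if relation == "above":
        return ay < by
    if relation == "below":
        return ay > by
    raise ValueError(f"unknown relation: {relation}")

def visor_score(samples):
    """samples: list of (box_a, box_b, relation) triples, one per
    generated image. Returns the fraction of images in which both
    objects were detected AND the stated relation holds."""
    correct = sum(
        1 for a, b, rel in samples
        if a is not None and b is not None and relation_holds(a, b, rel)
    )
    return correct / len(samples) if samples else 0.0
```

For example, `visor_score([((10, 10, 50, 50), (60, 10, 90, 50), "left of")])` evaluates to 1.0, since the first box's centroid lies to the left of the second's. The paper also reports conditional variants (e.g., relation correctness given that both objects were detected); this sketch collapses everything into a single unconditional fraction.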
Related papers
- Composing Object Relations and Attributes for Image-Text Matching [70.47747937665987]
This work introduces a dual-encoder image-text matching model, leveraging a scene graph to represent captions with nodes for objects and attributes interconnected by relational edges.
Our model efficiently encodes object-attribute and object-object semantic relations, resulting in a robust and fast-performing system.
arXiv Detail & Related papers (2024-06-17T17:56:01Z)
- Getting it Right: Improving Spatial Consistency in Text-to-Image Models [103.52640413616436]
One of the key shortcomings of current text-to-image (T2I) models is their inability to consistently generate images that faithfully follow the spatial relationships specified in the text prompt.
We create SPRIGHT, the first spatially focused, large-scale dataset, by re-captioning 6 million images from 4 widely used vision datasets.
We attain state-of-the-art on T2I-CompBench with a spatial score of 0.2133, by fine-tuning on 500 images.
arXiv Detail & Related papers (2024-04-01T15:55:25Z)
- DivCon: Divide and Conquer for Progressive Text-to-Image Generation [0.0]
Diffusion-driven text-to-image (T2I) generation has made remarkable progress.
We introduce a divide-and-conquer approach that decouples the T2I generation task into simple subtasks.
Our approach significantly improves controllability and consistency when generating multiple objects from complex textual prompts.
arXiv Detail & Related papers (2024-03-11T03:24:44Z)
- STUPD: A Synthetic Dataset for Spatial and Temporal Relation Reasoning [5.256237513030104]
We propose a large-scale video dataset for understanding spatial relationships derived from English prepositions.
The dataset contains 150K visual depictions (videos and images) covering 30 distinct spatial prepositional senses.
In addition to spatial relations, we provide 50K visual depictions across 10 temporal relations, consisting of videos that depict event/time-point interactions.
arXiv Detail & Related papers (2023-09-13T02:35:59Z)
- Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models [3.86170450233149]
We show that large vision-and-language models (VLMs) trained to match images with text lack fine-grained understanding of spatial relations.
We propose an alternative fine-grained, compositional approach for recognizing and ranking spatial clauses.
arXiv Detail & Related papers (2023-08-18T18:58:54Z)
- Intrinsic Relationship Reasoning for Small Object Detection [44.68289739449486]
Small objects in images and videos are usually not independent of one another; instead, they typically exhibit semantic and spatial layout relationships with each other.
We propose a novel context reasoning approach for small object detection which models and infers the intrinsic semantic and spatial layout relationships between objects.
arXiv Detail & Related papers (2020-09-02T06:03:05Z)
- DRG: Dual Relation Graph for Human-Object Interaction Detection [65.50707710054141]
We tackle the challenging problem of human-object interaction (HOI) detection.
Existing methods either recognize the interaction of each human-object pair in isolation or perform joint inference based on complex appearance-based features.
In this paper, we leverage an abstract spatial-semantic representation to describe each human-object pair and aggregate the contextual information of the scene via a dual relation graph.
arXiv Detail & Related papers (2020-08-26T17:59:40Z)
- Understanding Spatial Relations through Multiple Modalities [78.07328342973611]
Spatial relations between objects can be either explicit (expressed as spatial prepositions) or implicit (expressed by spatial verbs such as moving, walking, or shifting).
We introduce the task of inferring implicit and explicit spatial relations between two entities in an image.
We design a model that uses both textual and visual information to predict the spatial relations, making use of both positional and size information of objects and image embeddings.
arXiv Detail & Related papers (2020-07-19T01:35:08Z)
- Object-Centric Image Generation from Layouts [93.10217725729468]
We develop a layout-to-image-generation method to generate complex scenes with multiple objects.
Our method learns representations of the spatial relationships between objects in the scene, which improve the model's layout fidelity.
We introduce SceneFID, an object-centric adaptation of the popular Fréchet Inception Distance (FID) metric that is better suited for multi-object images.
arXiv Detail & Related papers (2020-03-16T21:40:09Z)
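Since SceneFID is described here only as an object-centric adaptation of FID, the following is a minimal sketch of that idea under stated assumptions: object crops have already been extracted from real and generated images and passed through a feature extractor (e.g., InceptionV3), so the inputs are two (N, D) feature arrays; the function name is illustrative.

```python
# Sketch of a SceneFID-style computation: the standard Fréchet distance
# applied to features of per-object crops rather than whole images.
# `real_feats` and `fake_feats` are (N, D) arrays of precomputed features.
import numpy as np
from scipy import linalg

def frechet_distance(real_feats, fake_feats):
    """Fréchet distance between Gaussians fitted to two feature sets."""
    mu1, mu2 = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    sigma1 = np.cov(real_feats, rowvar=False)
    sigma2 = np.cov(fake_feats, rowvar=False)
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    if np.iscomplexobj(covmean):  # numerical noise can add tiny imaginary parts
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```

The formula is the standard Fréchet distance between two Gaussians, $\|\mu_1 - \mu_2\|^2 + \mathrm{Tr}(\Sigma_1 + \Sigma_2 - 2(\Sigma_1 \Sigma_2)^{1/2})$; applying it to per-object crops rather than whole images is what makes the metric object-centric.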