Benchmarking Spatial Relationships in Text-to-Image Generation
- URL: http://arxiv.org/abs/2212.10015v3
- Date: Fri, 27 Oct 2023 17:24:04 GMT
- Title: Benchmarking Spatial Relationships in Text-to-Image Generation
- Authors: Tejas Gokhale, Hamid Palangi, Besmira Nushi, Vibhav Vineet, Eric
Horvitz, Ece Kamar, Chitta Baral, Yezhou Yang
- Abstract summary: We investigate the ability of text-to-image models to generate correct spatial relationships among objects.
We present VISOR, an evaluation metric that captures how accurately the spatial relationship described in text is generated in the image.
Our experiments reveal a surprising finding that, although state-of-the-art T2I models exhibit high image quality, they are severely limited in their ability to generate multiple objects or the specified spatial relations between them.
- Score: 102.62422723894232
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Spatial understanding is a fundamental aspect of computer vision and integral
for human-level reasoning about images, making it an important component for
grounded language understanding. While recent text-to-image synthesis (T2I)
models have shown unprecedented improvements in photorealism, it is unclear
whether they have reliable spatial understanding capabilities. We investigate
the ability of T2I models to generate correct spatial relationships among
objects and present VISOR, an evaluation metric that captures how accurately
the spatial relationship described in text is generated in the image. To
benchmark existing models, we introduce a dataset, $\mathrm{SR}_{2D}$, that
contains sentences describing two or more objects and the spatial relationships
between them. We construct an automated evaluation pipeline to recognize
objects and their spatial relationships, and employ it in a large-scale
evaluation of T2I models. Our experiments reveal a surprising finding that,
although state-of-the-art T2I models exhibit high image quality, they are
severely limited in their ability to generate multiple objects or the specified
spatial relations between them. Our analyses demonstrate several biases and
artifacts of T2I models such as the difficulty with generating multiple
objects, a bias towards generating the first object mentioned, spatially
inconsistent outputs for equivalent relationships, and a correlation between
object co-occurrence and spatial understanding capabilities. We conduct a human
study that shows the alignment between VISOR and human judgement about spatial
understanding. We offer the $\mathrm{SR}_{2D}$ dataset and the VISOR metric to
the community in support of T2I reasoning research.
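
To make the evaluation setup concrete, the sketch below shows how a VISOR-style check could be computed from detector outputs for an $\mathrm{SR}_{2D}$-style prompt such as "a cat to the left of a dog". The `Detection` structure, the centroid-based relation rule, and the scoring helper are illustrative assumptions rather than the paper's released implementation; in the actual pipeline an object recognizer supplies the detections.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str        # detected object class, e.g. "cat"
    x_center: float   # bounding-box centre, normalised to [0, 1]
    y_center: float

def infer_relation(a: Detection, b: Detection) -> str:
    """Dominant 2D relation of object `a` relative to `b`, judged from
    bounding-box centres (one simple convention; others are possible)."""
    dx, dy = a.x_center - b.x_center, a.y_center - b.y_center
    if abs(dx) >= abs(dy):
        return "to the right of" if dx > 0 else "to the left of"
    return "below" if dy > 0 else "above"   # image y-axis grows downward

def visor_style_score(samples) -> float:
    """Fraction of images in which both named objects are detected AND their
    generated relation matches the prompt (a VISOR-like accuracy)."""
    correct = 0
    for obj_a, relation, obj_b, detections in samples:
        found_a = next((d for d in detections if d.label == obj_a), None)
        found_b = next((d for d in detections if d.label == obj_b), None)
        if found_a and found_b and infer_relation(found_a, found_b) == relation:
            correct += 1
    return correct / max(len(samples), 1)

# One SR2D-style prompt, "a cat to the left of a dog", and one image's detections
samples = [("cat", "to the left of", "dog",
            [Detection("cat", 0.25, 0.5), Detection("dog", 0.75, 0.5)])]
print(visor_style_score(samples))   # -> 1.0
```

Note that this folds object presence and relation correctness into a single unconditional check, whereas the paper's metric treats these aspects separately.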
Related papers
- REVISION: Rendering Tools Enable Spatial Fidelity in Vision-Language Models [67.55362046790512]
Vision-language models lack the ability to correctly reason over spatial relationships.
We develop the REVISION framework which improves spatial fidelity in vision-language models.
Our results and findings indicate that utilizing rendering-based frameworks is an effective approach for developing spatially-aware models.
arXiv Detail & Related papers (2024-08-05T04:51:46Z) - A Plug-and-Play Method for Rare Human-Object Interactions Detection by Bridging Domain Gap [50.079224604394]
We present a novel model-agnostic framework called Context-Enhanced Feature Alignment (CEFA).
CEFA consists of a feature alignment module and a context enhancement module.
Our method can serve as a plug-and-play module to improve the detection performance of HOI models on rare categories.
arXiv Detail & Related papers (2024-07-31T08:42:48Z) - Composing Object Relations and Attributes for Image-Text Matching [70.47747937665987]
This work introduces a dual-encoder image-text matching model, leveraging a scene graph to represent captions with nodes for objects and attributes interconnected by relational edges.
Our model efficiently encodes object-attribute and object-object semantic relations, resulting in a robust and fast-performing system.
arXiv Detail & Related papers (2024-06-17T17:56:01Z) - Getting it Right: Improving Spatial Consistency in Text-to-Image Models [103.52640413616436]
One of the key shortcomings in current text-to-image (T2I) models is their inability to consistently generate images which faithfully follow the spatial relationships specified in the text prompt.
We create SPRIGHT, the first spatially focused, large-scale dataset, by re-captioning 6 million images from 4 widely used vision datasets.
We find that training on images containing a larger number of objects leads to substantial improvements in spatial consistency; by fine-tuning on 500 images, we achieve state-of-the-art results on T2I-CompBench with a spatial score of 0.2133.
arXiv Detail & Related papers (2024-04-01T15:55:25Z) - DivCon: Divide and Conquer for Progressive Text-to-Image Generation [0.0]
Diffusion-driven text-to-image (T2I) generation has achieved remarkable advancements.
Layout is employed as an intermediary to bridge large language models and layout-based diffusion models.
We introduce a divide-and-conquer approach which decouples the T2I generation task into simple subtasks.
arXiv Detail & Related papers (2024-03-11T03:24:44Z) - Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language
Models [3.86170450233149]
We show that large vision-and-language models (VLMs) trained to match images with text lack fine-grained understanding of spatial relations.
We propose an alternative fine-grained, compositional approach for recognizing and ranking spatial clauses.
arXiv Detail & Related papers (2023-08-18T18:58:54Z) - Intrinsic Relationship Reasoning for Small Object Detection [44.68289739449486]
Small objects in images and videos are usually not independent individuals. Instead, they more or less present some semantic and spatial layout relationships with each other.
We propose a novel context reasoning approach for small object detection which models and infers the intrinsic semantic and spatial layout relationships between objects.
arXiv Detail & Related papers (2020-09-02T06:03:05Z) - Understanding Spatial Relations through Multiple Modalities [78.07328342973611]
Spatial relations between objects can be either explicit, expressed as spatial prepositions, or implicit, expressed by spatial verbs such as moving, walking, or shifting.
We introduce the task of inferring implicit and explicit spatial relations between two entities in an image.
We design a model that uses both textual and visual information to predict the spatial relations, making use of both positional and size information of objects and image embeddings.
arXiv Detail & Related papers (2020-07-19T01:35:08Z)