Visual Spatial Reasoning
- URL: http://arxiv.org/abs/2205.00363v3
- Date: Wed, 22 Mar 2023 15:42:50 GMT
- Title: Visual Spatial Reasoning
- Authors: Fangyu Liu, Guy Emerson, Nigel Collier
- Abstract summary: We present a dataset containing more than 10k natural text-image pairs with 66 types of spatial relations in English.
We show how the dataset includes challenging linguistic phenomena, such as varying reference frames.
We demonstrate a large gap between human and model performance: the human ceiling is above 95%, while state-of-the-art models only achieve around 70%.
- Score: 35.5155400193075
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Spatial relations are a basic part of human cognition. However, they are
expressed in natural language in a variety of ways, and previous work has
suggested that current vision-and-language models (VLMs) struggle to capture
relational information. In this paper, we present Visual Spatial Reasoning
(VSR), a dataset containing more than 10k natural text-image pairs with 66
types of spatial relations in English (such as: under, in front of, and
facing). While using a seemingly simple annotation format, we show how the
dataset includes challenging linguistic phenomena, such as varying reference
frames. We demonstrate a large gap between human and model performance: the
human ceiling is above 95%, while state-of-the-art models only achieve around
70%. We observe that VLMs' by-relation performances have little correlation
with the number of training examples and the tested models are in general
incapable of recognising relations concerning the orientations of objects.
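To make the task format concrete, the following is a minimal sketch of how VSR-style examples could be scored: each example pairs an image with an English caption asserting a spatial relation plus a true/false label, and a model is judged by plain accuracy, both overall and per relation (matching the paper's by-relation analysis). The `VsrExample` class and the `predict_caption_is_true` callback below are hypothetical illustrations of the data layout, not part of the authors' released code.

```python
"""Minimal sketch of evaluating a model on VSR-style examples.

Assumption: each example is an (image, caption, relation, label) tuple,
where the label says whether the caption's spatial relation holds in the
image. Names here are illustrative, not from the VSR release.
"""

from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class VsrExample:
    image_path: str   # path to the image file
    caption: str      # caption asserting a spatial relation, e.g. "the cat is under the table"
    relation: str     # the relation expression, e.g. "under", "in front of", "facing"
    label: bool       # True if the caption correctly describes the image


def accuracy(
    examples: Iterable[VsrExample],
    predict_caption_is_true: Callable[[str, str], bool],
) -> float:
    """Overall accuracy of a binary true/false predictor on VSR-style examples."""
    correct = total = 0
    for ex in examples:
        pred = predict_caption_is_true(ex.image_path, ex.caption)
        correct += int(pred == ex.label)
        total += 1
    return correct / max(total, 1)


def per_relation_accuracy(
    examples: list[VsrExample],
    predict_caption_is_true: Callable[[str, str], bool],
) -> dict[str, float]:
    """Accuracy broken down by relation type, as in a by-relation analysis."""
    by_relation: dict[str, list[VsrExample]] = {}
    for ex in examples:
        by_relation.setdefault(ex.relation, []).append(ex)
    return {
        rel: accuracy(exs, predict_caption_is_true)
        for rel, exs in by_relation.items()
    }
```

Under this framing, the reported gap corresponds to a human predictor scoring above 0.95 while state-of-the-art VLM predictors score around 0.70 on the same examples.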
Related papers
- A Comprehensive Evaluation of Semantic Relation Knowledge of Pretrained Language Models and Humans [3.3311266423308252]
We introduce a comprehensive evaluation framework covering five relations beyond hypernymy, namely hyponymy, holonymy, meronymy, antonymy, and synonymy.
Our results reveal a significant knowledge gap between humans and models for almost all semantic relations.
arXiv Detail & Related papers (2024-12-02T05:11:34Z)
- Human-Object Interaction Detection Collaborated with Large Relation-driven Diffusion Models [65.82564074712836]
We introduce DIFfusionHOI, a new HOI detector shedding light on text-to-image diffusion models.
We first devise an inversion-based strategy to learn the expression of relation patterns between humans and objects in embedding space.
These learned relation embeddings then serve as textual prompts to steer diffusion models to generate images that depict specific interactions.
arXiv Detail & Related papers (2024-10-26T12:00:33Z)
- RelVAE: Generative Pretraining for few-shot Visual Relationship Detection [2.2230760534775915]
We present the first pretraining method for few-shot predicate classification that does not require any annotated relations.
We construct few-shot training splits and show quantitative experiments on VG200 and VRD datasets.
arXiv Detail & Related papers (2023-11-27T19:08:08Z)
- What's "up" with vision-language models? Investigating their struggle with spatial reasoning [76.2406963762722]
Three new corpora quantify model comprehension of basic spatial relations.
We evaluate 18 vision-language (VL) models, finding that all perform poorly.
We conclude by studying causes of this surprising behavior.
arXiv Detail & Related papers (2023-10-30T17:50:15Z)
- STUPD: A Synthetic Dataset for Spatial and Temporal Relation Reasoning [4.676784872259775]
We propose a large-scale video dataset for understanding spatial relationships derived from prepositions of the English language.
The dataset contains 150K visual depictions (videos and images), consisting of 30 distinct spatial prepositional senses.
In addition to spatial relations, we also propose 50K visual depictions across 10 temporal relations, consisting of videos depicting event/time-point interactions.
arXiv Detail & Related papers (2023-09-13T02:35:59Z)
- Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models [3.86170450233149]
We show that large vision-and-language models (VLMs) trained to match images with text lack fine-grained understanding of spatial relations.
We propose an alternative fine-grained, compositional approach for recognizing and ranking spatial clauses.
arXiv Detail & Related papers (2023-08-18T18:58:54Z)
- Unified Visual Relationship Detection with Vision and Language Models [89.77838890788638]
This work focuses on training a single visual relationship detector predicting over the union of label spaces from multiple datasets.
We propose UniVRD, a novel bottom-up method for Unified Visual Relationship Detection by leveraging vision and language models.
Empirical results on both human-object interaction detection and scene-graph generation demonstrate the competitive performance of our model.
arXiv Detail & Related papers (2023-03-16T00:06:28Z)
- Things not Written in Text: Exploring Spatial Commonsense from Visual Signals [77.46233234061758]
We investigate whether models with visual signals learn more spatial commonsense than text-based models.
We propose a benchmark that focuses on the relative scales of objects, and the positional relationship between people and objects under different actions.
We find that image synthesis models are more capable of learning accurate and consistent spatial knowledge than other models.
arXiv Detail & Related papers (2022-03-15T17:02:30Z)
- Relation-Guided Representation Learning [53.60351496449232]
We propose a new representation learning method that explicitly models and leverages sample relations.
Our framework well preserves the relations between samples.
By seeking to embed samples into a subspace, we show that our method can address the large-scale and out-of-sample problems.
arXiv Detail & Related papers (2020-07-11T10:57:45Z)