What's "up" with vision-language models? Investigating their struggle
with spatial reasoning
- URL: http://arxiv.org/abs/2310.19785v1
- Date: Mon, 30 Oct 2023 17:50:15 GMT
- Title: What's "up" with vision-language models? Investigating their struggle with spatial reasoning
- Authors: Amita Kamath, Jack Hessel, Kai-Wei Chang
- Abstract summary: Three new corpora quantify model comprehension of basic spatial relations.
We evaluate 18 vision-language (VL) models, finding that all perform poorly.
We conclude by studying causes of this surprising behavior.
- Score: 76.2406963762722
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent vision-language (VL) models are powerful, but can they reliably
distinguish "right" from "left"? We curate three new corpora to quantify model
comprehension of such basic spatial relations. These tests isolate spatial
reasoning more precisely than existing datasets like VQAv2, e.g., our What'sUp
benchmark contains sets of photographs varying only the spatial relations of
objects, keeping their identity fixed (see Figure 1: models must comprehend not
only the usual case of a dog under a table, but also, the same dog on top of
the same table). We evaluate 18 VL models, finding that all perform poorly,
e.g., BLIP finetuned on VQAv2, which nears human parity on VQAv2, achieves 56%
accuracy on our benchmarks vs. humans at 99%. We conclude by studying causes of
this surprising behavior, finding: 1) that popular vision-language pretraining
corpora like LAION-2B contain little reliable data for learning spatial
relationships; and 2) that basic modeling interventions like up-weighting
preposition-containing instances or fine-tuning on our corpora are not
sufficient to address the challenges our benchmarks pose. We are hopeful that
these corpora will facilitate further research, and we release our data and
code at https://github.com/amitakamath/whatsup_vlms.
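To make the evaluation protocol concrete, below is a minimal sketch (not the authors' released code; see the repository above for the actual benchmark) of the kind of image-text matching test What'sUp poses to contrastive models such as CLIP: score one photograph against two captions that differ only in their preposition, and check whether the model prefers the correct one. The image filename and the caption pair are illustrative placeholders.

```python
# Minimal sketch of a preposition-sensitivity check with CLIP via Hugging Face
# transformers. Assumptions: a local photo "dog_on_table.jpg" (hypothetical) and
# the public openai/clip-vit-base-patch32 checkpoint; the real What'sUp benchmark
# uses its own curated photographs and a wider set of spatial relations.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("dog_on_table.jpg")            # hypothetical example image
captions = ["a dog on a table", "a dog under a table"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image     # shape (1, 2): image-text similarity

pred = captions[logits.argmax(dim=-1).item()]
print(f"Model prefers: {pred!r}")                 # correct only if it picks "on", not "under"
```

Averaging this preference over many such minimal pairs yields an accuracy in the spirit of the numbers quoted above; a model that cannot tell "on" from "under" hovers near chance on them.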
Related papers
- Evaluating the Generation of Spatial Relations in Text and Image Generative Models [4.281091463408283]
Spatial relations are naturally understood in a visuo-spatial manner.
We develop an approach to convert LLM outputs into an image, thereby allowing us to evaluate both T2I models and LLMs.
Surprisingly, we found that T2I models only achieve subpar performance despite their impressive general image-generation abilities.
arXiv Detail & Related papers (2024-11-12T09:30:02Z)
- Revisiting Few-Shot Object Detection with Vision-Language Models [49.79495118650838]
We revisit the task of few-shot object detection (FSOD) in the context of recent foundational vision-language models (VLMs).
We propose Foundational FSOD, a new benchmark protocol that evaluates detectors pre-trained on any external data.
We discuss our recent CVPR 2024 Foundational FSOD competition and share insights from the community.
arXiv Detail & Related papers (2023-12-22T07:42:00Z)
- Unified Visual Relationship Detection with Vision and Language Models [89.77838890788638]
This work focuses on training a single visual relationship detector predicting over the union of label spaces from multiple datasets.
We propose UniVRD, a novel bottom-up method for Unified Visual Relationship Detection by leveraging vision and language models.
Empirical results on both human-object interaction detection and scene-graph generation demonstrate the competitive performance of our model.
arXiv Detail & Related papers (2023-03-16T00:06:28Z)
- Benchmarking Spatial Relationships in Text-to-Image Generation [102.62422723894232]
We investigate the ability of text-to-image models to generate correct spatial relationships among objects.
We present VISOR, an evaluation metric that captures how accurately the spatial relationship described in text is generated in the image.
Our experiments reveal a surprising finding that, although state-of-the-art T2I models exhibit high image quality, they are severely limited in their ability to generate multiple objects or the specified spatial relations between them.
arXiv Detail & Related papers (2022-12-20T06:03:51Z)
- StepGame: A New Benchmark for Robust Multi-Hop Spatial Reasoning in Texts [12.254118455438535]
In this paper, we present a new Question-Answering dataset called StepGame for robust multi-hop spatial reasoning in texts.
We also propose a Memory-Augmented Neural Network (TP-MANN) specialized for spatial reasoning tasks.
arXiv Detail & Related papers (2022-04-18T12:46:46Z)
- Partial success in closing the gap between human and machine vision [30.78663978510427]
A few years ago, the first CNN surpassed human performance on ImageNet.
Here we ask: Are we making progress in closing the gap between human and machine vision?
We tested human observers on a broad range of out-of-distribution (OOD) datasets.
arXiv Detail & Related papers (2021-06-14T13:23:35Z)
- COM2SENSE: A Commonsense Reasoning Benchmark with Complementary Sentences [21.11065466376105]
Commonsense reasoning is intuitive for humans but has been a long-term challenge for artificial intelligence (AI).
Recent advancements in pretrained language models have shown promising results on several commonsense benchmark datasets.
We introduce a new commonsense reasoning benchmark dataset comprising natural language true/false statements.
arXiv Detail & Related papers (2021-06-02T06:31:55Z)
- What's the best place for an AI conference, Vancouver or ______: Why completing comparative questions is difficult [22.04829832439774]
We study the ability of neural LMs to ask (not answer) reasonable questions.
We show that accuracy in this fill-in-the-blank task is well-correlated with human judgements of whether a question is reasonable.
arXiv Detail & Related papers (2021-04-05T14:56:09Z)
- Object-Centric Diagnosis of Visual Reasoning [118.36750454795428]
This paper presents a systematic object-centric diagnosis of visual reasoning on grounding and robustness.
We develop a diagnostic model, namely Graph Reasoning Machine.
Our model replaces the purely symbolic visual representation with a probabilistic scene graph and then applies teacher-forcing training for the visual reasoning module.
arXiv Detail & Related papers (2020-12-21T18:59:28Z)
- Rel3D: A Minimally Contrastive Benchmark for Grounding Spatial Relations in 3D [71.11034329713058]
Existing datasets lack large-scale, high-quality 3D ground truth information.
Rel3D is the first large-scale, human-annotated dataset for grounding spatial relations in 3D.
We propose minimally contrastive data collection -- a novel crowdsourcing method for reducing dataset bias.
arXiv Detail & Related papers (2020-12-03T01:51:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.