Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models
- URL: http://arxiv.org/abs/2308.09778v3
- Date: Wed, 6 Mar 2024 00:38:04 GMT
- Title: Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models
- Authors: Navid Rajabi, Jana Kosecka
- Abstract summary: We show that large vision-and-language models (VLMs) trained to match images with text lack fine-grained understanding of spatial relations.
We propose an alternative fine-grained, compositional approach for recognizing and ranking spatial clauses.
- Score: 3.86170450233149
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large vision-and-language models (VLMs) trained to match images with text on
large-scale datasets of image-text pairs have shown impressive generalization
ability on several vision and language tasks. Several recent works, however,
showed that these models lack fine-grained understanding, such as the ability
to count and recognize verbs, attributes, or relationships. The focus of this
work is to study the understanding of spatial relations. This has been tackled
previously using image-text matching (e.g., Visual Spatial Reasoning benchmark)
or visual question answering (e.g., GQA or VQAv2), both showing poor
performance and a large gap compared to human performance. In this work, we
show qualitatively (using explainability tools) and quantitatively (using
object detectors) that the poor object localization "grounding" ability of the
models is a contributing factor to the poor image-text matching performance. We
propose an alternative fine-grained, compositional approach for recognizing and
ranking spatial clauses that combines the evidence from grounding noun phrases
corresponding to objects and their locations to compute the final rank of the
spatial clause. We demonstrate the approach on representative VLMs (such as
LXMERT, GPV, and MDETR) and compare and highlight their abilities to reason
about spatial relationships.
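As a rough illustration of the compositional idea described in the abstract, one can ground the two noun phrases of a spatial clause with an off-the-shelf detector and then combine the detector confidences with a simple geometric check on the boxes. The sketch below is an assumption-laden approximation, not the authors' implementation: the box format, the displacement-based relation score, and all function names are hypothetical.

```python
# Illustrative sketch only (not the authors' implementation): score a spatial
# clause "subject RELATION object" by combining the detector's grounding
# confidence for each noun phrase with a simple geometric check on the boxes.
# Assumed box format: (x_min, y_min, x_max, y_max) in pixel coordinates,
# with y increasing downward. All names below are hypothetical.
import math
from typing import Tuple

Box = Tuple[float, float, float, float]


def center(box: Box) -> Tuple[float, float]:
    x0, y0, x1, y1 = box
    return (x0 + x1) / 2.0, (y0 + y1) / 2.0


def relation_score(subj_box: Box, obj_box: Box, relation: str) -> float:
    """Geometric plausibility of the relation in [0, 1], from box centers."""
    sx, sy = center(subj_box)
    ox, oy = center(obj_box)
    dx, dy = ox - sx, oy - sy          # displacement from subject to object
    norm = math.hypot(dx, dy) + 1e-6
    if relation == "left of":          # subject left of object: dx > 0
        return max(dx, 0.0) / norm
    if relation == "right of":
        return max(-dx, 0.0) / norm
    if relation == "above":            # subject above object: dy > 0 (y grows down)
        return max(dy, 0.0) / norm
    if relation == "below":
        return max(-dy, 0.0) / norm
    return 0.0                         # unknown relation


def clause_score(subj_conf: float, obj_conf: float,
                 subj_box: Box, obj_box: Box, relation: str) -> float:
    """Combine grounding evidence (detector confidences) with geometry."""
    return subj_conf * obj_conf * relation_score(subj_box, obj_box, relation)


# Usage: rank candidate spatial clauses for grounded phrases "cat" and "sofa",
# given (confidence, box) pairs from an off-the-shelf detector.
cat = (0.92, (40.0, 120.0, 160.0, 260.0))
sofa = (0.88, (150.0, 100.0, 480.0, 300.0))
relations = ["left of", "right of", "above", "below"]
ranked = sorted(relations, reverse=True,
                key=lambda r: clause_score(cat[0], sofa[0], cat[1], sofa[1], r))
print(ranked)  # "left of" ranks first for this example geometry
```

The multiplicative combination is only one plausible way to fuse grounding and relation evidence; the paper evaluates such compositional scoring on detectors and grounded VLMs such as LXMERT, GPV, and MDETR.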
Related papers
- ResVG: Enhancing Relation and Semantic Understanding in Multiple Instances for Visual Grounding [42.10086029931937]
Visual grounding aims to localize the object referred to in an image based on a natural language query.
Existing methods demonstrate a significant performance drop when there are multiple distractions in an image.
We propose a novel approach, the Relation and Semantic-sensitive Visual Grounding (ResVG) model, to address this issue.
arXiv Detail & Related papers (2024-08-29T07:32:01Z)
- GSR-BENCH: A Benchmark for Grounded Spatial Reasoning Evaluation via Multimodal LLMs [3.2688425993442696]
The ability to understand and reason about spatial relationships between objects in images is an important component of visual reasoning.
We extend the previously released What'sUp dataset and propose a novel comprehensive evaluation for spatial relationship understanding.
arXiv Detail & Related papers (2024-06-19T06:15:26Z)
- CODIS: Benchmarking Context-Dependent Visual Comprehension for Multimodal Large Language Models [58.95889895912716]
We introduce a new benchmark, named CODIS, designed to assess the ability of models to use context provided in free-form text to enhance visual comprehension.
Our findings indicate that MLLMs consistently fall short of human performance on this benchmark.
This underscores the pressing need to enhance the ability of MLLMs to comprehend visuals in a context-dependent manner.
arXiv Detail & Related papers (2024-02-21T08:21:12Z)
- Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality [50.48859793121308]
Contrastively trained vision-language models have achieved remarkable progress in vision and language representation learning.
Recent research has highlighted severe limitations in their ability to perform compositional reasoning over objects, attributes, and relations.
arXiv Detail & Related papers (2023-05-23T08:28:38Z)
- Unified Visual Relationship Detection with Vision and Language Models [89.77838890788638]
This work focuses on training a single visual relationship detector predicting over the union of label spaces from multiple datasets.
We propose UniVRD, a novel bottom-up method for Unified Visual Relationship Detection by leveraging vision and language models.
Empirical results on both human-object interaction detection and scene-graph generation demonstrate the competitive performance of our model.
arXiv Detail & Related papers (2023-03-16T00:06:28Z)
- Consensus Graph Representation Learning for Better Grounded Image Captioning [48.208119537050166]
We propose the Consensus Graph Representation Learning framework (CGRL) for grounded image captioning.
We validate the effectiveness of our model, with a significant decline in object hallucination (-9% CHAIRi) on the Flickr30k Entities dataset.
arXiv Detail & Related papers (2021-12-02T04:17:01Z)
- Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)
- Probing Contextual Language Models for Common Ground with Visual Representations [76.05769268286038]
We design a probing model that evaluates how effective text-only representations are in distinguishing between matching and non-matching visual representations (a minimal illustrative sketch of such a probe follows this list).
Our findings show that language representations alone provide a strong signal for retrieving image patches from the correct object categories.
Visually grounded language models slightly outperform text-only language models in instance retrieval, but greatly under-perform humans.
arXiv Detail & Related papers (2020-05-01T21:28:28Z)
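For intuition about the probing setup mentioned above, the snippet below trains a lightweight classifier on frozen embeddings to predict whether a text representation and a visual representation correspond. It is a generic sketch under stated assumptions, not the cited paper's model: the embedding dimensions, the logistic-regression probe, and the random placeholder vectors are all assumptions.

```python
# Rough sketch of a matching/non-matching probe over frozen embeddings.
# Assumptions (not from the cited paper): embeddings are precomputed and
# frozen, the probe is a logistic regression over concatenated
# [text; visual-patch] vectors, and random vectors stand in for real
# encoder outputs.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_pairs, text_dim, patch_dim = 1000, 768, 768

# Placeholder "frozen" features; in practice these would come from a
# contextual language model and a visual backbone, respectively.
text_emb = rng.normal(size=(n_pairs, text_dim))
patch_emb = rng.normal(size=(n_pairs, patch_dim))
labels = rng.integers(0, 2, size=n_pairs)  # 1 = matching pair, 0 = mismatched

features = np.concatenate([text_emb, patch_emb], axis=1)
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=0)

probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
print(f"probe accuracy: {probe.score(X_test, y_test):.3f}")  # ~0.5 on random data
```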
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.