Related papers: MET-Bench: Multimodal Entity Tracking for Evaluating the Limitations of Vision-Language and Reasoning Models

URL: http://arxiv.org/abs/2502.10886v1
Date: Sat, 15 Feb 2025 19:39:58 GMT
Title: MET-Bench: Multimodal Entity Tracking for Evaluating the Limitations of Vision-Language and Reasoning Models
Authors: Vanya Cohen, Raymond Mooney
Abstract summary: We introduce MET-Bench, a benchmark designed to evaluate the ability of vision-language models to track entity states across modalities. Our findings reveal a significant performance gap between text-based and image-based tracking and that this performance gap stems from deficits in visual reasoning rather than perception.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Entity tracking is a fundamental challenge in natural language understanding, requiring models to maintain coherent representations of entities. Previous work has benchmarked entity tracking performance in purely text-based tasks. We introduce MET-Bench, a multimodal entity tracking benchmark designed to evaluate the ability of vision-language models to track entity states across modalities. Using two structured domains, Chess and the Shell Game, we assess how effectively current models integrate textual and image-based state updates. Our findings reveal a significant performance gap between text-based and image-based tracking and that this performance gap stems from deficits in visual reasoning rather than perception. We further show that explicit text-based reasoning strategies improve performance, yet substantial limitations remain, especially in long-horizon multimodal scenarios. Our results highlight the need for improved multimodal representations and reasoning techniques to bridge the gap between textual and visual entity tracking.

Related papers

Vision-Language Models Struggle to Align Entities across Modalities [13.100184125419695]
Cross-modal entity linking is a fundamental skill needed for real-world applications such as multimodal code generation. Our benchmark, MATE, consists of 5.5k evaluation instances featuring visual scenes aligned with their textual representations. We evaluate state-of-the-art Vision-Language Models (VLMs) and humans on this task, and find that VLMs struggle significantly compared to humans.
arXiv Detail & Related papers (2025-03-05T19:36:43Z)
Cantor: Inspiring Multimodal Chain-of-Thought of MLLM [83.6663322930814]
We argue that converging visual context acquisition and logical reasoning is pivotal for tackling visual reasoning tasks. We propose an innovative multimodal CoT framework, termed Cantor, characterized by a perception-decision architecture. Our experiments demonstrate the efficacy of the proposed framework, showing significant improvements in multimodal CoT performance.
arXiv Detail & Related papers (2024-04-24T17:59:48Z)
CODIS: Benchmarking Context-Dependent Visual Comprehension for Multimodal Large Language Models [58.95889895912716]
We introduce a new benchmark, named CODIS, designed to assess the ability of models to use context provided in free-form text to enhance visual comprehension. Our findings indicate that MLLMs consistently fall short of human performance on this benchmark. This underscores the pressing need to enhance the ability of MLLMs to comprehend visuals in a context-dependent manner.
arXiv Detail & Related papers (2024-02-21T08:21:12Z)
ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models [92.60282074937305]
We introduce ConTextual, a novel dataset featuring human-crafted instructions that require context-sensitive reasoning for text-rich images. We conduct experiments to assess the performance of 14 foundation models and establish a human performance baseline. We observe a significant performance gap of 30.8% between GPT-4V and human performance.
arXiv Detail & Related papers (2024-01-24T09:07:11Z)
Towards More Unified In-context Visual Understanding [74.55332581979292]
We present a new ICL framework for visual understanding with multi-modal output enabled. First, we quantize and embed both text and visual prompt into a unified representational space. Then a decoder-only sparse transformer architecture is employed to perform generative modeling on them.
arXiv Detail & Related papers (2023-12-05T06:02:21Z)
OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models [122.27878464009181]
We conducted a comprehensive evaluation of Large Multimodal Models, such as GPT4V and Gemini, in various text-related visual tasks. OCRBench contains 29 datasets, making it the most comprehensive OCR evaluation benchmark available.
arXiv Detail & Related papers (2023-05-13T11:28:37Z)
A Multi-Modal Context Reasoning Approach for Conditional Inference on Joint Textual and Visual Clues [23.743431157431893]
Conditional inference on joint textual and visual clues is a multi-modal reasoning task. We propose a Multi-modal Context Reasoning approach, named ModCR. We conduct extensive experiments on two corresponding data sets and experimental results show significantly improved performance.
arXiv Detail & Related papers (2023-05-08T08:05:40Z)
Do Vision-and-Language Transformers Learn Grounded Predicate-Noun Dependencies? [0.06299766708197882]
We create a new task targeted at evaluating understanding of predicate-noun dependencies in a controlled setup. We evaluate a range of state-of-the-art models and find that their performance on the task varies considerably. This study highlights that targeted and controlled evaluations are a crucial step for a precise and rigorous test of the multimodal knowledge of vision-and-language models.
arXiv Detail & Related papers (2022-10-21T16:07:00Z)
MAF: Multimodal Alignment Framework for Weakly-Supervised Phrase Grounding [74.33171794972688]
We present algorithms to model phrase-object relevance by leveraging fine-grained visual representations and visually-aware language representations. Experiments conducted on the widely-adopted Flickr30k dataset show a significant improvement over existing weakly-supervised methods.
arXiv Detail & Related papers (2020-10-12T00:43:52Z)
Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation. Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning. During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)
Cross-Modality Relevance for Reasoning on Language and Vision [22.41781462637622]
This work deals with the challenge of learning and reasoning over language and vision data for related downstream tasks such as visual question answering (VQA) and natural language for visual reasoning (NLVR). We design a novel cross-modality relevance module that is used in an end-to-end framework to learn the relevance representation between components of various input modalities under the supervision of a target task. Our proposed approach shows competitive performance on two different language and vision tasks using public benchmarks and improves the state-of-the-art published results.
arXiv Detail & Related papers (2020-05-12T20:17:25Z)

This list is automatically generated from the titles and abstracts of the papers on this site.