MET-Bench: Multimodal Entity Tracking for Evaluating the Limitations of Vision-Language and Reasoning Models
- URL: http://arxiv.org/abs/2502.10886v1
- Date: Sat, 15 Feb 2025 19:39:58 GMT
- Title: MET-Bench: Multimodal Entity Tracking for Evaluating the Limitations of Vision-Language and Reasoning Models
- Authors: Vanya Cohen, Raymond Mooney,
- Abstract summary: We introduce MET-Bench, a benchmark designed to evaluate the ability of vision-language models to track entity states across modalities.
Our findings reveal a significant performance gap between text-based and image-based tracking and that this performance gap stems from deficits in visual reasoning rather than perception.
- Score: 0.0
- License:
- Abstract: Entity tracking is a fundamental challenge in natural language understanding, requiring models to maintain coherent representations of entities. Previous work has benchmarked entity tracking performance in purely text-based tasks. We introduce MET-Bench, a multimodal entity tracking benchmark designed to evaluate the ability of vision-language models to track entity states across modalities. Using two structured domains, Chess and the Shell Game, we assess how effectively current models integrate textual and image-based state updates. Our findings reveal a significant performance gap between text-based and image-based tracking and that this performance gap stems from deficits in visual reasoning rather than perception. We further show that explicit text-based reasoning strategies improve performance, yet substantial limitations remain, especially in long-horizon multimodal scenarios. Our results highlight the need for improved multimodal representations and reasoning techniques to bridge the gap between textual and visual entity tracking.
Related papers
- IP-MOT: Instance Prompt Learning for Cross-Domain Multi-Object Tracking [13.977088329815933]
Multi-Object Tracking (MOT) aims to associate multiple objects across video frames.
Most existing approaches train and track within a single domain, resulting in a lack of cross-domain generalizability.
We develop IP-MOT, an end-to-end transformer model for MOT that operates without concrete textual descriptions.
arXiv Detail & Related papers (2024-10-30T14:24:56Z) - Cantor: Inspiring Multimodal Chain-of-Thought of MLLM [83.6663322930814]
We argue that converging visual context acquisition and logical reasoning is pivotal for tackling visual reasoning tasks.
We propose an innovative multimodal CoT framework, termed Cantor, characterized by a perception-decision architecture.
Our experiments demonstrate the efficacy of the proposed framework, showing significant improvements in multimodal CoT performance.
arXiv Detail & Related papers (2024-04-24T17:59:48Z) - CODIS: Benchmarking Context-Dependent Visual Comprehension for Multimodal Large Language Models [58.95889895912716]
We introduce a new benchmark, named as CODIS, designed to assess the ability of models to use context provided in free-form text to enhance visual comprehension.
Our findings indicate that MLLMs consistently fall short of human performance on this benchmark.
This underscores the pressing need to enhance the ability of MLLMs to comprehend visuals in a context-dependent manner.
arXiv Detail & Related papers (2024-02-21T08:21:12Z) - ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models [92.60282074937305]
We introduce ConTextual, a novel dataset featuring human-crafted instructions that require context-sensitive reasoning for text-rich images.
We conduct experiments to assess the performance of 14 foundation models and establish a human performance baseline.
We observe a significant performance gap of 30.8% between GPT-4V and human performance.
arXiv Detail & Related papers (2024-01-24T09:07:11Z) - Towards More Unified In-context Visual Understanding [74.55332581979292]
We present a new ICL framework for visual understanding with multi-modal output enabled.
First, we quantize and embed both text and visual prompt into a unified representational space.
Then a decoder-only sparse transformer architecture is employed to perform generative modeling on them.
arXiv Detail & Related papers (2023-12-05T06:02:21Z) - A Multi-Modal Context Reasoning Approach for Conditional Inference on
Joint Textual and Visual Clues [23.743431157431893]
Conditional inference on joint textual and visual clues is a multi-modal reasoning task.
We propose a Multi-modal Context Reasoning approach, named ModCR.
We conduct extensive experiments on two corresponding data sets and experimental results show significantly improved performance.
arXiv Detail & Related papers (2023-05-08T08:05:40Z) - Do Vision-and-Language Transformers Learn Grounded Predicate-Noun
Dependencies? [0.06299766708197882]
We create a new task targeted at evaluating understanding of predicate-noun dependencies in a controlled setup.
We evaluate a range of state-of-the-art models and find that their performance on the task varies considerably.
This study highlights that targeted and controlled evaluations are a crucial step for a precise and rigorous test of the multimodal knowledge of vision-and-language models.
arXiv Detail & Related papers (2022-10-21T16:07:00Z) - MAF: Multimodal Alignment Framework for Weakly-Supervised Phrase
Grounding [74.33171794972688]
We present algorithms to model phrase-object relevance by leveraging fine-grained visual representations and visually-aware language representations.
Experiments conducted on the widely-adopted Flickr30k dataset show a significant improvement over existing weakly-supervised methods.
arXiv Detail & Related papers (2020-10-12T00:43:52Z) - Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z) - Cross-Modality Relevance for Reasoning on Language and Vision [22.41781462637622]
This work deals with the challenge of learning and reasoning over language and vision data for the related downstream tasks such as visual question answering (VQA) and natural language for visual reasoning (NLVR)
We design a novel cross-modality relevance module that is used in an end-to-end framework to learn the relevance representation between components of various input modalities under the supervision of a target task.
Our proposed approach shows competitive performance on two different language and vision tasks using public benchmarks and improves the state-of-the-art published results.
arXiv Detail & Related papers (2020-05-12T20:17:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.