Related papers: ING-VP: MLLMs cannot Play Easy Vision-based Games Yet

URL: http://arxiv.org/abs/2410.06555v1
Date: Wed, 9 Oct 2024 05:17:38 GMT
Title: ING-VP: MLLMs cannot Play Easy Vision-based Games Yet
Authors: Haoran Zhang, Hangyu Guo, Shuyue Guo, Meng Cao, Wenhao Huang, Jiaheng Liu, Ge Zhang
Abstract summary: Multimodal large language models (MLLMs) continue to demonstrate increasingly competitive performance across a broad spectrum of tasks. Existing multimodal benchmarks fall short in providing a focused evaluation of multi-step planning based on spatial relationships in images. We present ING-VP, the first INteractive Game-based Vision Planning benchmark, specifically designed to evaluate the spatial imagination and multi-step reasoning abilities of MLLMs.
Score: 40.851540679589256
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: As multimodal large language models (MLLMs) continue to demonstrate increasingly competitive performance across a broad spectrum of tasks, more intricate and comprehensive benchmarks have been developed to assess these cutting-edge models. These benchmarks introduce new challenges to core capabilities such as perception, reasoning, and planning. However, existing multimodal benchmarks fall short in providing a focused evaluation of multi-step planning based on spatial relationships in images. To bridge this gap, we present ING-VP, the first INteractive Game-based Vision Planning benchmark, specifically designed to evaluate the spatial imagination and multi-step reasoning abilities of MLLMs. ING-VP features 6 distinct games, encompassing 300 levels, each with 6 unique configurations. A single model engages in over 60,000 rounds of interaction. The benchmark framework allows for multiple comparison settings, including image-text vs. text-only inputs, single-step vs. multi-step reasoning, and with-history vs. without-history conditions, offering valuable insights into the model's capabilities. We evaluated numerous state-of-the-art MLLMs, with the highest-performing model, Claude-3.5 Sonnet, achieving an average accuracy of only 3.37%, far below the anticipated standard. This work aims to provide a specialized evaluation framework to drive advancements in MLLMs' capacity for complex spatial reasoning and planning. The code is publicly available at https://github.com/Thisisus7/ING-VP.git.

Related papers

Point-It-Out: Benchmarking Embodied Reasoning for Vision Language Models in Multi-Stage Visual Grounding [39.64540328712615]
Vision-Language Models (VLMs) have demonstrated impressive world knowledge across a wide range of tasks, making them promising candidates for embodied reasoning applications. Existing benchmarks primarily evaluate the embodied reasoning ability of VLMs through multiple-choice questions based on image annotations. We introduce the Point-It-Out benchmark, a novel benchmark designed to systematically assess the embodied reasoning abilities of VLMs through precise visual grounding.
arXiv Detail & Related papers (2025-09-30T05:05:54Z)
V-MAGE: A Game Evaluation Framework for Assessing Visual-Centric Capabilities in Multimodal Large Language Models [84.27290155010533]
V-MAGE is a game-based evaluation framework designed to assess visual reasoning capabilities of MLLMs. We use V-MAGE to evaluate leading MLLMs, revealing significant challenges in their visual perception and reasoning.
arXiv Detail & Related papers (2025-04-08T15:43:01Z)
NuPlanQA: A Large-Scale Dataset and Benchmark for Multi-View Driving Scene Understanding in Multi-Modal Large Language Models [11.184459657989914]
We introduce NuPlanQA-Eval, a multi-view, multi-modal evaluation benchmark for driving scene understanding. We also propose NuPlanQA-1M, a large-scale dataset comprising 1M real-world visual question-answering (VQA) pairs. Our evaluation results reveal key challenges that existing MLLMs face in driving scene-specific perception and spatial reasoning from ego-centric perspectives.
arXiv Detail & Related papers (2025-03-17T03:12:39Z)
EmbodiedEval: Evaluate Multimodal LLMs as Embodied Agents [57.4686961979566]
EmbodiedEval is a comprehensive and interactive evaluation benchmark for MLLMs with embodied tasks. It covers a broad spectrum of existing embodied AI tasks with significantly enhanced diversity. We evaluated the state-of-the-art MLLMs on EmbodiedEval and found that they have a significant shortfall compared to human level on embodied tasks.
arXiv Detail & Related papers (2025-01-21T03:22:10Z)
LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs [103.0226977561914]
We propose a comprehensive framework for advancing step-by-step visual reasoning in large language models. First, we introduce a visual reasoning benchmark specifically designed to evaluate multi-step reasoning tasks. Second, we propose a novel metric that assesses visual reasoning quality at the granularity of individual steps. Third, we present a new multimodal visual reasoning model, named LlamaV-o1, trained using a multi-step curriculum learning approach.
arXiv Detail & Related papers (2025-01-10T18:59:51Z)
Beyond Visual Understanding: Introducing PARROT-360V for Vision Language Model Benchmarking [0.12369742273401668]
We introduce the PARROT-360V Benchmark, a novel and comprehensive benchmark featuring 2487 challenging visual puzzles. We evaluate leading models: GPT-4o, Claude-3.5-Sonnet, and Gemini-1.5-Pro. State-of-the-art models scored between 28 and 56 percent on our benchmark, significantly lower than their performance on popular benchmarks.
arXiv Detail & Related papers (2024-11-20T01:09:21Z)
MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs [61.56904387052982]
This paper proposes a new visual grounding task called multi-context visual grounding. It aims to localize instances of interest across multiple images based on open-ended text prompts. We benchmark over 20 state-of-the-art MLLMs and foundation models with potential multi-context visual grounding capabilities.
arXiv Detail & Related papers (2024-10-16T07:52:57Z)
UniBench: Visual Reasoning Requires Rethinking Vision-Language Beyond Scaling [22.885385107905222]
We introduce UniBench, a unified implementation of 50+ vision-language model (VLM) benchmarks. We showcase the utility of UniBench for measuring progress by evaluating nearly 60 publicly available vision-language models. We also release an easy-to-run UniBench code-base with the full set of 50+ benchmarks and comparisons across 59 models as well as a distilled set of benchmarks that runs in 5 minutes on a single GPU.
arXiv Detail & Related papers (2024-08-09T01:41:05Z)
Task Me Anything [72.810309406219]
This paper produces a benchmark tailored to a user's needs. It contains 113K images, 10K videos, 2K 3D object assets, over 365 object categories, 655 attributes, and 335 relationships. It can generate 750M image/video question-answering pairs, which focus on evaluating perceptual capabilities.
arXiv Detail & Related papers (2024-06-17T17:32:42Z)
MARVEL: Multidimensional Abstraction and Reasoning through Visual Evaluation and Learning [22.440669015518015]
We evaluate whether multi-modal large language models (MLLMs) possess abstract visual reasoning abilities. Similar to Sudoku puzzles, abstract visual reasoning (AVR) problems require finding high-level patterns. We introduce MARVEL, a benchmark with 770 puzzles composed of six core knowledge patterns, geometric and abstract shapes, and five different task configurations.
arXiv Detail & Related papers (2024-04-21T09:15:02Z)
Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models [73.40350756742231]
Visually-conditioned language models (VLMs) have seen growing adoption in applications such as visual dialogue, scene understanding, and robotic task planning. Despite the volume of new releases, key design decisions around image preprocessing, architecture, and optimization are under-explored.
arXiv Detail & Related papers (2024-02-12T18:21:14Z)
SEED-Bench-2: Benchmarking Multimodal Large Language Models [67.28089415198338]
Multimodal large language models (MLLMs) have recently demonstrated exceptional capabilities in generating not only texts but also images given interleaved multimodal inputs. SEED-Bench-2 comprises 24K multiple-choice questions with accurate human annotations, which span 27 dimensions. We evaluate the performance of 23 prominent open-source MLLMs and summarize valuable observations.
arXiv Detail & Related papers (2023-11-28T05:53:55Z)
MAgIC: Investigation of Large Language Model Powered Multi-Agent in Cognition, Adaptability, Rationality and Collaboration [102.41118020705876]
Large Language Models (LLMs) have marked a significant advancement in the field of natural language processing. As their applications extend into multi-agent environments, a need has arisen for a comprehensive evaluation framework. This work introduces a novel benchmarking framework specifically tailored to assess LLMs within multi-agent settings.
arXiv Detail & Related papers (2023-11-14T21:46:27Z)

This list is automatically generated from the titles and abstracts of the papers on this site.