V-MAGE: A Game Evaluation Framework for Assessing Vision-Centric Capabilities in Multimodal Large Language Models
- URL: http://arxiv.org/abs/2504.06148v2
- Date: Fri, 16 May 2025 12:29:40 GMT
- Title: V-MAGE: A Game Evaluation Framework for Assessing Vision-Centric Capabilities in Multimodal Large Language Models
- Authors: Xiangxi Zheng, Linjie Li, Zhengyuan Yang, Ping Yu, Alex Jinpeng Wang, Rui Yan, Yuan Yao, Lijuan Wang
- Abstract summary: We introduce Vision-centric Multiple Abilities Game Evaluation (V-MAGE), a novel game-based evaluation framework. V-MAGE features five distinct video games comprising over 30 carefully constructed evaluation scenarios. We show that V-MAGE provides actionable insights for improving the visual and reasoning capabilities of MLLMs in dynamic, interactive settings.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in visual-text processing. However, existing static image-text benchmarks are insufficient for evaluating their dynamic perception and interactive reasoning abilities. We introduce Vision-centric Multiple Abilities Game Evaluation (V-MAGE), a novel game-based evaluation framework designed to systematically assess MLLMs' visual reasoning in interactive, continuous-space environments. V-MAGE features five distinct video games comprising over 30 carefully constructed evaluation scenarios. These scenarios are set in free-form, visually complex environments that require models to interpret dynamic game states and make decisions based solely on visual input, thereby closely reflecting the conditions encountered by human players. To ensure robust and interpretable comparisons across models, V-MAGE employs a dynamic Elo-based ranking system that accounts for varying difficulty levels and task diversity. Benchmarking state-of-the-art MLLMs against human baselines reveals that while leading models approach human-level performance in simple tasks, their performance drops significantly in complex scenarios requiring advanced reasoning and task orchestration. This persistent performance gap highlights fundamental limitations in current MLLMs' ability to perform real-time, vision-grounded interactions. Through extensive analyses, we demonstrate the utility of V-MAGE in uncovering these limitations and providing actionable insights for improving the visual and reasoning capabilities of MLLMs in dynamic, interactive settings. Code is publicly available at https://github.com/CSU-JPG/V-MAGE.
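The abstract's Elo-based ranking can be grounded with the standard Elo update rule. The sketch below is a generic two-player Elo update (expected score via the logistic formula, rating adjusted by a K-factor), not V-MAGE's specific difficulty-aware variant; the function names and the K value of 32 are illustrative assumptions.

```python
def elo_expected(r_a: float, r_b: float) -> float:
    """Expected score of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Return updated ratings after one match.

    score_a is 1.0 for an A win, 0.5 for a draw, 0.0 for a loss;
    the two updates are symmetric, so total rating is conserved.
    """
    e_a = elo_expected(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b
```

In a benchmark setting, "players" can be model/task pairings, so a model's rating rises more for beating a high-rated (harder) opponent, which is how an Elo scheme can account for varying difficulty.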
Related papers
- VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models [121.03333569013148]
We introduce VisuLogic: a benchmark of 1,000 human-verified problems across six categories.
These types of questions can be evaluated to assess the visual reasoning capabilities of MLLMs from multiple perspectives.
Most models score below 30% accuracy, only slightly above the 25% random baseline and far below the 51.4% achieved by humans.
arXiv Detail & Related papers (2025-04-21T17:59:53Z) - VERIFY: A Benchmark of Visual Explanation and Reasoning for Investigating Multimodal Reasoning Fidelity [34.29409506366145]
VERIFY is a benchmark designed to isolate and rigorously evaluate the visual reasoning capabilities of state-of-the-art MLLMs. Each problem is accompanied by a human-annotated reasoning path, making it the first to provide in-depth evaluation of model decision-making processes. We propose novel metrics that assess visual reasoning fidelity beyond mere accuracy, highlighting critical imbalances in current model reasoning patterns.
arXiv Detail & Related papers (2025-03-14T16:26:11Z) - Integrating Frequency-Domain Representations with Low-Rank Adaptation in Vision-Language Models [0.6715525121432597]
This research presents a novel vision language model (VLM) framework to enhance feature extraction, scalability, and efficiency. We evaluate the proposed model on caption generation and Visual Question Answering (VQA) tasks using benchmark datasets with varying levels of Gaussian noise. Our model provides more detailed and contextually relevant responses, particularly for real-world images captured by a RealSense camera mounted on an Unmanned Ground Vehicle (UGV).
arXiv Detail & Related papers (2025-03-08T01:22:10Z) - Are Large Vision Language Models Good Game Players? [25.49713745405194]
Large Vision Language Models (LVLMs) have demonstrated remarkable abilities in understanding and reasoning about both visual and textual information. Existing evaluation methods for LVLMs, primarily based on benchmarks like Visual Question Answering, often fail to capture the full scope of LVLMs' capabilities. We propose a game-based evaluation framework designed to provide a comprehensive assessment of LVLMs' cognitive and reasoning skills in structured environments.
arXiv Detail & Related papers (2025-03-04T07:29:03Z) - VisFactor: Benchmarking Fundamental Visual Cognition in Multimodal Large Language Models [62.667142971664575]
We introduce VisFactor, a novel benchmark derived from the Factor-Referenced Cognitive Test (FRCT). VisFactor digitalizes vision-related FRCT subtests to systematically evaluate MLLMs across essential visual cognitive tasks. We present a comprehensive evaluation of state-of-the-art MLLMs, such as GPT-4o, Gemini-Pro, and Qwen-VL.
arXiv Detail & Related papers (2025-02-23T04:21:32Z) - iVISPAR -- An Interactive Visual-Spatial Reasoning Benchmark for VLMs [4.381263829108405]
Vision-Language Models (VLMs) are known to struggle with spatial reasoning and visual alignment. We introduce iVISPAR, an interactive multi-modal benchmark designed to evaluate the spatial reasoning capabilities of VLMs acting as agents.
arXiv Detail & Related papers (2025-02-05T14:29:01Z) - EmbodiedEval: Evaluate Multimodal LLMs as Embodied Agents [57.4686961979566]
EmbodiedEval is a comprehensive and interactive evaluation benchmark for MLLMs with embodied tasks. It covers a broad spectrum of existing embodied AI tasks with significantly enhanced diversity. We evaluated state-of-the-art MLLMs on EmbodiedEval and found that they fall significantly short of human-level performance on embodied tasks.
arXiv Detail & Related papers (2025-01-21T03:22:10Z) - Instruction-Guided Fusion of Multi-Layer Visual Features in Large Vision-Language Models [50.98559225639266]
We investigate the contributions of visual features from different encoder layers using 18 benchmarks spanning 6 task categories. Our findings reveal that multilayer features provide complementary strengths with varying task dependencies, and that uniform fusion leads to suboptimal performance. We propose the instruction-guided vision aggregator, a module that dynamically integrates multi-layer visual features based on textual instructions.
arXiv Detail & Related papers (2024-12-26T05:41:31Z) - MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs [61.56904387052982]
This paper proposes a new visual grounding task called multi-context visual grounding. It aims to localize instances of interest across multiple images based on open-ended text prompts. We benchmark over 20 state-of-the-art MLLMs and foundation models with potential multi-context visual grounding capabilities.
arXiv Detail & Related papers (2024-10-16T07:52:57Z) - MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models [71.36392373876505]
We introduce MMIE, a large-scale benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs). MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts. It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies.
arXiv Detail & Related papers (2024-10-14T04:15:00Z) - VHELM: A Holistic Evaluation of Vision Language Models [75.88987277686914]
We present the Holistic Evaluation of Vision Language Models (VHELM).
VHELM aggregates various datasets to cover one or more of the 9 aspects: visual perception, knowledge, reasoning, bias, fairness, multilinguality, robustness, toxicity, and safety.
Our framework is designed to be lightweight and automatic so that evaluation runs are cheap and fast.
arXiv Detail & Related papers (2024-10-09T17:46:34Z) - Losing Visual Needles in Image Haystacks: Vision Language Models are Easily Distracted in Short and Long Contexts [65.04791072532106]
We present LoCoVQA, a benchmark generator for evaluating long-context extractive reasoning in vision language models (VLMs).
LoCoVQA augments test examples for mathematical reasoning, VQA, and character recognition tasks with increasingly long visual contexts.
This test assesses how well VLMs can ignore irrelevant information when answering queries.
arXiv Detail & Related papers (2024-06-24T17:58:03Z) - Visualization Literacy of Multimodal Large Language Models: A Comparative Study [12.367399155606162]
Multimodal large language models (MLLMs) combine the inherent power of large language models (LLMs) with renewed capabilities to reason about the multimodal context.
Many recent works in visualization have demonstrated MLLMs' capability to understand and interpret visualization results and explain the content of the visualization to users in natural language.
In this work, we aim to fill the gap by utilizing the concept of visualization literacy to evaluate MLLMs.
arXiv Detail & Related papers (2024-06-24T17:52:16Z) - Towards Semantic Equivalence of Tokenization in Multimodal LLM [149.11720372278273]
Vision tokenization is essential for semantic alignment between vision and language. This paper proposes a novel dynamic Semantic-Equivalent Vision Tokenizer (SeTok). SeTok groups visual features into semantic units via a dynamic clustering algorithm. The resulting vision tokens effectively preserve semantic integrity and capture both low-frequency and high-frequency visual features.
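The idea of grouping visual features into semantic units via clustering can be illustrated with a plain k-means-style pass over patch features. This is a simplified stand-in, not SeTok itself: SeTok determines the number of clusters dynamically, whereas this sketch fixes k, and all names here are illustrative.

```python
import numpy as np

def group_tokens(features: np.ndarray, k: int, iters: int = 10, seed: int = 0) -> np.ndarray:
    """Cluster per-patch feature vectors into k groups; return one center per group.

    features: (num_patches, dim) array; returns (k, dim) pooled "semantic tokens".
    """
    rng = np.random.default_rng(seed)
    # Initialize centers from k distinct patch features.
    centers = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(iters):
        # Assign each feature vector to its nearest center.
        dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Recompute each center as the mean of its assigned features.
        for j in range(k):
            if (labels == j).any():
                centers[j] = features[labels == j].mean(axis=0)
    return centers
```

Each returned center stands in for one "semantic unit" token, so the token count tracks the content of the image rather than a fixed patch grid.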
arXiv Detail & Related papers (2024-06-07T17:55:43Z) - NPHardEval4V: A Dynamic Reasoning Benchmark of Multimodal Large Language Models [34.91372939329467]
We introduce a benchmark, NPHardEval4V, to evaluate the pure reasoning abilities of MLLMs.
Our findings reveal significant discrepancies in reasoning abilities across different models.
We also investigate the impact of different prompting styles, including visual, text, and combined visual and text prompts, on the reasoning abilities of MLLMs.
arXiv Detail & Related papers (2024-03-04T07:10:31Z) - Expanding Frozen Vision-Language Models without Retraining: Towards Improved Robot Perception [0.0]
Vision-language models (VLMs) have shown powerful capabilities in visual question answering and reasoning tasks.
In this paper, we demonstrate a method of aligning the embedding spaces of different modalities to the vision embedding space.
We show that using multiple modalities as input improves the VLM's scene understanding and enhances its overall performance in various tasks.
arXiv Detail & Related papers (2023-08-31T06:53:55Z) - GameEval: Evaluating LLMs on Conversational Games [93.40433639746331]
We propose GameEval, a novel approach to evaluating large language models (LLMs).
GameEval treats LLMs as game players and assigns them distinct roles with specific goals achieved by launching conversations of various forms.
We show that GameEval can effectively differentiate the capabilities of various LLMs, providing a comprehensive assessment of their integrated abilities to solve complex problems.
arXiv Detail & Related papers (2023-08-19T14:33:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.