VS-Bench: Evaluating VLMs for Strategic Abilities in Multi-Agent Environments
- URL: http://arxiv.org/abs/2506.02387v2
- Date: Tue, 30 Sep 2025 06:49:16 GMT
- Title: VS-Bench: Evaluating VLMs for Strategic Abilities in Multi-Agent Environments
- Authors: Zelai Xu, Zhexuan Xu, Xiangmin Yi, Huining Yuan, Mo Guang, Kaiwen Long, Xinlei Chen, Yi Wu, Chao Yu, Yu Wang
- Abstract summary: We introduce Visual Strategic Bench (VS-Bench), a benchmark that evaluates Vision Language Models for strategic abilities in multi-agent environments. The performance of VLM agents is evaluated across three dimensions: perception measured by element recognition accuracy; strategic reasoning measured by next-action prediction accuracy; and decision-making measured by normalized episode return.
- Score: 25.534332634912005
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advancements in Vision Language Models (VLMs) have expanded their capabilities to interactive agent tasks, yet existing benchmarks remain limited to single-agent or text-only environments. In contrast, real-world scenarios often involve multiple agents interacting within rich visual and textual contexts, posing challenges with both multimodal observations and strategic interactions. To bridge this gap, we introduce Visual Strategic Bench (VS-Bench), a multimodal benchmark that evaluates VLMs for strategic abilities in multi-agent environments. VS-Bench comprises ten vision-grounded environments that cover cooperative, competitive, and mixed-motive interactions. The performance of VLM agents is evaluated across three dimensions: perception measured by element recognition accuracy; strategic reasoning measured by next-action prediction accuracy; and decision-making measured by normalized episode return. Extensive experiments on fifteen leading VLMs show that, although current models exhibit strong perception abilities, there remains a significant gap to optimal performance in reasoning and decision-making, with the best-performing model attaining 46.6% prediction accuracy and 31.4% normalized return. We further analyze the key factors influencing performance, conduct human experiments, and examine failure modes to provide a deeper understanding of VLMs' strategic abilities. By standardizing the evaluation and highlighting the limitations of existing models, we envision VS-Bench as a foundation for future research on strategic multimodal agents. Code and data are available at https://vs-bench.github.io.
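The three evaluation dimensions map to simple per-environment scores. The sketch below is a hedged illustration of how element recognition accuracy, next-action prediction accuracy, and normalized episode return might be computed; none of the function names, signatures, or values are taken from the paper's released code.

```python
# A minimal sketch (assumption, not the paper's evaluation code) of how
# VS-Bench's three dimensions could be scored. All names are illustrative.

def recognition_accuracy(predicted, ground_truth):
    """Perception: fraction of environment elements identified correctly."""
    assert len(predicted) == len(ground_truth)
    return sum(p == g for p, g in zip(predicted, ground_truth)) / len(ground_truth)

def next_action_accuracy(predicted_actions, reference_actions):
    """Strategic reasoning: fraction of steps where the predicted next
    action matches the reference action."""
    assert len(predicted_actions) == len(reference_actions)
    return sum(p == r for p, r in zip(predicted_actions, reference_actions)) / len(reference_actions)

def normalized_return(episode_return, min_return, max_return):
    """Decision-making: episode return rescaled to [0, 1] between the worst
    and best returns attainable in the environment."""
    return (episode_return - min_return) / (max_return - min_return)

if __name__ == "__main__":
    # Illustrative values only.
    print(recognition_accuracy(["card_A", "card_B"], ["card_A", "card_C"]))  # 0.5
    print(next_action_accuracy([0, 1, 1, 2], [0, 1, 0, 2]))                  # 0.75
    print(normalized_return(12.4, -10.0, 50.0))                              # ~0.373
```

Reported numbers such as the 46.6% prediction accuracy and 31.4% normalized return would then be averages of these per-episode scores across environments.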
Related papers
- Agentic Jigsaw Interaction Learning for Enhancing Visual Perception and Reasoning in Vision-Language Models [63.69856480318313]
AGILE formulates jigsaw solving as an interactive process, enabling the model to progressively engage with the environment. We show that AGILE substantially boosts performance on jigsaw tasks of varying complexity. We also demonstrate strong generalization across 9 general vision tasks, achieving an average improvement of 3.1%.
arXiv Detail & Related papers (2025-10-01T17:58:05Z) - LLMsPark: A Benchmark for Evaluating Large Language Models in Strategic Gaming Contexts [19.97430860742638]
We present a game theory-based evaluation platform that measures large language models' decision-making strategies and social behaviors in classic game-theoretic settings. Our system cross-evaluates 15 leading LLMs using leaderboard rankings and scoring mechanisms. This work introduces a novel perspective for evaluating LLMs' strategic intelligence, enriching existing benchmarks and broadening their assessment in interactive, game-theoretic scenarios.
arXiv Detail & Related papers (2025-09-20T10:21:17Z) - Agent-X: Evaluating Deep Multimodal Reasoning in Vision-Centric Agentic Tasks [94.19506319646376]
We introduce Agent-X, a benchmark for evaluating vision-centric agents in real-world, multimodal settings. Agent-X features 828 agentic tasks with authentic visual contexts, including images, multi-image comparisons, videos, and instructional text. Our results reveal that even the best-performing models, including the GPT, Gemini, and Qwen families, struggle to solve multi-step vision tasks.
arXiv Detail & Related papers (2025-05-30T17:59:53Z) - Evaluating the Sensitivity of LLMs to Prior Context [2.377922603550519]
Large language models (LLMs) are increasingly deployed in multi-turn dialogue and other sustained interactive scenarios. We introduce a novel set of benchmarks that vary the volume and nature of prior context to measure sensitivity to contextual variations. Our findings reveal that LLM performance on multiple-choice questions can degrade dramatically in multi-turn interactions.
arXiv Detail & Related papers (2025-05-29T16:09:32Z) - Adversarial Testing in LLMs: Insights into Decision-Making Vulnerabilities [5.0778942095543576]
This paper introduces an adversarial evaluation framework designed to systematically stress-test the decision-making processes of Large Language Models. We apply this framework to several state-of-the-art LLMs, including GPT-3.5, GPT-4, Gemini-1.5, and DeepSeek-V3. Our findings highlight distinct behavioral patterns across models and emphasize the importance of adaptability and fairness recognition for trustworthy AI deployment.
arXiv Detail & Related papers (2025-05-19T14:50:44Z) - V-MAGE: A Game Evaluation Framework for Assessing Vision-Centric Capabilities in Multimodal Large Language Models [84.27290155010533]
We introduce Vision-centric Multiple Abilities Game Evaluation (V-MAGE), a novel game-based evaluation framework. V-MAGE features five distinct video games comprising over 30 carefully constructed evaluation scenarios. We show that V-MAGE provides actionable insights for improving the visual and reasoning capabilities of MLLMs in dynamic, interactive settings.
arXiv Detail & Related papers (2025-04-08T15:43:01Z) - EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents [63.43699771428243]
EmbodiedBench is an extensive benchmark designed to evaluate vision-driven embodied agents. We evaluated 24 leading proprietary and open-source MLLMs within EmbodiedBench. MLLMs excel at high-level tasks but struggle with low-level manipulation, with the best model, GPT-4o, scoring only 28.9% on average.
arXiv Detail & Related papers (2025-02-13T18:11:34Z) - iVISPAR -- An Interactive Visual-Spatial Reasoning Benchmark for VLMs [4.381263829108405]
Vision-Language Models (VLMs) are known to struggle with spatial reasoning and visual alignment. We introduce iVISPAR, an interactive multi-modal benchmark designed to evaluate the spatial reasoning capabilities of VLMs acting as agents.
arXiv Detail & Related papers (2025-02-05T14:29:01Z) - AVTrustBench: Assessing and Enhancing Reliability and Robustness in Audio-Visual LLMs [70.4578433679737]
We introduce the Audio-Visual Trustworthiness assessment Benchmark (AVTrustBench), comprising 600K samples spanning 9 meticulously crafted tasks. Using our benchmark, we extensively evaluate 13 state-of-the-art AVLLMs. The findings reveal that the majority of existing models fall significantly short of achieving human-like comprehension.
arXiv Detail & Related papers (2025-01-03T23:03:24Z) - MageBench: Bridging Large Multimodal Models to Agents [90.59091431806793]
LMMs have shown impressive visual understanding capabilities and have the potential to be applied in agents. Existing benchmarks mostly assess their reasoning abilities on the language side. MageBench is a reasoning-capability-oriented multimodal agent benchmark.
arXiv Detail & Related papers (2024-12-05T17:08:19Z) - BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games [44.16513620589459]
We introduce BALROG, a novel benchmark to assess the agentic capabilities of Large Language Models (LLMs) and Vision Language Models (VLMs). Our benchmark incorporates a range of existing reinforcement learning environments with varying levels of difficulty, from tasks that non-expert humans can solve in seconds to extremely challenging ones that may take years to master. Our findings indicate that while current models achieve partial success in the easier games, they struggle significantly with more challenging tasks.
arXiv Detail & Related papers (2024-11-20T18:54:32Z) - FB-Bench: A Fine-Grained Multi-Task Benchmark for Evaluating LLMs' Responsiveness to Human Feedback [33.532239489610056]
FB-Bench is a benchmark designed to evaluate Large Language Models' responsiveness to human feedback under real-world usage scenarios in Chinese. We extensively evaluate a broad array of popular LLMs, revealing significant variations in their performance across different interaction scenarios. Our findings underscore both the strengths and limitations of current models, providing valuable insights and directions for future research.
arXiv Detail & Related papers (2024-10-12T07:40:01Z) - MAgIC: Investigation of Large Language Model Powered Multi-Agent in Cognition, Adaptability, Rationality and Collaboration [98.18244218156492]
Large Language Models (LLMs) have significantly advanced natural language processing. As their applications expand into multi-agent environments, there arises a need for a comprehensive evaluation framework. This work introduces a novel competition-based benchmark framework to assess LLMs within multi-agent settings.
arXiv Detail & Related papers (2023-11-14T21:46:27Z) - Benchmarking Sequential Visual Input Reasoning and Prediction in Multimodal Large Language Models [21.438427686724932]
We introduce a novel benchmark that assesses the predictive reasoning capabilities of MLLMs across diverse scenarios.
Our benchmark targets three important domains: abstract pattern reasoning, human activity prediction, and physical interaction prediction.
Empirical experiments confirm the soundness of the proposed benchmark and evaluation methods.
arXiv Detail & Related papers (2023-10-20T13:14:38Z) - MMBench: Is Your Multi-modal Model an All-around Player? [114.45702807380415]
We propose MMBench, a benchmark for assessing the multi-modal capabilities of vision-language models.
MMBench is meticulously curated with well-designed quality control schemes.
MMBench incorporates multiple-choice questions in both English and Chinese versions.
arXiv Detail & Related papers (2023-07-12T16:23:09Z)