M3-Bench: Multi-Modal, Multi-Hop, Multi-Threaded Tool-Using MLLM Agent Benchmark
- URL: http://arxiv.org/abs/2511.17729v1
- Date: Fri, 21 Nov 2025 19:27:02 GMT
- Title: M3-Bench: Multi-Modal, Multi-Hop, Multi-Threaded Tool-Using MLLM Agent Benchmark
- Authors: Yang Zhou, Mingyu Zhao, Zhenting Wang, Difei Gu, Bangwei Guo, Ruosong Ye, Ligong Han, Can Jin, Dimitris N. Metaxas
- Abstract summary: M3-Bench is the first benchmark for evaluating multimodal tool use under the Model Context Protocol. We introduce a similarity-driven alignment that serializes each tool call, embeds signatures with a sentence encoder, and performs similarity-bucketed Hungarian matching. The benchmark spans 28 servers with 231 tools, and provides standardized trajectories curated through an Executor & Judge pipeline with human verification.
- Score: 45.755057449698825
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present M^3-Bench, the first benchmark for evaluating multimodal tool use under the Model Context Protocol. The benchmark targets realistic, multi-hop and multi-threaded workflows that require visual grounding and textual reasoning, cross-tool dependencies, and persistence of intermediate resources across steps. We introduce a similarity-driven alignment that serializes each tool call, embeds signatures with a sentence encoder, and performs similarity-bucketed Hungarian matching to obtain auditable one-to-one correspondences. On top of this alignment, we report interpretable metrics that decouple semantic fidelity from workflow consistency. The benchmark spans 28 servers with 231 tools and provides standardized trajectories curated through an Executor & Judge pipeline with human verification; an auxiliary judge ensemble of four large language models (LLMs) scores end-task Task Completion and information grounding. Evaluations of representative state-of-the-art multimodal LLMs (MLLMs) reveal persistent gaps in multimodal MCP tool use, particularly in argument fidelity and structure consistency, underscoring the need for methods that jointly reason over images, text, and tool graphs. The benchmark's anonymous repository is at https://github.com/EtaYang10th/Open-M3-Bench
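A minimal sketch of the similarity-driven alignment step described in the abstract, assuming a sentence-transformers encoder and SciPy's Hungarian solver; the serialization format, the `all-MiniLM-L6-v2` checkpoint, and the 0.5 similarity threshold are illustrative choices, not the paper's exact configuration.

```python
# Sketch: serialize tool calls, embed signatures, Hungarian-match by similarity.
import json
import numpy as np
from scipy.optimize import linear_sum_assignment
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder works

def serialize(call: dict) -> str:
    """Flatten a tool call into a canonical string signature."""
    return f"{call['tool']}({json.dumps(call.get('args', {}), sort_keys=True)})"

def align(pred_calls: list[dict], gold_calls: list[dict], threshold: float = 0.5):
    """One-to-one alignment of predicted vs. reference tool calls via Hungarian
    matching on cosine similarity; pairs below the threshold are dropped
    (a simple stand-in for the paper's similarity bucketing)."""
    p = encoder.encode([serialize(c) for c in pred_calls], normalize_embeddings=True)
    g = encoder.encode([serialize(c) for c in gold_calls], normalize_embeddings=True)
    sim = p @ g.T                                # cosine similarity matrix
    rows, cols = linear_sum_assignment(-sim)     # maximize total similarity
    return [(i, j, float(sim[i, j])) for i, j in zip(rows, cols) if sim[i, j] >= threshold]
```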
Related papers
- MuCo: Multi-turn Contrastive Learning for Multimodal Embedding Model [57.89395815934156]
Multi-Turn Contrastive Learning (MuCo) is a dialogue-inspired framework that recasts multimodal embedding learning as a multi-turn process. Experiments pair MuCo with a newly curated 5M-example multimodal multi-turn dataset (M3T).
arXiv Detail & Related papers (2026-02-06T05:18:33Z)
- MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers [5.463884405989425]
We introduce MCP-Atlas, a large-scale benchmark for evaluating tool-use competency. It includes 1,000 tasks designed to assess tool-use competency in realistic, multi-step orchestration. We score tasks using a claims-based rubric that awards partial credit based on the factual claims satisfied in the model's final answer.
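A hedged sketch of what claims-based partial credit can look like; the checker interface and toy rubric below are invented for illustration and may not match MCP-Atlas's actual rubric.

```python
# Hypothetical claims-based rubric: the final answer earns partial credit in
# proportion to the rubric claims it satisfies. `claim_is_satisfied` stands in
# for whatever checker (LLM judge, matcher) the benchmark actually uses.
def score_answer(final_answer: str, claims: list[str], claim_is_satisfied) -> float:
    if not claims:
        return 0.0
    satisfied = sum(claim_is_satisfied(final_answer, c) for c in claims)
    return satisfied / len(claims)

# Toy usage with a naive substring checker:
rubric = ["$5M", "example.com"]
score = score_answer("Revenue was $5M (source: example.com).", rubric,
                     lambda answer, claim: claim in answer)  # -> 1.0
```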
arXiv Detail & Related papers (2026-01-31T23:19:39Z)
- UNIDOC-BENCH: A Unified Benchmark for Document-Centric Multimodal RAG [82.84014669683863]
Multimodal retrieval-augmented generation (MM-RAG) is a key approach for applying large language models to real-world knowledge bases. UniDoc-Bench is the first large-scale, realistic benchmark for MM-RAG, built from 70k real-world PDF pages. Our experiments show that multimodal text-image fusion RAG systems consistently outperform both unimodal and jointly multimodal embedding-based retrieval.
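The fusion finding can be pictured as late fusion of per-modality retrieval scores; the sketch below assumes pre-computed, L2-normalized embeddings and an illustrative 0.5 mixing weight, not UniDoc-Bench's actual pipeline.

```python
# Late-fusion sketch: score each candidate page by a weighted mix of its
# text-embedding and image-embedding similarity to the query. The encoders
# and the alpha weight are assumptions for illustration only.
import numpy as np

def fused_scores(q_text: np.ndarray, q_img: np.ndarray,
                 doc_text: np.ndarray, doc_img: np.ndarray,
                 alpha: float = 0.5) -> np.ndarray:
    text_sim = doc_text @ q_text          # embeddings assumed L2-normalized
    img_sim = doc_img @ q_img
    return alpha * text_sim + (1.0 - alpha) * img_sim

# Rank candidates: order = np.argsort(-fused_scores(qt, qi, Dt, Di))
```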
arXiv Detail & Related papers (2025-10-04T04:30:13Z)
- MCPToolBench++: A Large Scale AI Agent Model Context Protocol MCP Tool Use Benchmark [6.470909719300937]
Model Context Protocol (MCP) provides a standardized way to supply context to AI Agents. Evaluation of LLMs' and AI Agents' MCP tool-use abilities suffers from several issues. We propose MCPToolBench++, a large-scale, multi-domain AI Agent tool use benchmark.
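For readers unfamiliar with the protocol: MCP transports requests over JSON-RPC 2.0, and invoking a tool is a `tools/call` request; the tool name and arguments below are made up for illustration.

```python
# Shape of a single MCP tool invocation (JSON-RPC 2.0). The tool and its
# arguments are hypothetical; real servers advertise tools via `tools/list`.
import json

request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "search_flights",  # hypothetical tool name
        "arguments": {"origin": "JFK", "dest": "SFO", "date": "2025-08-11"},
    },
}
print(json.dumps(request, indent=2))
```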
arXiv Detail & Related papers (2025-08-11T03:16:02Z)
- Visual Document Understanding and Question Answering: A Multi-Agent Collaboration Framework with Test-Time Scaling [83.78874399606379]
We propose MACT, a Multi-Agent Collaboration framework with Test-Time scaling. It comprises four distinct small-scale agents with clearly defined roles and effective collaboration. It shows superior performance at a smaller parameter scale without sacrificing performance on general and mathematical tasks.
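As a rough illustration of pairing role-specialized agents with test-time scaling, the sketch below samples several candidate answers, verifies them, and keeps the majority; the role names and the `llm` interface are assumptions, not MACT's actual design.

```python
# Illustrative only: a role-split pipeline plus a simple test-time-scaling
# step (sample n candidates, verify, majority-vote). Not MACT's algorithm.
from collections import Counter

def answer_question(llm, document: str, question: str, n_samples: int = 5) -> str:
    plan = llm("planner", f"Plan how to answer: {question}")
    evidence = llm("extractor", f"Following {plan}, extract evidence from:\n{document}")
    candidates = [llm("answerer", f"Q: {question}\nEvidence: {evidence}")
                  for _ in range(n_samples)]                   # scale test-time compute
    kept = [a for a in candidates
            if llm("verifier", f"Is '{a}' supported by: {evidence}?") == "yes"]
    return Counter(kept or candidates).most_common(1)[0][0]    # majority vote
```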
arXiv Detail & Related papers (2025-08-05T12:52:09Z)
- Rethinking Stateful Tool Use in Multi-Turn Dialogues: Benchmarks and Challenges [30.68589269821412]
Existing benchmarks that assess Language Models (LMs) as Language Agents (LAs) for tool use primarily focus on single-turn interactions. We propose DialogTool, a multi-turn dialogue dataset with stateful tool interactions covering the whole life cycle of tool use.
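The "stateful" part can be made concrete with a session object that persists tool outputs across turns so later calls can reference earlier results; the API below is invented for illustration and is not DialogTool's schema.

```python
# Invented sketch of stateful tool use across dialogue turns: each call's
# result is stored under a handle that later turns may pass by name.
class ToolSession:
    def __init__(self):
        self.state: dict[str, object] = {}   # resources from earlier turns
        self.history: list[dict] = []

    def call(self, tool, handle: str, **args):
        # Resolve any argument that names an earlier handle, then run the tool.
        resolved = {k: self.state.get(v, v) if isinstance(v, str) else v
                    for k, v in args.items()}
        result = tool(**resolved)
        self.state[handle] = result
        self.history.append({"tool": tool.__name__, "args": args, "out": handle})
        return result

# Turn 1 creates a resource; turn 2 consumes it by handle:
# session.call(create_doc, "doc1", title="notes")
# session.call(append_text, "doc2", doc="doc1", text="hello")
```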
arXiv Detail & Related papers (2025-05-19T16:36:13Z)
- Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences. We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, with a dedicated pipeline for generating interleaved queries. We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularities.
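One way to picture compressing visual tokens at multiple granularities is Matryoshka-style average pooling of the same token sequence to progressively coarser lengths; the pool sizes below are illustrative, and MME's actual architecture is more involved.

```python
# Toy multi-granularity compression: adaptively average-pool a visual token
# sequence to several target lengths. Pool sizes are illustrative only.
import torch
import torch.nn.functional as F

def multi_granularity(tokens: torch.Tensor, lengths=(64, 16, 4, 1)):
    """tokens: (seq_len, dim) -> {target_len: (target_len, dim)}."""
    x = tokens.T.unsqueeze(0)                          # (1, dim, seq_len)
    return {n: F.adaptive_avg_pool1d(x, n).squeeze(0).T for n in lengths}

views = multi_granularity(torch.randn(256, 768))       # e.g. ViT patch tokens
print({n: v.shape for n, v in views.items()})          # coarse-to-fine views
```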
arXiv Detail & Related papers (2025-02-18T12:00:47Z)
- MIBench: Evaluating Multimodal Large Language Models over Multiple Images [70.44423964171088]
We propose a new benchmark MIBench, to comprehensively evaluate fine-grained abilities of MLLMs in multi-image scenarios.
Specifically, MIBench categorizes the multi-image abilities into three scenarios: multi-image instruction (MII), multimodal knowledge-seeking (MKS), and multimodal in-context learning (MIC).
The results reveal that although current models excel in single-image tasks, they exhibit significant shortcomings when faced with multi-image inputs.
arXiv Detail & Related papers (2024-07-21T21:22:58Z)