MC-Search: Evaluating and Enhancing Multimodal Agentic Search with Structured Long Reasoning Chains
- URL: http://arxiv.org/abs/2603.00873v1
- Date: Sun, 01 Mar 2026 02:25:57 GMT
- Title: MC-Search: Evaluating and Enhancing Multimodal Agentic Search with Structured Long Reasoning Chains
- Authors: Xuying Ning, Dongqi Fu, Tianxin Wei, Mengting Ai, Jiaru Zou, Ting-Wei Li, Hanghang Tong, Yada Zhu, Hendrik Hamann, Jingrui He
- Abstract summary: We present MC-Search, the first benchmark for agentic MM-RAG with long, step-wise annotated reasoning chains spanning five representative reasoning structures. Beyond answer accuracy, MC-Search introduces new process-level metrics for reasoning quality, stepwise retrieval accuracy, and planning accuracy. By developing a unified agentic MM-RAG pipeline, we benchmark six leading MLLMs and reveal systematic issues such as over- and under-retrieval and modality-misaligned planning.
- Score: 79.14584837105808
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the increasing demand for step-wise, cross-modal, and knowledge-grounded reasoning, multimodal large language models (MLLMs) are evolving beyond the traditional fixed retrieve-then-generate paradigm toward more sophisticated agentic multimodal retrieval-augmented generation (MM-RAG). Existing benchmarks, however, mainly focus on simplified QA with short retrieval chains, leaving adaptive planning and multimodal reasoning underexplored. We present MC-Search, the first benchmark for agentic MM-RAG with long, step-wise annotated reasoning chains spanning five representative reasoning structures. Each example specifies sub-questions, retrieval modalities, supporting facts, and intermediate answers, with fidelity ensured by HAVE (Hop-wise Attribution and Verification of Evidence), resulting in 3,333 high-quality examples averaging 3.7 hops. Beyond answer accuracy, MC-Search introduces new process-level metrics for reasoning quality, stepwise retrieval accuracy, and planning accuracy. By developing a unified agentic MM-RAG pipeline, we benchmark six leading MLLMs and reveal systematic issues such as over- and under-retrieval and modality-misaligned planning. Finally, we introduce Search-Align, a process-supervised fine-tuning framework leveraging verified reasoning chains, showing that our data not only enables faithful evaluation but also improves planning and retrieval fidelity in open-source MLLMs.
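To make the annotation format concrete, here is a minimal sketch of how an MC-Search example and its process-level scoring could be represented, assuming a simple per-hop schema; the field names, classes, and exact metric definitions below are illustrative guesses, not the benchmark's released format.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical per-hop annotation, mirroring the fields the abstract lists:
# sub-questions, retrieval modalities, supporting facts, intermediate answers.
# All names here are assumptions for illustration.
@dataclass
class Hop:
    sub_question: str            # decomposed question for this reasoning step
    retrieval_modality: str      # e.g. "text" or "image"
    supporting_facts: List[str]  # evidence this step should be grounded in
    intermediate_answer: str     # gold answer for this step

@dataclass
class MCSearchExample:
    question: str
    final_answer: str
    hops: List[Hop] = field(default_factory=list)  # averages 3.7 per example

def stepwise_retrieval_accuracy(gold: List[Hop], pred_modalities: List[str]) -> float:
    """Fraction of hops where the predicted retrieval modality matches the
    annotation -- a simplified stand-in for the paper's stepwise metric."""
    if not gold:
        return 0.0
    n = min(len(gold), len(pred_modalities))
    hits = sum(gold[i].retrieval_modality == pred_modalities[i] for i in range(n))
    return hits / len(gold)

def planning_accuracy(gold: List[Hop], pred_sub_questions: List[str]) -> float:
    """Fraction of planned sub-questions matching the annotated chain
    (exact match here; a real scorer would likely use softer matching)."""
    if not gold:
        return 0.0
    n = min(len(gold), len(pred_sub_questions))
    hits = sum(gold[i].sub_question == pred_sub_questions[i] for i in range(n))
    return hits / len(gold)
```

Under this reading, over-retrieval would show up as more predicted steps than annotated hops and under-retrieval as fewer, which is one plausible way the pipeline-level findings above could be quantified.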
Related papers
- MAS-ProVe: Understanding the Process Verification of Multi-Agent Systems [59.20800753428596]
We present MAS-ProVe, a systematic empirical study of process verification for multi-agent systems (MAS). Our study spans three verification paradigms: LLM-as-a-Judge, reward models, and process reward models. We find that process-level verification does not consistently improve performance and frequently exhibits high variance.
arXiv Detail & Related papers (2026-02-03T03:30:36Z)
- Beyond Monolithic Architectures: A Multi-Agent Search and Knowledge Optimization Framework for Agentic Search [56.78490647843876]
Agentic search has emerged as a promising paradigm for complex information seeking by enabling Large Language Models (LLMs) to interleave reasoning with tool use. We propose M-ASK, a framework that explicitly decouples agentic search into two complementary roles: Search Behavior Agents, which plan and execute search actions, and Knowledge Management Agents, which aggregate, filter, and maintain a compact internal context.
arXiv Detail & Related papers (2026-01-08T08:13:27Z)
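The two-role split described in the entry above can be pictured with a minimal control loop; this is a sketch with invented class and method names (`SearchBehaviorAgent`, `KnowledgeManagementAgent`, `next_action`), not M-ASK's actual API.

```python
from typing import Callable, List, Optional

# Minimal sketch of a two-role agentic-search loop in the spirit of the
# decoupling M-ASK's summary describes; every name is a hypothetical
# placeholder, not the framework's interface.
class SearchBehaviorAgent:
    def next_action(self, question: str, context: List[str]) -> Optional[str]:
        # Plan the next search query, or None once the context suffices.
        return None if context else f"search: {question}"

class KnowledgeManagementAgent:
    def update(self, context: List[str], results: List[str]) -> List[str]:
        # Aggregate and filter new results, keeping the context compact.
        merged = context + [r for r in results if r not in context]
        return merged[-10:]  # cap the internal context size

def run_search(question: str, search_tool: Callable[[str], List[str]]) -> List[str]:
    searcher, curator = SearchBehaviorAgent(), KnowledgeManagementAgent()
    context: List[str] = []
    while (action := searcher.next_action(question, context)) is not None:
        context = curator.update(context, search_tool(action))
    return context
```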
- MMSearch-Plus: Benchmarking Provenance-Aware Search for Multimodal Browsing Agents [44.63565009665076]
We introduce MMSearch-Plus, a 311-task benchmark that enforces multimodal understanding. We provide a model-agnostic agent framework with standard browsing tools and a set-of-mark (SoM) module. SoM enables provenance-aware zoom-and-retrieve and improves robustness in multi-step reasoning.
arXiv Detail & Related papers (2025-08-29T09:58:27Z)
- MA-RAG: Multi-Agent Retrieval-Augmented Generation via Collaborative Chain-of-Thought Reasoning [36.3918410061572]
MA-RAG addresses the inherent ambiguities and reasoning challenges in complex information-seeking tasks. Unlike conventional RAG methods that rely on end-to-end fine-tuning or isolated component enhancements, MA-RAG orchestrates a collaborative set of specialized AI agents. Our results highlight the effectiveness of collaborative, modular reasoning in retrieval-augmented systems.
arXiv Detail & Related papers (2025-05-26T15:05:18Z)
- REAL-MM-RAG: A Real-World Multi-Modal Retrieval Benchmark [16.55516587540082]
We introduce REAL-MM-RAG, an automatically generated benchmark designed to address four key properties essential for real-world retrieval. We propose a multi-difficulty-level scheme based on query rephrasing to evaluate models' semantic understanding beyond keyword matching. Our benchmark reveals significant model weaknesses, particularly in handling table-heavy documents and robustness to query rephrasing.
arXiv Detail & Related papers (2025-02-17T22:10:47Z)
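A rephrasing-based difficulty scheme like the one summarized above could be scored roughly as follows; the per-level grouping, the `retrieve` callable, and Recall@k as the measure are assumptions for illustration, not REAL-MM-RAG's definitions.

```python
from typing import Callable, Dict, List, Tuple

def robustness_by_level(
    queries_by_level: Dict[str, List[Tuple[str, str]]],  # level -> [(query, gold_doc_id)]
    retrieve: Callable[[str], List[str]],  # assumed interface: query -> ranked doc ids
    k: int = 5,
) -> Dict[str, float]:
    """Recall@k per rephrasing difficulty level. A sharp drop at harder
    levels would suggest the retriever relies on keyword overlap rather
    than semantic understanding."""
    scores: Dict[str, float] = {}
    for level, pairs in queries_by_level.items():
        hits = sum(gold in retrieve(query)[:k] for query, gold in pairs)
        scores[level] = hits / len(pairs) if pairs else 0.0
    return scores
```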
- MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency [63.23935582919081]
Chain-of-Thought (CoT) has significantly enhanced the reasoning capabilities of Large Language Models (LLMs). We introduce MME-CoT, a specialized benchmark evaluating the CoT reasoning performance of LMMs. We conduct an in-depth analysis of state-of-the-art LMMs, uncovering several key insights.
arXiv Detail & Related papers (2025-02-13T18:59:46Z)
- Progressive Multimodal Reasoning via Active Retrieval [64.74746997923967]
Multi-step multimodal reasoning tasks pose significant challenges for multimodal large language models (MLLMs). We propose AR-MCTS, a universal framework designed to progressively improve the reasoning capabilities of MLLMs. We show that AR-MCTS can optimize sampling diversity and accuracy, yielding reliable multimodal reasoning.
arXiv Detail & Related papers (2024-12-19T13:25:39Z)
- Self-prompted Chain-of-Thought on Large Language Models for Open-domain Multi-hop Reasoning [70.74928578278957]
In open-domain question answering (ODQA), most existing questions require single-hop reasoning over commonsense knowledge. Large language models (LLMs) have found significant utility in facilitating ODQA without an external corpus. We propose Self-prompted Chain-of-Thought (SP-CoT), an automated framework to mass-produce high-quality CoTs.
arXiv Detail & Related papers (2023-10-20T14:51:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.