MM-DeepResearch: A Simple and Effective Multimodal Agentic Search Baseline
- URL: http://arxiv.org/abs/2603.01050v1
- Date: Sun, 01 Mar 2026 11:13:22 GMT
- Title: MM-DeepResearch: A Simple and Effective Multimodal Agentic Search Baseline
- Authors: Huanjin Yao, Qixiang Yin, Min Yang, Ziwang Zhao, Yibo Wang, Haotian Luo, Jingyi Zhang, Jiaxing Huang, et al.
- Abstract summary: We aim to develop a multimodal research agent capable of explicit reasoning and planning, multi-tool invocation, and cross-modal information synthesis. We observe three main challenges in developing such agents: (1) scarcity of search-intensive multimodal QA data, (2) lack of effective search trajectories, and (3) prohibitive cost of training with online search APIs. With the three designs, we develop MM-DeepResearch, a powerful multimodal deep research agent, and extensive results show its superiority across benchmarks.
- Score: 26.19213349415094
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We aim to develop a multimodal research agent capable of explicit reasoning and planning, multi-tool invocation, and cross-modal information synthesis, enabling it to conduct deep research tasks. However, we observe three main challenges in developing such agents: (1) scarcity of search-intensive multimodal QA data, (2) lack of effective search trajectories, and (3) prohibitive cost of training with online search APIs. To tackle them, we first propose Hyper-Search, a hypergraph-based QA generation method that models and connects visual and textual nodes within and across modalities, enabling the generation of search-intensive multimodal QA pairs that require invoking various search tools to solve. Second, we introduce DR-TTS, which first decomposes search-involved tasks into several categories according to search tool types and optimizes a specialized search-tool expert for each tool. It then recomposes the tool experts to jointly explore search trajectories via tree search, producing trajectories that successfully solve complex tasks using various search tools. Third, we build an offline search engine supporting multiple search tools, enabling agentic reinforcement learning without costly online search APIs. With these three designs, we develop MM-DeepResearch, a powerful multimodal deep research agent, and extensive results show its superiority across benchmarks. Code is available at https://github.com/HJYao00/MM-DeepResearch
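The third design, an offline search engine that stands in for online search APIs during agentic RL, is the most directly reproducible idea in the abstract. Below is a minimal sketch of that pattern, assuming a cached corpus of text snippets and image captions served by a simple TF-IDF-style scorer; the tool names (`text_search`, `image_search`), corpus, and scoring function are illustrative assumptions, not the authors' implementation (see the linked repository for that).

```python
# Hypothetical sketch: an offline multi-tool search engine that replaces
# live web-search APIs during RL rollouts. Corpus, tool names, and scoring
# are illustrative assumptions, not the paper's actual implementation.
import math
from collections import Counter

class OfflineSearchEngine:
    """Serves cached (text snippet / image caption) documents so that
    agent tool calls resolve locally at zero per-query cost."""

    def __init__(self, text_docs, image_docs):
        # text_docs / image_docs: {doc_id: snippet or caption}
        self.indexes = {
            "text_search": self._build_index(text_docs),
            "image_search": self._build_index(image_docs),
        }

    @staticmethod
    def _build_index(docs):
        # Tokenize each document and precompute document frequencies.
        tokenized = {d: doc.lower().split() for d, doc in docs.items()}
        df = Counter(t for toks in tokenized.values() for t in set(toks))
        return {"docs": docs, "tokens": tokenized, "df": df, "n": len(docs)}

    def search(self, tool, query, k=3):
        # TF-IDF-style scoring; a stand-in for whatever retriever an
        # offline engine would actually use.
        idx = self.indexes[tool]
        q_terms = query.lower().split()
        scores = {}
        for doc_id, toks in idx["tokens"].items():
            tf = Counter(toks)
            score = sum(
                tf[t] * math.log(1 + idx["n"] / (1 + idx["df"].get(t, 0)))
                for t in q_terms
            )
            if score > 0:
                scores[doc_id] = score
        top = sorted(scores, key=scores.get, reverse=True)[:k]
        return [(d, idx["docs"][d]) for d in top]

# During an RL rollout, the agent's tool call routes to the cache
# instead of an online API:
engine = OfflineSearchEngine(
    text_docs={"t1": "Eiffel Tower height is 330 meters",
               "t2": "Paris population statistics"},
    image_docs={"i1": "photo of the Eiffel Tower at night"},
)
print(engine.search("text_search", "Eiffel Tower height"))
```

Because every tool call resolves against a local index, rollouts can issue unlimited queries at no marginal cost, which is the property the abstract credits for making agentic reinforcement learning affordable.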
Related papers
- VSearcher: Long-Horizon Multimodal Search Agent via Reinforcement Learning [22.27364585438247]
VSearcher is a multimodal search agent capable of long-horizon, multi-turn tool use in real-world web environments. We introduce an Iterative Injection Data Synthesis pipeline to generate large-scale, complex multimodal QA pairs. We then adopt an SFT-then-RL training pipeline to turn base multimodal models into agents capable of multi-turn tool calling in real-world web environments.
arXiv Detail & Related papers (2026-03-03T09:33:22Z) - DeepResearch-9K: A Challenging Benchmark Dataset of Deep-Research Agent [63.52637950356965]
DeepResearch-9K is a large-scale, challenging dataset for deep-research scenarios. DeepResearch-R1 is an open-source training framework for deep-research agents.
arXiv Detail & Related papers (2026-03-01T15:36:10Z) - Revisiting Text Ranking in Deep Research [24.324221566628125]
Black-box web search APIs hinder systematic analysis of search components. We reproduce a selection of key findings and best practices for IR text ranking methods in the deep research setting.
arXiv Detail & Related papers (2026-02-25T00:18:07Z) - DeepMMSearch-R1: Empowering Multimodal LLMs in Multimodal Web Search [61.77858432092777]
We present DeepMMSearch-R1, the first multimodal large language model capable of performing on-demand, multi-turn web searches. DeepMMSearch-R1 can initiate web searches based on relevant crops of the input image, making image search more effective. We conduct extensive experiments across a range of knowledge-intensive benchmarks to demonstrate the superiority of our approach.
arXiv Detail & Related papers (2025-10-14T17:59:58Z) - Fathom-DeepResearch: Unlocking Long Horizon Information Retrieval and Synthesis for SLMs [7.3517692707289415]
We introduce Fathom-DeepResearch, an agentic system composed of two specialized models. The first is Fathom-Search-4B, a DeepSearch model optimized for evidence-based investigation through live web search and targeted webpage querying. The second is Fathom-Synthesizer-4B, trained from Qwen3-4B, which converts multi-turn DeepSearch traces into structured, citation-dense DeepResearch Reports.
arXiv Detail & Related papers (2025-09-28T22:58:11Z) - HierSearch: A Hierarchical Enterprise Deep Search Framework Integrating Local and Web Searches [54.65565885083031]
We propose a hierarchical agentic deep search framework, HierSearch, trained with hierarchical RL. At the low level, a local deep search agent and a Web deep search agent are trained to retrieve evidence from their corresponding domains. At the high level, a planner agent coordinates the low-level agents and provides the final answer.
arXiv Detail & Related papers (2025-08-11T15:31:47Z) - MMSearch-R1: Incentivizing LMMs to Search [49.889749277236376]
We present MMSearch-R1, the first end-to-end reinforcement learning framework that enables on-demand, multi-turn search in real-world Internet environments. Our framework integrates both image and text search tools, allowing the model to reason about when and how to invoke them, guided by an outcome-based reward with a search penalty.
arXiv Detail & Related papers (2025-06-25T17:59:42Z) - ManuSearch: Democratizing Deep Search in Large Language Models with a Transparent and Open Multi-Agent Framework [73.91207117772291]
ManuSearch is a transparent and modular multi-agent framework designed to democratize deep search for large language models (LLMs). ManuSearch decomposes the search and reasoning process into three collaborative agents: (1) a solution planning agent that iteratively formulates sub-queries, (2) an Internet search agent that retrieves relevant documents via real-time web search, and (3) a structured webpage reading agent that extracts key evidence from raw web content.
arXiv Detail & Related papers (2025-05-23T17:02:02Z) - MindSearch: Mimicking Human Minds Elicits Deep AI Searcher [50.68599514830046]
We introduce MindSearch to mimic the human mind in web information seeking and integration. The framework can be instantiated by a simple yet effective LLM-based multi-agent framework. MindSearch demonstrates significant improvements in response quality in terms of depth and breadth.
arXiv Detail & Related papers (2024-07-29T17:12:40Z)