DeepMMSearch-R1: Empowering Multimodal LLMs in Multimodal Web Search
- URL: http://arxiv.org/abs/2510.12801v1
- Date: Tue, 14 Oct 2025 17:59:58 GMT
- Title: DeepMMSearch-R1: Empowering Multimodal LLMs in Multimodal Web Search
- Authors: Kartik Narayan, Yang Xu, Tian Cao, Kavya Nerella, Vishal M. Patel, Navid Shiee, Peter Grasch, Chao Jia, Yinfei Yang, Zhe Gan
- Abstract summary: We present DeepMMSearch-R1, the first multimodal large language model capable of performing on-demand, multi-turn web searches. DeepMMSearch-R1 can initiate web searches based on relevant crops of the input image, making image search more effective. We conduct extensive experiments across a range of knowledge-intensive benchmarks to demonstrate the superiority of our approach.
- Score: 61.77858432092777
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal Large Language Models (MLLMs) in real-world applications require access to external knowledge sources and must remain responsive to dynamic, ever-changing real-world information in order to address information-seeking and knowledge-intensive user queries. Existing approaches, such as retrieval-augmented generation (RAG) methods, search agents, and search-equipped MLLMs, often suffer from rigid pipelines, excessive search calls, and poorly constructed search queries, which result in inefficiencies and suboptimal outcomes. To address these limitations, we present DeepMMSearch-R1, the first multimodal LLM capable of performing on-demand, multi-turn web searches and dynamically crafting queries for both image and text search tools. Specifically, DeepMMSearch-R1 can initiate web searches based on relevant crops of the input image, making image search more effective, and can iteratively adapt text search queries based on retrieved information, thereby enabling self-reflection and self-correction. Our approach relies on a two-stage training pipeline: a cold-start supervised fine-tuning phase followed by online reinforcement learning optimization. For training, we introduce DeepMMSearchVQA, a novel multimodal VQA dataset created through an automated pipeline and intermixed with real-world information from web search tools. This dataset contains diverse, multi-hop queries that integrate textual and visual information, teaching the model when to search, what to search for, which search tool to use, and how to reason over the retrieved information. We conduct extensive experiments across a range of knowledge-intensive benchmarks to demonstrate the superiority of our approach. Finally, we analyze the results and provide insights that are valuable for advancing multimodal web search.
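The on-demand, multi-turn loop described in the abstract can be pictured roughly as follows. This is a minimal sketch, not the authors' implementation: the tool names (`image_search`, `text_search`), the action schema, the `mllm.step` interface, and the turn budget are all assumptions made for illustration.

```python
# Minimal sketch of an on-demand, multi-turn multimodal search loop in the
# spirit of DeepMMSearch-R1. All interfaces here (`mllm.step`, the tool
# functions, the action schema) are hypothetical placeholders, not the
# paper's actual API.

MAX_TURNS = 5  # assumed turn budget

def crop_image(image, bbox):
    """Crop a PIL image to a (left, top, right, bottom) box."""
    return image.crop(bbox)

def answer_with_search(mllm, image, question, tools):
    """Run the model until it answers or the turn budget is exhausted."""
    context = [{"image": image, "question": question}]
    for _ in range(MAX_TURNS):
        # The model decides each turn: answer now, search by image crop,
        # or search by text.
        action = mllm.step(context)
        if action["type"] == "answer":
            return action["text"]
        elif action["type"] == "image_search":
            # Search with a relevant crop of the input image rather than the
            # full image, which the paper argues makes image search more
            # effective.
            crop = crop_image(image, action["bbox"])
            results = tools["image_search"](crop)
        else:  # "text_search"
            # Text queries can be rewritten across turns based on what was
            # retrieved so far, enabling self-reflection and self-correction.
            results = tools["text_search"](action["query"])
        context.append({"tool": action["type"], "results": results})
    # Fall back to answering from whatever has been retrieved.
    return mllm.step(context, force_answer=True)["text"]
```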
Related papers
- VSearcher: Long-Horizon Multimodal Search Agent via Reinforcement Learning [22.27364585438247]
VSearcher is a multimodal search agent capable of long-horizon, multi-turn tool use in real-world web environments. We introduce an Iterative Injection Data Synthesis pipeline to generate large-scale, complex multimodal QA questions. We then adopt an SFT-then-RL training pipeline to turn base multimodal models into agents capable of multi-turn tool calling in real-world web environments.
arXiv Detail & Related papers (2026-03-03T09:33:22Z)
- MM-DeepResearch: A Simple and Effective Multimodal Agentic Search Baseline [26.19213349415094]
We aim to develop a multimodal research agent capable of explicit reasoning and planning, multi-tool invocation, and cross-modal information synthesis. We observe three main challenges in developing such agents: (1) scarcity of search-intensive multimodal QA data, (2) lack of effective search trajectories, and (3) the prohibitive cost of training with online search APIs. With three corresponding designs addressing these challenges, we develop MM-DeepResearch, a powerful multimodal deep research agent, and extensive results show its superiority across benchmarks.
arXiv Detail & Related papers (2026-03-01T11:13:22Z)
- MMSearch-R1: Incentivizing LMMs to Search [49.889749277236376]
We present MMSearch-R1, the first end-to-end reinforcement learning framework that enables on-demand, multi-turn search in real-world Internet environments. Our framework integrates both image and text search tools, allowing the model to reason about when and how to invoke them, guided by an outcome-based reward with a search penalty (see the reward sketch after this entry).
arXiv Detail & Related papers (2025-06-25T17:59:42Z)
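The outcome-based reward with a search penalty mentioned above could be sketched as below. The penalty coefficient and the exact-match correctness check are illustrative assumptions, not MMSearch-R1's exact reward formulation.

```python
# Minimal sketch of an outcome-based reward with a search penalty, in the
# spirit of MMSearch-R1. The 0.9 discount and the exact-match check are
# assumptions for illustration, not the paper's exact design.

SEARCH_PENALTY = 0.9  # assumed multiplicative discount when search was used

def outcome_reward(prediction: str, gold: str, used_search: bool) -> float:
    """Reward correct answers; mildly discount ones that needed a search,
    nudging the model to search only on demand."""
    correct = prediction.strip().lower() == gold.strip().lower()
    reward = 1.0 if correct else 0.0
    if used_search:
        reward *= SEARCH_PENALTY
    return reward
```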
- ZeroSearch: Incentivize the Search Capability of LLMs without Searching [69.55482019211597]
We introduce ZeroSearch, a framework that incentivizes the search capability of large language models without using a real search engine, by simulating searches during training. Our approach begins with lightweight supervised fine-tuning to transform an LLM into a retrieval module capable of generating both useful and noisy documents (a simulator sketch follows this entry).
arXiv Detail & Related papers (2025-05-07T17:30:22Z)
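The simulated-search idea reads roughly like the sketch below: a fine-tuned LLM stands in for the search engine and returns either useful or deliberately noisy documents, with the noise ratio as a curriculum knob. The `sim_llm.generate` interface and the linear noise schedule are assumptions made for illustration.

```python
# Minimal sketch of ZeroSearch-style simulated retrieval: an LLM acts as the
# search engine during RL training, returning useful or noisy documents.
# `sim_llm.generate` and the linear noise curriculum are illustrative
# assumptions, not the paper's exact design.
import random

def simulated_search(sim_llm, query: str, noise_prob: float, k: int = 3):
    """Return k documents for `query`; each is useful or noisy at random."""
    docs = []
    for _ in range(k):
        style = "noisy" if random.random() < noise_prob else "useful"
        # The simulator LLM was fine-tuned to produce both relevant and
        # distracting documents on demand.
        docs.append(sim_llm.generate(f"Write a {style} document for: {query}"))
    return docs

def noise_schedule(step: int, total_steps: int,
                   start: float = 0.1, end: float = 0.5) -> float:
    """Assumed linear curriculum: retrieval gets noisier as training proceeds."""
    return start + (end - start) * step / total_steps
```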
- Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent [92.5712549836791]
Multimodal Retrieval Augmented Generation (mRAG) plays an important role in mitigating the "hallucination" issue inherent in multimodal large language models (MLLMs). We propose the first self-adaptive planning agent for multimodal retrieval, OmniSearch.
arXiv Detail & Related papers (2024-11-05T09:27:21Z)
- MM-Embed: Universal Multimodal Retrieval with Multimodal LLMs [78.5013630951288]
This paper introduces techniques for advancing information retrieval with multimodal large language models (MLLMs). We first study fine-tuning an MLLM as a bi-encoder retriever on 10 datasets with 16 retrieval tasks. Our model, MM-Embed, achieves state-of-the-art performance on the multimodal retrieval benchmark M-BEIR (a bi-encoder scoring sketch follows this entry).
arXiv Detail & Related papers (2024-11-04T20:06:34Z)
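A bi-encoder retriever of the kind fine-tuned here scores queries against candidates by embedding similarity, roughly as sketched below. The `encoder.embed` interface is a placeholder, not MM-Embed's actual API.

```python
# Minimal sketch of bi-encoder retrieval scoring for an MM-Embed-style
# retriever: embed query and candidates independently, rank by cosine
# similarity. `encoder.embed` is a hypothetical interface.
import numpy as np

def retrieve(encoder, query, candidates, top_k: int = 5):
    """Rank `candidates` for a (possibly multimodal) `query`."""
    q = np.asarray(encoder.embed(query), dtype=np.float32)
    c = np.asarray([encoder.embed(x) for x in candidates], dtype=np.float32)
    # Cosine similarity: normalize both sides, then take dot products.
    q = q / np.linalg.norm(q)
    c = c / np.linalg.norm(c, axis=1, keepdims=True)
    scores = c @ q
    order = np.argsort(-scores)[:top_k]
    return [(candidates[i], float(scores[i])) for i in order]
```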
- MMSearch: Benchmarking the Potential of Large Models as Multi-modal Search Engines [95.48317207225378]
Large Multimodal Models (LMMs) have made impressive strides, but whether they can function as AI search engines remains under-explored. We first design a delicate pipeline, MMSearch-Engine, to empower any LMM with multimodal search capabilities.
arXiv Detail & Related papers (2024-09-19T17:59:45Z)