Fugu-MT Paper Translation (Summary): DeepMMSearch-R1: Empowering Multimodal LLMs in Multimodal Web Search

Paper summary: DeepMMSearch-R1: Empowering Multimodal LLMs in Multimodal Web Search

arxiv url: http://arxiv.org/abs/2510.12801v1
Date: Tue, 14 Oct 2025 17:59:58 GMT
Status: Translation complete
System update date: 2025-10-15 19:02:32.447462
Title: DeepMMSearch-R1: Empowering Multimodal LLMs in Multimodal Web Search
Title (reference translation): DeepMMSearch-R1: Leveraging Multimodal LLMs in Multimodal Web Search
Authors: Kartik Narayan, Yang Xu, Tian Cao, Kavya Nerella, Vishal M. Patel, Navid Shiee, Peter Grasch, Chao Jia, Yinfei Yang, Zhe Gan
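Abstract summary: DeepMMSearch-R1 is the first multimodal large language model capable of on-demand, multi-turn web search. To make image search more effective, DeepMMSearch-R1 can initiate web searches based on relevant crops of the input image. We conduct extensive experiments across a range of knowledge-intensive benchmarks to demonstrate the superiority of our approach.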
Reference score (independently computed attention score): 61.77858432092777
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal Large Language Models (MLLMs) in real-world applications require access to external knowledge sources and must remain responsive to the dynamic and ever-changing real-world information in order to address information-seeking and knowledge-intensive user queries. Existing approaches, such as retrieval augmented generation (RAG) methods, search agents, and search equipped MLLMs, often suffer from rigid pipelines, excessive search calls, and poorly constructed search queries, which result in inefficiencies and suboptimal outcomes. To address these limitations, we present DeepMMSearch-R1, the first multimodal LLM capable of performing on-demand, multi-turn web searches and dynamically crafting queries for both image and text search tools. Specifically, DeepMMSearch-R1 can initiate web searches based on relevant crops of the input image making the image search more effective, and can iteratively adapt text search queries based on retrieved information, thereby enabling self-reflection and self-correction. Our approach relies on a two-stage training pipeline: a cold start supervised finetuning phase followed by an online reinforcement learning optimization. For training, we introduce DeepMMSearchVQA, a novel multimodal VQA dataset created through an automated pipeline intermixed with real-world information from web search tools. This dataset contains diverse, multi-hop queries that integrate textual and visual information, teaching the model when to search, what to search for, which search tool to use and how to reason over the retrieved information. We conduct extensive experiments across a range of knowledge-intensive benchmarks to demonstrate the superiority of our approach. Finally, we analyze the results and provide insights that are valuable for advancing multimodal web-search.
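Abstract (reference translation): Multimodal large language models (MLLMs) in real-world applications need access to external knowledge sources and must stay responsive to dynamic, constantly changing real-world information in order to handle information-seeking and knowledge-intensive user queries. Existing approaches such as retrieval-augmented generation (RAG) methods, search agents, and search-equipped MLLMs often suffer from rigid pipelines, excessive search calls, and poorly constructed search queries, leading to inefficiency and suboptimal results. To address these limitations, we present DeepMMSearch-R1, the first multimodal LLM that can perform on-demand, multi-turn web searches and dynamically craft queries for both image and text search. Specifically, DeepMMSearch-R1 can initiate web searches based on relevant crops of the input image, which makes image search more effective, and can iteratively adapt text search queries based on the retrieved information, enabling self-reflection and self-correction. Our approach relies on a two-stage training pipeline: a cold-start supervised fine-tuning phase followed by online reinforcement learning optimization. For training, we introduce DeepMMSearchVQA, a new multimodal VQA dataset created through an automated pipeline intermixed with real-world information from web search tools. The dataset contains diverse multi-hop queries that integrate textual and visual information and teach the model when to search, what to search for, which search tool to use, and how to reason over the retrieved information. We run extensive experiments across a range of knowledge-intensive benchmarks to demonstrate the superiority of our approach. Finally, we analyze the results and offer insights valuable for advancing multimodal web search.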

Related papers