Fugu-MT Paper Translation (Summary): POINTS-Seeker: Towards Training a Multimodal Agentic Search Model from Scratch

Paper summary: POINTS-Seeker: Towards Training a Multimodal Agentic Search Model from Scratch

arxiv url: http://arxiv.org/abs/2604.14029v1
Date: Wed, 15 Apr 2026 16:09:37 GMT
Status: Translation complete
Last updated in system: 2026-04-16 20:38:32.627098
Title: POINTS-Seeker: Towards Training a Multimodal Agentic Search Model from Scratch
Authors: Yikun Liu, Yuan Liu, Le Tian, Xiao Zhou, Jiangchao Yao, Yanfeng Wang, Weidi Xie
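Abstract summary: Agentic Seeding is a dedicated phase designed to weave the foundational precursors needed to elicit agentic behaviors. The paper proposes V-Fold, an adaptive history-aware compression scheme that preserves recent dialogue turns in high fidelity while folding historical context into the visual space via rendering. The authors develop POINTS-Seeker-8B, a state-of-the-art multimodal agentic search model.
Reference score (attention metric computed independently by this site): 84.73366911912512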
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: While Large Multimodal Models (LMMs) demonstrate impressive visual perception, they remain epistemically constrained by their static parametric knowledge. To transcend these boundaries, multimodal search models have been adopted to actively interact with the external environment for evidence retrieval. Diverging from prevailing paradigms that merely retrofit general LMMs with search tools as modular extensions, we explore the potential of building a multimodal agentic search model from scratch. Specifically, we make the following contributions: (i) we introduce Agentic Seeding, a dedicated phase designed to weave the foundational precursors necessary for eliciting agentic behaviors; (ii) we uncover a performance bottleneck in long-horizon interactions, where the increasing volume of interaction history overwhelms the model's ability to locate ground-truth evidence. To mitigate this, we propose V-Fold, an adaptive history-aware compression scheme that preserves recent dialogue turns in high fidelity while folding historical context into the visual space via rendering; and (iii) we develop POINTS-Seeker-8B, a state-of-the-art multimodal agentic search model that consistently outperforms existing models across six diverse benchmarks, effectively resolving the challenges of long-horizon, knowledge-intensive visual reasoning.
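The V-Fold idea described in the abstract (keep recent turns verbatim, fold older history into rendered visual context) can be illustrated with a minimal sketch. The abstract does not specify the algorithm, so everything below is an assumption made for illustration: the `Turn` and `FoldedPage` types, the fixed `keep_recent` cutoff (the paper describes the scheme as adaptive), and the `page_width` layout parameter are all hypothetical. The "page" here is only a list of wrapped text lines standing in for an image that a real system would rasterize and pass to the vision encoder.

```python
from dataclasses import dataclass
from textwrap import wrap
from typing import List, Tuple


@dataclass
class Turn:
    role: str  # e.g. "user", "assistant", "tool" (hypothetical role names)
    text: str


@dataclass
class FoldedPage:
    """Stand-in for a rendered image: text lines laid out for rasterization."""
    lines: List[str]


def fold_history(
    turns: List[Turn], keep_recent: int = 2, page_width: int = 60
) -> Tuple[List[FoldedPage], List[Turn]]:
    """Split a dialogue history into a folded part and a verbatim part.

    Assumption: a fixed cutoff is used here for simplicity; the scheme in
    the paper is described as adaptive, and its actual policy is not given
    in the abstract.
    """
    if keep_recent < 0:
        raise ValueError("keep_recent must be non-negative")

    # The last `keep_recent` turns stay as plain text (high fidelity).
    cut = max(len(turns) - keep_recent, 0)
    older, recent = turns[:cut], turns[cut:]
    if not older:
        return [], recent

    # Older turns are laid out as fixed-width lines, ready to be drawn
    # onto an image by a separate (not shown) rendering step.
    lines: List[str] = []
    for turn in older:
        lines.extend(wrap(f"[{turn.role}] {turn.text}", width=page_width))
    return [FoldedPage(lines)], recent
```

In this sketch, a four-turn history with `keep_recent=2` yields one folded page containing the first two turns and a list of the last two turns as text. A real pipeline would additionally have to draw each page to pixels and decide the cutoff from the interaction state, neither of which is covered here.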

Related papers