Fugu-MT Paper Translation (Summary): VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation

Paper overview: VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation

arxiv url: http://arxiv.org/abs/2605.16079v1
Date: Fri, 15 May 2026 15:43:28 GMT
Status: Translation complete
Last updated in system: 2026-05-18 21:22:26.344374
Title: VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation
Authors: Yiming Zhao, Yu Zeng, Wenxuan Huang, Zhen Fang, Qing Miao, Qisheng Su, Jiawei Zhao, Jiayin Cai, Lin Chen, Zehui Chen, Yukun Qi, Yao Hu, Xiaolong Jiang, Feng Zhao
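Abstract summary: VideoSeeker is a novel paradigm for instance-level video understanding through visual prompts. We construct a four-stage, fully automated data synthesis pipeline that efficiently generates large-scale, high-quality instance-level video data. Our model achieves an average improvement of +13.7% over baselines on instance-level video understanding tasks.
Reference score (independently computed attention metric): 46.226603529472065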
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Large Vision-Language Models (LVLMs) have shown significant progress in video understanding, yet they face substantial challenges in tasks requiring precise spatiotemporal localization at the instance level. Existing methods primarily rely on text prompts for human-model interaction, but these prompts struggle to provide precise spatial and temporal references, resulting in poor user experience. Furthermore, current approaches typically decouple visual perception from language reasoning, centering reasoning around language rather than visual content, which limits the model's ability to proactively perceive fine-grained visual evidence. To address these challenges, we propose VideoSeeker, a novel paradigm for instance-level video understanding through visual prompts. VideoSeeker seamlessly integrates agentic reasoning with instance-level video understanding tasks, enabling the model to proactively perceive and retrieve relevant video segments on demand. We construct a four-stage fully automated data synthesis pipeline to efficiently generate large-scale, high-quality instance-level video data. We internalize tool-calling and proactive perception capabilities into the model via cold-start supervision and RL training, building a powerful video understanding model. Experiments demonstrate that our model achieves an average improvement of +13.7% over baselines on instance-level video understanding tasks, surpassing powerful closed-source models such as GPT-4o and Gemini-2.5-Pro, while also showing effective transferability on general video understanding benchmarks. The relevant datasets and code will be released publicly.
