Fugu-MT Paper Translation (Abstract): VideoTIR: Accurate Understanding for Long Videos with Efficient Tool-Integrated Reasoning

Paper abstract: VideoTIR: Accurate Understanding for Long Videos with Efficient Tool-Integrated Reasoning

arxiv url: http://arxiv.org/abs/2603.25021v1
Date: Thu, 26 Mar 2026 04:37:54 GMT
Status: Translation complete
System update date: 2026-03-27 20:52:48.099423
Title: VideoTIR: Accurate Understanding for Long Videos with Efficient Tool-Integrated Reasoning
Title (reference translation): VideoTIR: Accurate Understanding of Long Videos through Efficient Tool-Integrated Reasoning
Authors: Zhe Gao, Shiyu Shen, Taifeng Chai, Weinong Wang, Haotian Xu, Xing W, Wenbin Li, Qi Fan, Yang Gao, Dacheng Tao
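Abstract summary: VideoTIR explores both Zero-RL and SFT cold-starting to enable MLLMs to retrieve and focus on meaningful video segments/images/regions. We develop a sandbox-based trajectory synthesis framework that generates high-quality trajectory data.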
Reference score (independently computed attention score): 47.619860680226964
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Existing Multimodal Large Language Models (MLLMs) often suffer from hallucinations in long video understanding (LVU), primarily due to the imbalance between textual and visual tokens. Observing that MLLMs handle short visual inputs well, recent LVU works alleviate hallucinations by automatically parsing the vast visual data into manageable segments that can be effectively processed by MLLMs. SFT-based tool-calling methods can serve this purpose, but they typically require vast amounts of fine-grained, high-quality data and suffer from constrained tool-calling trajectories. We propose VideoTIR, a novel method that leverages Reinforcement Learning (RL) to encourage proper usage of comprehensive multi-level toolkits for efficient long video understanding. VideoTIR explores both Zero-RL and SFT cold-starting to enable MLLMs to retrieve and focus on meaningful video segments/images/regions, making long video understanding both more accurate and more efficient. To reduce redundant tool-calling, we propose Toolkit Action Grouped Policy Optimization (TAGPO), which enhances the efficiency of the calling process through stepwise reward assignment and reuse of failed rollouts. Additionally, we develop a sandbox-based trajectory synthesis framework to generate high-quality trajectory data. Extensive experiments on three long-video QA benchmarks demonstrate the effectiveness and efficiency of our method.
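Abstract (reference translation): Existing Multimodal Large Language Models (MLLMs) often suffer from hallucinations in long video understanding (LVU). Observing that MLLMs handle short visual inputs well, recent LVU work alleviates hallucinations by automatically parsing the vast visual data into manageable segments that MLLMs can process effectively. SFT-based tool-calling methods can serve this purpose, but they typically require large amounts of fine-grained, high-quality data and suffer from constrained tool-calling trajectories. We propose VideoTIR, a new method that leverages Reinforcement Learning (RL). VideoTIR explores both Zero-RL and SFT cold-starting to enable MLLMs to retrieve and focus on meaningful video segments/images/regions. To reduce redundant tool calls, we propose Toolkit Action Grouped Policy Optimization (TAGPO), which improves the efficiency of the calling process through stepwise reward assignment and reuse of failed rollouts. Additionally, we develop a sandbox-based trajectory synthesis framework that generates high-quality trajectory data. Experiments on three long-video QA benchmarks demonstrate the effectiveness and efficiency of the proposed method.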

Related papers