Fugu-MT Paper Translation (Abstract): ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning

Paper overview: ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning

arxiv url: http://arxiv.org/abs/2605.20342v1
Date: Tue, 19 May 2026 18:01:26 GMT
Status: Translation complete
Last updated in system: 2026-05-21 19:19:56.304433
Title: ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning
Title (reference translation): ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning
Authors: Zuhao Yang, Kaichen Zhang, Sudong Wang, Keming Wu, Zhongyu Yang, Bo Li, Xiaojuan Qi, Shijian Lu, Xingxuan Li, Lidong Bing
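Abstract summary: We introduce ParaVT, the first multi-agent end-to-end RL-trained framework for Parallel Video Tool calling. ParaVT improves over the Qwen3-VL baseline by +7.9% on average, with PARA-GRPO lifting training-time format compliance from 0.13 to 0.64.
Reference score (independently computed attention score): 91.51460129144654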
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Training large multimodal models (LMMs) via reinforcement learning (RL) to natively invoke video-processing tools (e.g., cropping) has become a promising route to long-video understanding. However, existing native-RL methods dispatch tool calls sequentially (i.e., one per turn): a single wrong crop propagates errors without peer correction, multi-turn tool calls corrupt context, and inference cost scales linearly with the number of turns. We introduce ParaVT, the first multi-agent end-to-end RL-trained framework for Parallel Video Tool calling, dispatching multiple time-window crops in a single turn for cleaner context and better fault tolerance. Yet applying standard RL to ParaVT reveals an obstacle we term the Tool Prior Paradox: the pretrained tool priors that enable tool exploration also destabilize cold-started structural format and expose the skip-tool reward shortcut under temperature sampling. A cross-model contrast on a weaker-prior LMM supports this claim: format stays stable but RL elicits zero tool calls, indicating that prior strength is the shared driver of both format collapse and tool exploration. We propose PARA-GRPO (Parseability-Anchored and Ratio-gAted GRPO), which augments standard RL with two complementary mechanisms: (i) a targeted format reward applied only at the structural-token positions most prone to collapse, and (ii) a per-prompt frame-budget randomization that creates training prompts where calling the tool yields a measurable reward signal over skipping it. Across six long-video understanding benchmarks, ParaVT improves over the Qwen3-VL baseline by +7.9% on average, with PARA-GRPO lifting training-time format compliance from 0.13 to 0.64. As tool capabilities become increasingly internalized in modern LMMs, RL must cooperate with the resulting priors, and ParaVT offers a general recipe for agentic RL. Code, data, and model weights are publicly available.
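Abstract (reference translation): Training large multimodal models (LMMs) with reinforcement learning (RL) to natively invoke video-processing tools (for example, cropping) has become a promising route to long-video understanding. However, existing native-RL methods dispatch tool calls sequentially, one per turn: a single wrong crop propagates errors without peer correction, multi-turn tool calls corrupt the context, and inference cost scales linearly with the number of turns. We introduce ParaVT, the first multi-agent end-to-end RL-trained framework for Parallel Video Tool calling, which dispatches multiple time-window crops in a single turn for cleaner context and better fault tolerance. Applying standard RL to ParaVT, however, reveals an obstacle we call the Tool Prior Paradox: the pretrained tool priors that enable tool exploration also destabilize the cold-started structural format and expose the skip-tool reward shortcut under temperature sampling. A cross-model contrast on an LMM with a weaker prior supports this claim: the format stays stable but RL elicits zero tool calls, indicating that prior strength is the shared driver of both format collapse and tool exploration. We propose PARA-GRPO (Parseability-Anchored and Ratio-gAted GRPO), which augments standard RL with two complementary mechanisms: (i) a targeted format reward applied only at the structural-token positions most prone to collapse, and (ii) a per-prompt frame-budget randomization that creates training prompts where calling the tool yields a measurable reward signal over skipping it. Across six long-video understanding benchmarks, ParaVT improves over the Qwen3-VL baseline by +7.9% on average, and PARA-GRPO lifts training-time format compliance from 0.13 to 0.64. As tool capabilities become increasingly internalized in modern LMMs, RL must cooperate with the resulting priors, and ParaVT offers a general recipe for agentic RL. Code, data, and model weights are publicly available.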
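
The contrast the abstract draws between one crop per turn and several time-window crops in a single turn can be illustrated with a minimal sketch. This is not the authors' implementation: `crop_window`, `dispatch_parallel`, and the toy list-of-frames "video" are hypothetical names and data chosen only for illustration, and the RL training loop is omitted entirely.

```python
from concurrent.futures import ThreadPoolExecutor


def crop_window(frames, start, end):
    """Hypothetical crop tool: keep frames whose timestamp t satisfies start <= t < end."""
    return [(t, f) for t, f in frames if start <= t < end]


def dispatch_parallel(frames, windows):
    """Run every requested crop within one turn; results come back in request order."""
    # Assumption: each crop is independent, so the calls can run concurrently.
    with ThreadPoolExecutor(max_workers=max(1, len(windows))) as pool:
        futures = [pool.submit(crop_window, frames, s, e) for s, e in windows]
        return [f.result() for f in futures]


# Toy "video": one placeholder frame per second for 60 seconds.
video = [(t, f"frame_{t}") for t in range(60)]

# Three time windows requested together instead of over three sequential turns.
windows = [(0, 5), (20, 23), (50, 60)]
crops = dispatch_parallel(video, windows)
```

In this sketch all crops return together, so a model would see every window in the same context rather than accumulating them across turns.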

Related papers