Fugu-MT Paper Translation (Abstract): SAGE: Training Smart Any-Horizon Agents for Long Video Reasoning with Reinforcement Learning

Paper overview: SAGE: Training Smart Any-Horizon Agents for Long Video Reasoning with Reinforcement Learning

arxiv url: http://arxiv.org/abs/2512.13874v1
Date: Mon, 15 Dec 2025 20:14:19 GMT
Status: Translation complete
System update date: 2025-12-17 16:49:26.480572
Title: SAGE: Training Smart Any-Horizon Agents for Long Video Reasoning with Reinforcement Learning
Title (reference translation): SAGE: Training Smart Any-Horizon Agents for Long Video Reasoning with Reinforcement Learning
Authors: Jitesh Jain, Jialuo Li, Zixian Ma, Jieyu Zhang, Chris Dongjoo Kim, Sangho Lee, Rohun Tripathi, Tanmay Gupta, Christopher Clark, Humphrey Shi
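Abstract summary: SAGE is an agent system that performs multi-turn reasoning on long videos while handling simpler problems in a single turn. The authors further propose an effective RL post-training recipe for instilling any-horizon reasoning ability in SAGE-MM. They empirically validate the effectiveness of the proposed approach, observing notable improvements of up to 6.1% on open-ended video reasoning tasks.
Reference score (independently computed attention score): 53.67654657011112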
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: As humans, we are natural any-horizon reasoners, i.e., we can decide whether to iteratively skim long videos or watch short ones in full when necessary for a given task. With this in mind, one would expect video reasoning models to reason flexibly across different durations. However, SOTA models are still trained to predict answers in a single turn while processing a large number of frames, akin to watching an entire long video, requiring significant resources. This raises the question: Is it possible to develop performant any-horizon video reasoning systems? Inspired by human behavior, we first propose SAGE, an agent system that performs multi-turn reasoning on long videos while handling simpler problems in a single turn. Secondly, we introduce an easy synthetic data generation pipeline using Gemini-2.5-Flash to train the orchestrator, SAGE-MM, which lies at the core of SAGE. We further propose an effective RL post-training recipe essential for instilling any-horizon reasoning ability in SAGE-MM. Thirdly, we curate SAGE-Bench with an average duration of greater than 700 seconds for evaluating video reasoning ability in real-world entertainment use cases. Lastly, we empirically validate the effectiveness of our system, data, and RL recipe, observing notable improvements of up to 6.1% on open-ended video reasoning tasks, as well as an impressive 8.2% improvement on videos longer than 10 minutes.
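Abstract (reference translation): As humans, we are natural any-horizon reasoners; that is, we can decide whether to iteratively skim a long video or watch a short one in full when a given task requires it. With this in mind, one would expect video reasoning models to reason flexibly across different durations. However, SOTA models are still trained to predict answers in a single turn while processing a large number of frames, much like watching an entire long video. This raises the question: is it possible to develop performant any-horizon video reasoning systems? We propose SAGE, an agent system inspired by human behavior. Next, we introduce a simple synthetic data generation pipeline using Gemini-2.5-Flash to train SAGE-MM, the orchestrator at the core of SAGE. We further propose an effective RL post-training recipe for instilling any-horizon reasoning ability in SAGE-MM. Third, we curate SAGE-Bench, with an average duration of more than 700 seconds, to evaluate video reasoning ability in real-world entertainment use cases. Finally, we empirically validate the effectiveness of our system, data, and RL recipe, observing notable improvements of up to 6.1% on open-ended video reasoning tasks.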

Related papers