Fugu-MT Paper Translation (Abstract): Learning Transferable Temporal Primitives for Video Reasoning via Synthetic Videos

Paper overview: Learning Transferable Temporal Primitives for Video Reasoning via Synthetic Videos

arxiv url: http://arxiv.org/abs/2603.17693v1
Date: Wed, 18 Mar 2026 13:10:47 GMT
Status: Translation complete
System update date: 2026-03-19 18:32:57.713527
Title: Learning Transferable Temporal Primitives for Video Reasoning via Synthetic Videos
Title (reference translation): Learning transferable temporal primitives for video reasoning with synthetic videos
Authors: Songtao Jiang, Sibo Song, Chenyi Zhou, Yuan Wang, Ruizhe Chen, Tongkun Guan, Ruilin Luo, Yan Zhang, Zhihang Tang, Yuchong Sun, Hang Zhang, Zhibo Yang, Shuai Bai, Junyang Lin, Zuozhu Liu
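Abstract summary: We introduce SynRL, a post-training framework that teaches models temporal primitives. Temporal understanding is decomposed into short-term perceptual primitives (speed, direction) and long-term cognitive primitives. Despite training on simple geometric shapes, SynRL achieves substantial improvements across 15 benchmarks spanning temporal grounding, complex reasoning, and general video understanding.
Reference score (independently computed attention score): 52.00944453189226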
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The transition from image to video understanding requires vision-language models (VLMs) to shift from recognizing static patterns to reasoning over temporal dynamics such as motion trajectories, speed changes, and state transitions. Yet current post-training methods fall short due to two critical limitations: (1) existing datasets often lack temporal-centricity, where answers can be inferred from isolated keyframes rather than requiring holistic temporal integration; and (2) training data generated by proprietary models contains systematic errors in fundamental temporal perception, such as confusing motion directions or misjudging speeds. We introduce SynRL, a post-training framework that teaches models temporal primitives, the fundamental building blocks of temporal understanding including direction, speed, and state tracking. Our key insight is that these abstract primitives, learned from programmatically generated synthetic videos, transfer effectively to real-world scenarios. We decompose temporal understanding into short-term perceptual primitives (speed, direction) and long-term cognitive primitives, constructing 7.7K CoT and 7K RL samples with ground-truth frame-level annotations through code-based video generation. Despite training on simple geometric shapes, SynRL achieves substantial improvements across 15 benchmarks spanning temporal grounding, complex reasoning, and general video understanding. Remarkably, our 7.7K synthetic CoT samples outperform Video-R1 with 165K real-world samples. We attribute this to fundamental temporal skills, such as tracking frame-by-frame changes and comparing velocity, that transfer effectively from abstract synthetic patterns to complex real-world scenarios. This establishes a new paradigm for video post-training: video temporal learning through carefully designed synthetic data provides a more cost-efficient scaling path.
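Abstract (reference translation): The transition from image to video understanding requires vision-language models (VLMs) to shift from recognizing static patterns to reasoning about temporal dynamics such as motion trajectories, speed changes, and state transitions. Current post-training methods fall short because of two limitations: (1) existing datasets often lack temporal-centricity, so answers can be inferred from isolated keyframes without holistic temporal integration; and (2) training data generated by proprietary models contains systematic errors in basic temporal perception, such as confused motion directions or misjudged speeds. We introduce SynRL, a post-training framework that teaches models temporal primitives, the basic building blocks of temporal understanding, including direction, speed, and state tracking. Our key insight is that these abstract primitives, learned from programmatically generated synthetic videos, transfer effectively to real-world scenarios. We decompose temporal understanding into short-term perceptual primitives (speed, direction) and long-term cognitive primitives, and construct 7.7K CoT and 7K RL samples with ground-truth frame-level annotations through code-based video generation. Despite training on simple geometric shapes, SynRL achieves substantial improvements across 15 benchmarks spanning temporal grounding, complex reasoning, and general video understanding. Notably, our 7.7K synthetic CoT samples outperform Video-R1 with 165K real-world samples. We attribute this to fundamental temporal skills, such as tracking frame-by-frame changes and comparing velocities, that transfer effectively from abstract synthetic patterns to complex real-world scenarios. Video temporal learning through carefully designed synthetic data provides a more cost-efficient scaling path.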
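
Illustrative sketch (not from the paper): the abstract mentions code-based video generation with ground-truth frame-level annotations but gives no implementation details. The snippet below is only a rough guess at what such a generator could look like for a single square moving at constant velocity. Every name in it (`generate_clip`, `direction_label`, `speed_label`, the 16x16 grid, the label vocabulary) is a hypothetical choice made for illustration and is not taken from SynRL.

```python
from dataclasses import dataclass


@dataclass
class FrameAnnotation:
    """Ground-truth record for one frame (hypothetical schema)."""
    frame: int
    x: int
    y: int


def render_frame(x, y, width, height, side=2):
    """Rasterize a filled square with top-left corner (x, y) onto a binary grid."""
    return [
        [1 if x <= col < x + side and y <= row < y + side else 0
         for col in range(width)]
        for row in range(height)
    ]


def generate_clip(start, velocity, num_frames, width=16, height=16, side=2):
    """Move a square at constant velocity (pixels per frame), recording exact positions."""
    frames, annotations = [], []
    x, y = start
    vx, vy = velocity
    for t in range(num_frames):
        frames.append(render_frame(x, y, width, height, side))
        annotations.append(FrameAnnotation(frame=t, x=x, y=y))
        x, y = x + vx, y + vy
    return frames, annotations


def direction_label(annotations):
    """Coarse direction label from first and last positions (image y grows downward)."""
    dx = annotations[-1].x - annotations[0].x
    dy = annotations[-1].y - annotations[0].y
    if abs(dx) >= abs(dy):
        return "right" if dx > 0 else "left" if dx < 0 else "static"
    return "down" if dy > 0 else "up"


def speed_label(annotations):
    """Average Euclidean displacement per frame, in pixels."""
    steps = len(annotations) - 1
    if steps < 1:
        raise ValueError("need at least two frames to measure speed")
    dx = annotations[-1].x - annotations[0].x
    dy = annotations[-1].y - annotations[0].y
    return (dx * dx + dy * dy) ** 0.5 / steps


# Example: a question/answer pair whose answer comes from the generator itself.
frames, ann = generate_clip(start=(1, 5), velocity=(2, 0), num_frames=5)
sample = {
    "question": "Which way does the square move?",
    "answer": direction_label(ann),
    "speed_px_per_frame": speed_label(ann),
}
```

Because the answer is computed from the same parameters that drive the rendering, labels produced this way cannot disagree with the video, which is the property the abstract contrasts with model-generated training data.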

List of related papers