Fugu-MT Paper Translation (Abstract): Video Models Can Reason with Verifiable Rewards

Paper overview: Video Models Can Reason with Verifiable Rewards

arxiv url: http://arxiv.org/abs/2605.15458v1
Date: Thu, 14 May 2026 22:40:56 GMT
Status: Translation complete
Last updated in system: 2026-05-18 21:22:26.117943
Title: Video Models Can Reason with Verifiable Rewards
Authors: Tinghui Zhu, Sheng Zhang, James Y. Huang, Selena Song, Xiaofei Wen, Yuankai Li, Hoifung Poon, Muhao Chen
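Abstract summary: This paper introduces VideoRLVR, a method for optimizing video diffusion models with rule-based feedback. VideoRLVR formulates video reasoning as the generation of verifiable visual trajectories. VideoRLVR is evaluated on Maze, FlowFree, and Sokoban, three procedurally generated domains with objective success criteria.
Reference score (independently computed attention score): 31.381840584972675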
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Video diffusion models have made rapid progress in perceptual realism and temporal coherence, but they remain primarily optimized for plausible generation rather than verifiable reasoning. This limitation is especially pronounced in tasks where generated videos must satisfy explicit spatial, temporal, or logical constraints. Inspired by the role of reinforcement learning with verifiable rewards (RLVR) in reasoning-oriented language models, we introduce VideoRLVR, a practical recipe for optimizing video diffusion models with rule-based feedback. VideoRLVR formulates video reasoning as the generation of verifiable visual trajectories and consists of an SDE-GRPO optimization backbone, dense decomposed rewards, and an Early-Step Focus strategy for efficient training. The Early-Step Focus strategy restricts policy optimization to the early denoising phase, reducing training latency by about 40% while preserving performance. We evaluate VideoRLVR on Maze, FlowFree, and Sokoban, three procedurally generated domains with objective success criteria. Across these tasks, VideoRLVR consistently improves over supervised fine-tuning baselines, with dense decomposed rewards proving especially important in low-success-rate settings. Our RL-optimized model also outperforms the evaluated proprietary and open-source video generation models on these verifiable reasoning benchmarks and out-of-domain benchmarks. These results suggest that verifiable RL can move video models beyond perceptual imitation toward more reliable rule-consistent visual reasoning.
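The abstract names three ingredients: rule-based (verifiable) rewards that are dense and decomposed, a GRPO-style optimizer, and an Early-Step Focus that limits policy optimization to early denoising steps. The sketch below illustrates these ideas on a toy maze. It is not the paper's implementation: the grid encoding, the reward components and their weights, the `fraction` cutoff, and every function name are assumptions made for illustration, and the step that decodes a path from generated video frames is omitted entirely.

```python
from statistics import mean, pstdev

# Hypothetical maze encoding (an assumption): 0 = free cell, 1 = wall.
Grid = list[list[int]]
Cell = tuple[int, int]


def maze_reward(grid: Grid, path: list[Cell], start: Cell, goal: Cell) -> float:
    """Illustrative dense, decomposed rule-based reward in [0, 1].

    Components and weights are assumptions, not the paper's definitions:
      valid    - fraction of steps that are unit moves onto free cells
      progress - normalized reduction in Manhattan distance to the goal
      success  - 1.0 only if every step is valid and the path ends at the goal
    """
    if not path or path[0] != start:
        return 0.0
    rows, cols = len(grid), len(grid[0])

    def free(c: Cell) -> bool:
        r, k = c
        return 0 <= r < rows and 0 <= k < cols and grid[r][k] == 0

    def dist(c: Cell) -> int:
        return abs(c[0] - goal[0]) + abs(c[1] - goal[1])

    steps = list(zip(path, path[1:]))
    valid_steps = sum(
        1 for a, b in steps
        if free(b) and abs(a[0] - b[0]) + abs(a[1] - b[1]) == 1
    )
    valid = valid_steps / len(steps) if steps else 1.0
    d0 = dist(start)
    progress = max(0.0, (d0 - dist(path[-1])) / d0) if d0 else 1.0
    success = 1.0 if valid_steps == len(steps) and path[-1] == goal else 0.0
    return 0.25 * valid + 0.25 * progress + 0.5 * success


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an illustrative choice
    return [(r - mu) / (sigma + eps) for r in rewards]


def early_step_mask(num_steps: int, fraction: float = 0.5) -> list[bool]:
    """True for denoising steps that would contribute to the policy loss.

    `fraction` is an arbitrary illustrative cutoff, not a value from the paper.
    """
    cutoff = max(1, int(num_steps * fraction))
    return [t < cutoff for t in range(num_steps)]


if __name__ == "__main__":
    grid = [[0, 0],
            [1, 0]]
    start, goal = (0, 0), (1, 1)
    group = [
        [(0, 0), (0, 1), (1, 1)],  # reaches the goal through free cells
        [(0, 0), (1, 0)],          # steps into a wall
    ]
    rewards = [maze_reward(grid, p, start, goal) for p in group]
    print(rewards, group_relative_advantages(rewards), early_step_mask(10))
```

A partially correct trajectory still earns a nonzero reward from the `valid` and `progress` terms, which is the intuition behind the abstract's remark that dense decomposed rewards matter most when full successes are rare.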


Related papers