Fugu-MT Paper Translation (Abstract): Demystifing Video Reasoning

Paper summary: Demystifing Video Reasoning

arxiv url: http://arxiv.org/abs/2603.16870v1
Date: Tue, 17 Mar 2026 17:59:55 GMT
Status: Translation complete
System update date: 2026-03-18 17:42:07.477138
Title: Demystifing Video Reasoning
Title (reference translation): Demystifing Video Reasoning
Authors: Ruisi Wang, Zhongang Cai, Fanyi Pu, Junxiang Xu, Wanqi Yin, Maijunxian Wang, Ran Ji, Chenyang Gu, Bo Li, Ziqi Huang, Hokin Deng, Dahua Lin, Ziwei Liu, Lei Yang
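Abstract summary: We show that reasoning in video models primarily emerges along the diffusion denoising steps. We identify several emergent reasoning behaviors critical to model performance. Motivated by these insights, we present a training-free strategy as a proof-of-concept.
Reference score (independently computed attention score): 71.53763299316041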
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advances in video generation have revealed an unexpected phenomenon: diffusion-based video models exhibit non-trivial reasoning capabilities. Prior work attributes this to a Chain-of-Frames (CoF) mechanism, where reasoning is assumed to unfold sequentially across video frames. In this work, we challenge this assumption and uncover a fundamentally different mechanism. We show that reasoning in video models instead primarily emerges along the diffusion denoising steps. Through qualitative analysis and targeted probing experiments, we find that models explore multiple candidate solutions in early denoising steps and progressively converge to a final answer, a process we term Chain-of-Steps (CoS). Beyond this core mechanism, we identify several emergent reasoning behaviors critical to model performance: (1) working memory, enabling persistent reference; (2) self-correction and enhancement, allowing recovery from incorrect intermediate solutions; and (3) perception before action, where early steps establish semantic grounding and later steps perform structured manipulation. During a diffusion step, we further uncover self-evolved functional specialization within Diffusion Transformers, where early layers encode dense perceptual structure, middle layers execute reasoning, and later layers consolidate latent representations. Motivated by these insights, we present a simple training-free strategy as a proof-of-concept, demonstrating how reasoning can be improved by ensembling latent trajectories from identical models with different random seeds. Overall, our work provides a systematic understanding of how reasoning emerges in video generation models, offering a foundation to guide future research in better exploiting the inherent reasoning dynamics of video models as a new substrate for intelligence.
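Abstract (reference translation): Diffusion-based video models exhibit non-trivial reasoning capabilities. Prior work attributed this to a Chain-of-Frames (CoF) mechanism, which assumes that reasoning unfolds sequentially across video frames. In this work, we challenge this assumption and uncover a fundamentally different mechanism. We show that reasoning in video models primarily emerges along the diffusion denoising steps. Through qualitative analysis and targeted probing experiments, we find that models explore multiple candidate solutions in early denoising steps and progressively converge to a final answer, a process we call Chain-of-Steps (CoS). Beyond this core mechanism, we identify several emergent reasoning behaviors critical to model performance: (1) working memory, which enables persistent reference; (2) self-correction and enhancement, which allow recovery from incorrect intermediate solutions; and (3) perception before action, where early steps establish semantic grounding and later steps perform structured manipulation. Within a single diffusion step, we further uncover self-evolved functional specialization inside Diffusion Transformers: early layers encode dense perceptual structure, middle layers execute reasoning, and later layers consolidate latent representations. Motivated by these insights, we present a simple training-free strategy as a proof-of-concept, showing how reasoning can be improved by ensembling latent trajectories from identical models with different random seeds. Overall, our work provides a systematic understanding of how reasoning emerges in video generation models and offers a foundation to guide future research in better exploiting the inherent reasoning dynamics of video models as a new substrate for intelligence.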
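The abstract's training-free idea (ensembling latent trajectories from identical models run with different random seeds) can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify how or at which steps the trajectories are combined. The sketch assumes simple averaging of latents across seeds during the first few denoising steps, and `toy_denoise_step`, `ensemble_latent_trajectories`, `ensemble_steps`, and the fixed target vector are all hypothetical names and values standing in for a real video diffusion model.

```python
import random


def toy_denoise_step(latent, step, num_steps, rng):
    """Hypothetical stand-in for one denoising step of a video diffusion model.

    Pulls the latent toward a fixed target and injects noise that shrinks
    as the step index grows. A real model would run its denoiser here.
    """
    target = [1.0, -1.0, 0.5]  # assumed toy "clean" latent
    noise_scale = 1.0 - (step + 1) / num_steps
    return [
        x + 0.3 * (t - x) + 0.1 * noise_scale * rng.gauss(0.0, 1.0)
        for x, t in zip(latent, target)
    ]


def ensemble_latent_trajectories(seeds, num_steps=10, ensemble_steps=3, dim=3):
    """Run one trajectory per seed and average latents across seeds early on.

    Averaging during the first `ensemble_steps` steps is an assumption made
    for illustration; afterwards each trajectory continues independently.
    Returns the final latent of every trajectory.
    """
    rngs = [random.Random(seed) for seed in seeds]
    # Each seed starts from its own Gaussian noise sample.
    latents = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for rng in rngs]
    for step in range(num_steps):
        latents = [
            toy_denoise_step(z, step, num_steps, rng)
            for z, rng in zip(latents, rngs)
        ]
        if step < ensemble_steps:
            # Ensemble: replace every latent with the cross-seed mean.
            mean = [sum(column) / len(latents) for column in zip(*latents)]
            latents = [list(mean) for _ in latents]
    return latents


if __name__ == "__main__":
    for final_latent in ensemble_latent_trajectories([0, 1, 2]):
        print(final_latent)
```

Averaging in the early steps is chosen here only because the abstract says candidate solutions are explored early and converge later; other combination rules or step ranges would fit the same description.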


Related papers