Fugu-MT paper translation (abstract): Reasoning to Align: Implicit Reasoning in Diffusion Transformers for Video Editing

Paper abstract: Reasoning to Align: Implicit Reasoning in Diffusion Transformers for Video Editing

arxiv url: http://arxiv.org/abs/2605.24674v1
Date: Sat, 23 May 2026 17:22:14 GMT
Status: translation complete
Last updated in system: 2026-05-26 19:50:18.306593
Title: Reasoning to Align: Implicit Reasoning in Diffusion Transformers for Video Editing
Authors: Yan Li, Lin Liu, Xiaopeng Zhang, Qi Tian
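Abstract summary: This paper proposes RVEDiT, an implicit Reasoning Video Editing DiT framework built around two complementary components. RVEDiT consistently outperforms state-of-the-art baselines, with particularly strong gains on localized and compositional edits.
Reference score (independently computed notability): 55.211537893248675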
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Instruction-based video editing requires transforming a source video according to a natural-language instruction while preserving irrelevant content and remaining temporally coherent. We argue that existing Diffusion Transformer (DiT) editors struggle with this task for two structural reasons. First, conditioning signals are fed undifferentiated into all transformer blocks, forcing a single token stream to encode both global editing intent and fine-grained visual evidence. Second, the cross-attention patterns that govern the edit are supervised only indirectly through pixel-level reconstruction, leaving the model's internal reasoning process under-constrained. To address both limitations, we propose RVEDiT, an implicit Reasoning Video Editing DiT framework built around two complementary components. The first, Granularity-Routed Token Conditioning, introduces learnable editing tokens distilled from a multimodal LLM and routes them to shallow blocks, while reserving native visual and textual tokens for deeper blocks, thereby inducing a coarse-to-fine editing process inside the backbone. The second, Reference-Anchored Attention Alignment, employs a parameter-sharing reference branch during training and maximizes the mutual information between the attention features of the editing and reference branches, regularizing the model's internal reasoning without incurring any additional inference cost. Experiments on standard instruction-based video editing benchmarks show that RVEDiT consistently outperforms state-of-the-art baselines, with particularly strong gains on localized and compositional edits.
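The two components named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state where the shallow/deep boundary lies or which mutual-information estimator is used. The sketch assumes a fixed depth cutoff (`shallow_fraction`) and uses the InfoNCE loss, a common lower bound on mutual information, as a stand-in. All function and parameter names are hypothetical.

```python
import math


def route_conditioning(block_index, num_blocks, editing_tokens, native_tokens,
                       shallow_fraction=0.5):
    """Pick the conditioning tokens for one transformer block.

    Assumption: blocks below a fixed depth cutoff count as "shallow" and
    receive the learnable editing tokens; the rest receive the native
    visual and textual tokens. The cutoff rule is illustrative only.
    """
    cutoff = int(num_blocks * shallow_fraction)
    return editing_tokens if block_index < cutoff else native_tokens


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _normalize(v):
    norm = math.sqrt(_dot(v, v)) or 1.0  # avoid division by zero
    return [x / norm for x in v]


def info_nce_loss(edit_feats, ref_feats, temperature=0.1):
    """InfoNCE loss between paired feature vectors.

    Minimizing this loss maximizes a lower bound on the mutual information
    between the two sets of features. Row i of edit_feats is treated as the
    positive pair of row i of ref_feats; all other rows act as negatives.
    """
    e = [_normalize(v) for v in edit_feats]
    r = [_normalize(v) for v in ref_feats]
    total = 0.0
    for i, e_i in enumerate(e):
        logits = [_dot(e_i, r_j) / temperature for r_j in r]
        m = max(logits)  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(e)


# Toy usage: 4 blocks, the first 2 are treated as shallow.
edit_tokens, native_tokens = ["edit_tok"], ["visual_tok", "text_tok"]
per_block = [route_conditioning(i, 4, edit_tokens, native_tokens) for i in range(4)]

# Attention features that agree across branches give a lower loss.
aligned = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
misaligned = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this sketch the alignment term would only be computed during training, which is consistent with the abstract's statement that the reference branch adds no inference cost.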
