Fugu-MT Paper Translation (Abstract): MultiCOIN: Multi-Modal COntrollable Video INbetweening

Paper summary: MultiCOIN: Multi-Modal COntrollable Video INbetweening

arxiv url: http://arxiv.org/abs/2510.08561v1
Date: Thu, 09 Oct 2025 17:59:27 GMT
Status: Translation complete
Last updated in system: 2025-10-10 17:54:15.30458
Title: MultiCOIN: Multi-Modal COntrollable Video INbetweening
Title (reference translation): MultiCOIN: Multi-Modal Controllable Video Inbetweening
Authors: Maham Tanveer, Yang Zhou, Simon Niklaus, Ali Mahdavi Amiri, Hao Zhang, Krishna Kumar Singh, Nanxuan Zhao
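Abstract summary: Introduces MultiCOIN, a video inbetweening framework that supports multi-modal controls. To ensure compatibility between DiT and the multi-modal controls, all motion controls are mapped into a common sparse representation. Experiments show that multi-modal controls enable a more dynamic, customizable, and contextually accurate visual narrative.
Reference score (independently computed attention score): 46.37499813275259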
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Video inbetweening creates smooth and natural transitions between two image frames, making it an indispensable tool for video editing and long-form video synthesis. Existing works in this domain are unable to generate large, complex, or intricate motions. In particular, they cannot accommodate the versatility of user intents and generally lack fine control over the details of intermediate frames, leading to misalignment with the creative mind. To fill these gaps, we introduce MultiCOIN, a video inbetweening framework that allows multi-modal controls, including depth transition and layering, motion trajectories, text prompts, and target regions for movement localization, while achieving a balance between flexibility, ease of use, and precision for fine-grained video interpolation. To achieve this, we adopt the Diffusion Transformer (DiT) architecture as our video generative model, due to its proven capability to generate high-quality long videos. To ensure compatibility between DiT and our multi-modal controls, we map all motion controls into a common sparse and user-friendly point-based representation as the video/noise input. Further, to respect the variety of controls which operate at varying levels of granularity and influence, we separate content controls and motion controls into two branches to encode the required features before guiding the denoising process, resulting in two generators, one for motion and the other for content. Finally, we propose a stage-wise training strategy to ensure that our model learns the multi-modal controls smoothly. Extensive qualitative and quantitative experiments demonstrate that multi-modal controls enable a more dynamic, customizable, and contextually accurate visual narrative.
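Abstract (reference translation): Video inbetweening produces smooth, natural transitions between two image frames, which makes it an indispensable tool for video editing and long-form video synthesis. Existing work in this area cannot generate large, complex, or intricate motions. In particular, it cannot accommodate the versatility of user intent and generally lacks fine-grained control over the details of intermediate frames, which leads to a mismatch with the creator's intent. To fill these gaps, we introduce MultiCOIN, a video inbetweening framework that supports multi-modal controls, including depth transition and layering, motion trajectories, text prompts, and target regions for movement localization, while balancing flexibility, ease of use, and precision for fine-grained video interpolation. To achieve this, we adopt the Diffusion Transformer (DiT) architecture as the video generative model because of its demonstrated ability to generate high-quality long videos. To ensure compatibility between DiT and the multi-modal controls, all motion controls are mapped into a common sparse, user-friendly point-based representation that serves as the video/noise input. Further, to respect the variety of controls, which operate at different levels of granularity and influence, content controls and motion controls are separated into two branches that encode the required features before guiding the denoising process, resulting in two generators, one for motion and one for content. Finally, we propose a stage-wise training strategy to ensure that the model learns the multi-modal controls smoothly. Extensive qualitative and quantitative experiments show that multi-modal controls enable a more dynamic, customizable, and contextually accurate visual narrative.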
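
The abstract states that all motion controls are mapped into a common sparse point-based representation, but it does not give the encoding. The following is only an illustrative sketch of what such a mapping could look like, not the authors' implementation. The function name `rasterize_trajectories`, the three-channel layout (presence, dx, dy), and the use of NumPy are all assumptions made for this example.

```python
import numpy as np


def rasterize_trajectories(trajectories, num_frames, height, width):
    """Render point trajectories into a sparse (T, H, W, 3) control volume.

    Hypothetical layout (an assumption, not taken from the paper):
      channel 0: 1.0 where a control point is present, else 0.0
      channel 1: x displacement of that point to the next frame
      channel 2: y displacement of that point to the next frame
    Each trajectory is an array of shape (num_frames, 2) holding (x, y)
    pixel coordinates, one row per frame.
    """
    control = np.zeros((num_frames, height, width, 3), dtype=np.float32)
    for traj in trajectories:
        traj = np.asarray(traj, dtype=np.float32)
        if traj.shape != (num_frames, 2):
            raise ValueError("each trajectory must have shape (num_frames, 2)")
        for t in range(num_frames):
            x, y = traj[t]
            col, row = int(np.rint(x)), int(np.rint(y))
            # Points that leave the frame are simply skipped.
            if not (0 <= row < height and 0 <= col < width):
                continue
            if t + 1 < num_frames:
                dx, dy = traj[t + 1] - traj[t]
            else:
                dx, dy = 0.0, 0.0  # no motion defined after the last frame
            control[t, row, col] = (1.0, dx, dy)
    return control


# Usage: one point moving horizontally across a 16x16 frame over 5 frames.
trajectory = np.linspace([2.0, 3.0], [10.0, 3.0], 5)
volume = rasterize_trajectories([trajectory], num_frames=5, height=16, width=16)
print(volume.shape)                       # (5, 16, 16, 3)
print(int(volume[..., 0].sum()), "active points out of", volume[..., 0].size)
```

In a sketch like this, other control types (for example a target region or a depth-ordering hint) would first be converted to sampled points and then passed through the same rasterization, which is one way to read "common" in the abstract. How the paper actually performs that conversion is not specified in the text above.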


Related papers