Fugu-MT Paper Translation (Summary): Consistent and Controllable Image Animation with Motion Linear Diffusion Transformers

Paper summary: Consistent and Controllable Image Animation with Motion Linear Diffusion Transformers

arxiv url: http://arxiv.org/abs/2508.07246v1
Date: Sun, 10 Aug 2025 08:59:32 GMT
Status: Translation complete
System update date: 2025-08-12 21:23:28.772924
Title: Consistent and Controllable Image Animation with Motion Linear Diffusion Transformers
Authors: Xin Ma, Yaohui Wang, Genyun Jia, Xinyuan Chen, Tien-Tsin Wong, Cunjian Chen
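Abstract summary: We propose MiraMo, a framework that improves efficiency, appearance consistency, and motion smoothness in image animation. Specifically, it introduces three key elements: (1) a foundational text-to-video architecture that replaces vanilla self-attention with efficient linear attention, reducing computational overhead while preserving generation quality; (2) a novel motion residual learning paradigm that models motion dynamics rather than directly predicting frames; and (3) a DCT-based noise refinement strategy during inference, complemented by a dynamics control module that balances motion smoothness and expressiveness.
Reference score (independently computed attention metric): 23.176184261595747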
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Image animation has seen significant progress, driven by the powerful generative capabilities of diffusion models. However, maintaining appearance consistency with static input images and mitigating abrupt motion transitions in generated animations remain persistent challenges. While text-to-video (T2V) generation has demonstrated impressive performance with diffusion transformer models, the image animation field still largely relies on U-Net-based diffusion models, which lag behind the latest T2V approaches. Moreover, the quadratic complexity of vanilla self-attention mechanisms in Transformers imposes heavy computational demands, making image animation particularly resource-intensive. To address these issues, we propose MiraMo, a framework designed to enhance efficiency, appearance consistency, and motion smoothness in image animation. Specifically, MiraMo introduces three key elements: (1) A foundational text-to-video architecture replacing vanilla self-attention with efficient linear attention to reduce computational overhead while preserving generation quality; (2) A novel motion residual learning paradigm that focuses on modeling motion dynamics rather than directly predicting frames, improving temporal consistency; and (3) A DCT-based noise refinement strategy during inference to suppress sudden motion artifacts, complemented by a dynamics control module to balance motion smoothness and expressiveness. Extensive experiments against state-of-the-art methods validate the superiority of MiraMo in generating consistent, smooth, and controllable animations with accelerated inference speed. Additionally, we demonstrate the versatility of MiraMo through applications in motion transfer and video editing tasks.
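The abstract's first element contrasts the quadratic cost of vanilla self-attention with linear attention. As a rough illustration only, the NumPy sketch below shows one common kernelized form of linear attention next to standard softmax attention. The abstract does not say which linear-attention variant MiraMo uses, so the `elu(x) + 1` feature map, the function names, and the single-head, unbatched shapes are all assumptions made for this example.

```python
import numpy as np

def softmax_attention(q, k, v):
    """Vanilla attention: materializes an (n, n) weight matrix, so cost is O(n^2)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def feature_map(x):
    """elu(x) + 1, a positive feature map (an assumed choice, not taken from the paper)."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def linear_attention(q, k, v, eps=1e-6):
    """Kernelized attention: never forms the (n, n) matrix, so cost is O(n)."""
    qf, kf = feature_map(q), feature_map(k)   # (n, d) each
    kv = kf.T @ v                             # (d, d_v), summed over all tokens once
    z = kf.sum(axis=0)                        # (d,), normalizer statistics
    numerator = qf @ kv                       # (n, d_v)
    denominator = qf @ z + eps                # (n,)
    return numerator / denominator[:, None]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, d_v = 16, 8, 4                      # tokens, key dim, value dim (illustrative)
    q = rng.standard_normal((n, d))
    k = rng.standard_normal((n, d))
    v = rng.standard_normal((n, d_v))
    print(softmax_attention(q, k, v).shape, linear_attention(q, k, v).shape)
```

Because `kf.T @ v` is a small `(d, d_v)` matrix computed once and reused for every query, the work grows linearly with the number of tokens `n` instead of with `n^2`. The two functions are not numerically equivalent; linear attention replaces the softmax kernel with a different similarity function.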

Related papers