Fugu-MT Paper Translation (Abstract): CoMo: Compositional Motion Customization for Text-to-Video Generation

Paper overview: CoMo: Compositional Motion Customization for Text-to-Video Generation

arxiv url: http://arxiv.org/abs/2510.23007v1
Date: Mon, 27 Oct 2025 04:57:09 GMT
Status: Translation complete
Last updated in system: 2025-10-28 15:28:15.453484
Title: CoMo: Compositional Motion Customization for Text-to-Video Generation
Title (reference translation): CoMo: Compositional Motion Customization for Text-to-Video Generation
Authors: Youcan Xu, Zhen Wang, Jiaxin Shi, Kexin Li, Feifei Shao, Jun Xiao, Yi Yang, Jun Yu, Long Chen
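Abstract summary: CoMo is a new framework for compositional motion customization. It addresses the challenges of motion-appearance entanglement and ineffective multi-motion blending. CoMo achieves state-of-the-art performance and significantly advances the capabilities of controllable video generation.
Reference score (independently computed attention metric): 40.446146411270156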
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: While recent text-to-video models excel at generating diverse scenes, they struggle with precise motion control, particularly for complex, multi-subject motions. Although methods for single-motion customization have been developed to address this gap, they fail in compositional scenarios due to two primary challenges: motion-appearance entanglement and ineffective multi-motion blending. This paper introduces CoMo, a novel framework for $\textbf{compositional motion customization}$ in text-to-video generation, enabling the synthesis of multiple, distinct motions within a single video. CoMo addresses these issues through a two-phase approach. First, in the single-motion learning phase, a static-dynamic decoupled tuning paradigm disentangles motion from appearance to learn a motion-specific module. Second, in the multi-motion composition phase, a plug-and-play divide-and-merge strategy composes these learned motions without additional training by spatially isolating their influence during the denoising process. To facilitate research in this new domain, we also introduce a new benchmark and a novel evaluation metric designed to assess multi-motion fidelity and blending. Extensive experiments demonstrate that CoMo achieves state-of-the-art performance, significantly advancing the capabilities of controllable video generation. Our project page is at https://como6.github.io/.
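Abstract (reference translation): Recent text-to-video models excel at generating diverse scenes but struggle with precise motion control, especially for complex multi-subject motions. Single-motion customization methods have been developed to address this gap, but they fail in compositional scenarios because of two main challenges: motion-appearance entanglement and ineffective multi-motion blending. This paper introduces CoMo, a new framework for $\textbf{compositional motion customization}$ in text-to-video generation. CoMo addresses these issues with a two-phase approach. First, in the single-motion learning phase, a static-dynamic decoupled tuning paradigm disentangles motion from appearance and learns a motion-specific module. Second, in the multi-motion composition phase, a plug-and-play divide-and-merge strategy composes the learned motions by spatially isolating their influence during the denoising process. To facilitate research in this area, the authors introduce a new benchmark and a new evaluation metric for assessing multi-motion fidelity and blending. Extensive experiments show that CoMo achieves state-of-the-art performance and significantly advances the capabilities of controllable video generation. The project page is at https://como6.github.io/.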
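
The abstract says the learned motions are composed by spatially isolating their influence during denoising, but it does not give the mechanism. The following is only a minimal sketch of one way such isolation could look, assuming each motion has its own noise prediction and a binary region mask. The function name `merge_noise_predictions`, the masks, and the averaging rule for overlapping regions are all hypothetical and are not taken from the paper.

```python
def merge_noise_predictions(predictions, masks, background):
    """Blend per-motion noise predictions using spatial masks.

    predictions: list of 2D grids, one hypothetical denoiser output per motion
    masks:       list of 2D grids of 0/1 marking the region each motion owns
    background:  2D grid used wherever no mask is active
    (All three are illustrative assumptions, not the paper's interface.)
    """
    height, width = len(background), len(background[0])
    merged = [row[:] for row in background]  # copy so the input is not mutated
    for y in range(height):
        for x in range(width):
            # Collect the predictions whose mask covers this location.
            active = [p[y][x] for p, m in zip(predictions, masks) if m[y][x]]
            if active:
                # Assumed rule: average where regions overlap.
                merged[y][x] = sum(active) / len(active)
    return merged


# Toy 2x4 latent: motion A owns the left half, motion B the right half.
pred_a = [[1.0] * 4 for _ in range(2)]
pred_b = [[-1.0] * 4 for _ in range(2)]
mask_a = [[1, 1, 0, 0] for _ in range(2)]
mask_b = [[0, 0, 1, 1] for _ in range(2)]
base = [[0.0] * 4 for _ in range(2)]

merged = merge_noise_predictions([pred_a, pred_b], [mask_a, mask_b], base)
```

In this toy case each location takes its value from exactly one motion's prediction, which is the sense of "spatial isolation" assumed here. A real pipeline would apply such a merge at each denoising step on latent tensors rather than nested lists.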

Related papers