CoMo: Compositional Motion Customization for Text-to-Video Generation
- URL: http://arxiv.org/abs/2510.23007v1
- Date: Mon, 27 Oct 2025 04:57:09 GMT
- Title: CoMo: Compositional Motion Customization for Text-to-Video Generation
- Authors: Youcan Xu, Zhen Wang, Jiaxin Shi, Kexin Li, Feifei Shao, Jun Xiao, Yi Yang, Jun Yu, Long Chen
- Abstract summary: CoMo is a novel framework for $\textbf{compositional motion customization}$ in text-to-video generation. It addresses the challenges of motion-appearance entanglement and ineffective multi-motion blending. CoMo achieves state-of-the-art performance, significantly advancing the capabilities of controllable video generation.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While recent text-to-video models excel at generating diverse scenes, they struggle with precise motion control, particularly for complex, multi-subject motions. Although methods for single-motion customization have been developed to address this gap, they fail in compositional scenarios due to two primary challenges: motion-appearance entanglement and ineffective multi-motion blending. This paper introduces CoMo, a novel framework for $\textbf{compositional motion customization}$ in text-to-video generation, enabling the synthesis of multiple, distinct motions within a single video. CoMo addresses these issues through a two-phase approach. First, in the single-motion learning phase, a static-dynamic decoupled tuning paradigm disentangles motion from appearance to learn a motion-specific module. Second, in the multi-motion composition phase, a plug-and-play divide-and-merge strategy composes these learned motions without additional training by spatially isolating their influence during the denoising process. To facilitate research in this new domain, we also introduce a new benchmark and a novel evaluation metric designed to assess multi-motion fidelity and blending. Extensive experiments demonstrate that CoMo achieves state-of-the-art performance, significantly advancing the capabilities of controllable video generation. Our project page is at https://como6.github.io/.
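At a high level, the plug-and-play divide-and-merge composition can be pictured as masked blending of per-motion denoising predictions. The following is a minimal sketch, not the authors' released code: it assumes each learned motion is exposed as a separate noise predictor over the same latent space, and that user-supplied spatial masks mark where each motion should act; every name here (divide_and_merge_step, NoisePredictor, the mask shapes) is illustrative.

```python
import torch
from typing import Callable, List

# NoisePredictor: maps (latents, timestep) -> predicted noise of the same shape.
NoisePredictor = Callable[[torch.Tensor, int], torch.Tensor]

def divide_and_merge_step(
    latents: torch.Tensor,                # (B, C, F, H, W) video latents
    t: int,                               # current denoising timestep
    base_model: NoisePredictor,           # base T2V model, no motion module active
    motion_models: List[NoisePredictor],  # base model with one learned motion module each
    masks: List[torch.Tensor],            # per-motion region masks, (1, 1, 1, H, W) in [0, 1]
) -> torch.Tensor:
    """Blend per-motion noise predictions with spatial masks so that each
    learned motion only influences its own region of the latent."""
    merged = base_model(latents, t)       # prediction for unmasked (background) regions
    for model, mask in zip(motion_models, masks):
        merged = mask * model(latents, t) + (1.0 - mask) * merged
    return merged

# Toy usage with random stand-ins for real denoisers:
latents = torch.randn(1, 4, 8, 32, 32)
base = lambda x, t: torch.randn_like(x)
motions = [lambda x, t: torch.randn_like(x), lambda x, t: torch.randn_like(x)]
left = torch.zeros(1, 1, 1, 32, 32)
left[..., :16] = 1.0                      # left half of every frame
merged = divide_and_merge_step(latents, 10, base, motions, [left, 1.0 - left])
print(merged.shape)                       # torch.Size([1, 4, 8, 32, 32])
```

Because the blending happens purely at inference time, the individually learned motion modules never have to be retrained jointly, which is what makes the composition plug-and-play.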
Related papers
- SynMotion: Semantic-Visual Adaptation for Motion Customized Video Generation
SynMotion is a motion-customized video generation model that jointly leverages semantic guidance and visual adaptation.
At the semantic level, we introduce a dual-embedding semantic comprehension mechanism which disentangles subject and motion representations.
At the visual level, we integrate efficient motion adapters into a pre-trained video generation model to enhance motion fidelity and temporal coherence.
arXiv Detail & Related papers (2025-06-30T10:09:32Z)
- ConMo: Controllable Motion Disentanglement and Recomposition for Zero-Shot Motion Transfer
ConMo is a framework that disentangles and recomposes the motions of subjects and camera movements.
It enables more accurate motion control across diverse subjects and improves performance in multi-subject scenarios.
ConMo unlocks a wide range of applications, including subject size and position editing, subject removal, semantic modifications, and camera motion simulation.
arXiv Detail & Related papers (2025-04-03T10:15:52Z)
- Motion Anything: Any to Motion Generation
Motion Anything is a multimodal motion generation framework.
Our model adaptively encodes multimodal conditions, including text and music, improving controllability.
The Text-Music-Dance dataset consists of 2,153 pairs of text, music, and dance, making it twice the size of AIST++.
arXiv Detail & Related papers (2025-03-10T06:04:31Z)
- Motion Prompting: Controlling Video Generation with Motion Trajectories
We train a video generation model conditioned on sparse or dense video trajectories.
We translate high-level user requests into detailed, semi-dense motion prompts.
We demonstrate our approach through various applications, including camera and object motion control, "interacting" with an image, motion transfer, and image editing.
arXiv Detail & Related papers (2024-12-03T18:59:56Z)
- Animate Your Motion: Turning Still Images into Dynamic Videos
We introduce Scene and Motion Conditional Diffusion (SMCD), a novel methodology for managing multimodal inputs.
SMCD incorporates a recognized motion conditioning module and investigates various approaches to integrate scene conditions.
Our design significantly enhances video quality, motion precision, and semantic coherence.
arXiv Detail & Related papers (2024-03-15T10:36:24Z)
- MotionMix: Weakly-Supervised Diffusion for Controllable Motion Generation
MotionMix is a weakly-supervised diffusion model that leverages both noisy and unannotated motion sequences.
Our framework consistently achieves state-of-the-art performances on text-to-motion, action-to-motion, and music-to-dance tasks.
arXiv Detail & Related papers (2024-01-20T04:58:06Z)
- FineMoGen: Fine-Grained Spatio-Temporal Motion Generation and Editing
FineMoGen is a diffusion-based motion generation and editing framework.
It can synthesize fine-grained motions with spatio-temporal composition according to user instructions.
FineMoGen further enables zero-shot motion editing capabilities with the aid of modern large language models.
arXiv Detail & Related papers (2023-12-22T16:56:02Z)
- MotionCrafter: One-Shot Motion Customization of Diffusion Models
We introduce MotionCrafter, a one-shot instance-guided motion customization method.
MotionCrafter employs a parallel spatial-temporal architecture that injects the reference motion into the temporal component of the base model.
During training, a frozen base model provides appearance normalization, effectively separating appearance from motion.
arXiv Detail & Related papers (2023-12-08T16:31:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.