CoMo: Controllable Motion Generation through Language Guided Pose Code Editing
- URL: http://arxiv.org/abs/2403.13900v1
- Date: Wed, 20 Mar 2024 18:11:10 GMT
- Title: CoMo: Controllable Motion Generation through Language Guided Pose Code Editing
- Authors: Yiming Huang, Weilin Wan, Yue Yang, Chris Callison-Burch, Mark Yatskar, Lingjie Liu
- Abstract summary: We introduce CoMo, a Controllable Motion generation model, adept at accurately generating and editing motions.
CoMo decomposes motions into discrete and semantically meaningful pose codes.
It autoregressively generates sequences of pose codes, which are then decoded into 3D motions.
- Score: 57.882299081820626
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Text-to-motion models excel at efficient human motion generation, but existing approaches lack fine-grained controllability over the generation process. Consequently, modifying subtle postures within a motion or inserting new actions at specific moments remains a challenge, limiting the applicability of these methods in diverse scenarios. In light of these challenges, we introduce CoMo, a Controllable Motion generation model, adept at accurately generating and editing motions by leveraging the knowledge priors of large language models (LLMs). Specifically, CoMo decomposes motions into discrete and semantically meaningful pose codes, with each code encapsulating the semantics of a body part, representing elementary information such as "left knee slightly bent". Given textual inputs, CoMo autoregressively generates sequences of pose codes, which are then decoded into 3D motions. Leveraging pose codes as interpretable representations, an LLM can directly intervene in motion editing by adjusting the pose codes according to editing instructions. Experiments demonstrate that CoMo achieves competitive performance in motion generation compared to state-of-the-art models while, in human studies, CoMo substantially surpasses previous work in motion editing abilities.
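The abstract describes a three-stage pipeline: motions are quantized into per-body-part pose codes, a text-conditioned model autoregressively predicts code sequences, and an LLM edits those codes to modify the motion. The Python sketch below is a minimal, hypothetical illustration of that data flow; the class and function names (PoseCode, generate_pose_codes, edit_codes_with_llm, decode_to_motion) and all shapes are assumptions for illustration, not CoMo's actual interface.

```python
# Minimal, hypothetical sketch of the pose-code pipeline described above.
# All names, shapes, and rules are illustrative assumptions, not CoMo's API.
from dataclasses import dataclass
from typing import List

@dataclass
class PoseCode:
    body_part: str   # e.g. "left_knee"
    token: int       # index into a discrete codebook
    meaning: str     # human-readable semantics, e.g. "slightly bent"

def generate_pose_codes(text: str, num_frames: int) -> List[List[PoseCode]]:
    """Autoregressively predict one set of per-body-part codes per frame.
    A real model would condition a transformer on the text; this stub
    just returns a fixed placeholder sequence."""
    return [[PoseCode("left_knee", 3, "slightly bent"),
             PoseCode("torso", 7, "upright")]
            for _ in range(num_frames)]

def edit_codes_with_llm(codes: List[List[PoseCode]], instruction: str) -> List[List[PoseCode]]:
    """Editing step: an LLM reads the interpretable codes plus an instruction
    such as "bend the left knee more" and returns adjusted codes.
    The LLM call is faked here with a trivial rule."""
    for frame in codes:
        for code in frame:
            if "knee" in instruction and code.body_part == "left_knee":
                code.token += 1
                code.meaning = "more bent"
    return codes

def decode_to_motion(codes: List[List[PoseCode]]) -> List[List[float]]:
    """Map discrete codes back to continuous 3D pose parameters (stub):
    one vector of joint values per frame."""
    return [[0.0] * 22 for _ in codes]

codes = generate_pose_codes("a person walks forward", num_frames=60)
codes = edit_codes_with_llm(codes, "bend the left knee more")
motion = decode_to_motion(codes)
print(len(motion), "frames decoded;", codes[0][0].meaning)
```

The point of the sketch is only that the intermediate representation is discrete and human-readable, which is what allows a text-only LLM to intervene between generation and decoding.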
Related papers
- Human Motion Instruction Tuning [30.71209562108675]
This paper presents LLaMo, a framework for human motion instruction tuning.
LLaMo retains motion in its native form for instruction tuning.
By processing both video and motion data alongside textual inputs, LLaMo enables a flexible, human-centric analysis.
arXiv Detail & Related papers (2024-11-25T14:38:43Z)
- Monkey See, Monkey Do: Harnessing Self-attention in Motion Diffusion for Zero-shot Motion Transfer [55.109778609058154]
Existing diffusion-based motion editing methods overlook the profound potential of the prior embedded within the weights of pre-trained models.
We uncover the roles and interactions of attention elements in capturing and representing motion patterns.
We integrate these elements to transfer a leader motion to a follower one while maintaining the nuanced characteristics of the follower, resulting in zero-shot motion transfer.
arXiv Detail & Related papers (2024-06-10T17:47:14Z)
- MotionFollower: Editing Video Motion via Lightweight Score-Guided Diffusion [94.66090422753126]
MotionFollower is a lightweight score-guided diffusion model for video motion editing.
It delivers superior motion editing performance and, unlike prior methods, supports large camera movements and actions.
Compared with MotionEditor, the most advanced motion editing model, MotionFollower achieves an approximately 80% reduction in GPU memory usage.
arXiv Detail & Related papers (2024-05-30T17:57:30Z)
- FineMoGen: Fine-Grained Spatio-Temporal Motion Generation and Editing [56.29102849106382]
FineMoGen is a diffusion-based motion generation and editing framework.
It can synthesize fine-grained motions with spatio-temporal composition according to user instructions.
FineMoGen further enables zero-shot motion editing capabilities with the aid of modern large language models.
arXiv Detail & Related papers (2023-12-22T16:56:02Z)
- MotionCrafter: One-Shot Motion Customization of Diffusion Models [66.44642854791807]
We introduce MotionCrafter, a one-shot instance-guided motion customization method.
MotionCrafter employs a parallel spatial-temporal architecture that injects the reference motion into the temporal component of the base model.
During training, a frozen base model provides appearance normalization, effectively separating appearance from motion.
arXiv Detail & Related papers (2023-12-08T16:31:04Z)
- MotionDiffuse: Text-Driven Human Motion Generation with Diffusion Model [35.32967411186489]
MotionDiffuse is a diffusion model-based text-driven motion generation framework.
It excels at modeling complicated data distributions and generating vivid motion sequences.
It responds to fine-grained instructions on body parts and supports arbitrary-length motion synthesis with time-varied text prompts.
arXiv Detail & Related papers (2022-08-31T17:58:54Z)
- MoDi: Unconditional Motion Synthesis from Diverse Data [51.676055380546494]
We present MoDi, an unconditional generative model that synthesizes diverse motions.
Our model is trained in a completely unsupervised setting from a diverse, unstructured and unlabeled motion dataset.
We show that despite the lack of any structure in the dataset, the latent space can be semantically clustered.
arXiv Detail & Related papers (2022-06-16T09:06:25Z)
- Self-supervised Motion Learning from Static Images [36.85209332144106]
Motion from Static Images (MoSI) learns to encode motion information.
MoSI can discover regions with large motion even without fine-tuning on the downstream datasets.
arXiv Detail & Related papers (2021-04-01T03:55:50Z)