ControlMM: Controllable Masked Motion Generation
- URL: http://arxiv.org/abs/2410.10780v1
- Date: Mon, 14 Oct 2024 17:50:27 GMT
- Title: ControlMM: Controllable Masked Motion Generation
- Authors: Ekkasit Pinyoanuntapong, Muhammad Usama Saleem, Korrawe Karunratanakul, Pu Wang, Hongfei Xue, Chen Chen, Chuan Guo, Junli Cao, Jian Ren, Sergey Tulyakov
- Abstract summary: We propose ControlMM, a novel approach incorporating spatial control signals into the generative masked motion model.
ControlMM achieves real-time, high-fidelity, and high-precision controllable motion generation simultaneously.
ControlMM generates motions 20 times faster than diffusion-based methods.
- Score: 38.16884934336603
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in motion diffusion models have enabled spatially controllable text-to-motion generation. However, despite achieving acceptable control precision, these models suffer from generation speed and fidelity limitations. To address these challenges, we propose ControlMM, a novel approach incorporating spatial control signals into the generative masked motion model. ControlMM achieves real-time, high-fidelity, and high-precision controllable motion generation simultaneously. Our approach introduces two key innovations. First, we propose masked consistency modeling, which ensures high-fidelity motion generation via random masking and reconstruction, while minimizing the inconsistency between the input control signals and the extracted control signals from the generated motion. To further enhance control precision, we introduce inference-time logit editing, which manipulates the predicted conditional motion distribution so that the generated motion, sampled from the adjusted distribution, closely adheres to the input control signals. During inference, ControlMM enables parallel and iterative decoding of multiple motion tokens, allowing for high-speed motion generation. Extensive experiments show that, compared to the state of the art, ControlMM delivers superior results in motion quality, with better FID scores (0.061 vs 0.271), and higher control precision (average error 0.0091 vs 0.0108). ControlMM generates motions 20 times faster than diffusion-based methods. Additionally, ControlMM unlocks diverse applications such as any joint any frame control, body part timeline control, and obstacle avoidance. Video visualization can be found at https://exitudio.github.io/ControlMM-page
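As an illustration of the masked consistency modeling described in the abstract (random masking and reconstruction, plus a penalty on the mismatch between input control signals and the control signals extracted from the generated motion), the following is a minimal training-step sketch. The module interfaces (`model`, `decoder`), the assumption that token probabilities can be decoded differentiably, and the loss weighting are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of masked consistency modeling: random masking and
# reconstruction, plus a loss tying the decoded motion to the input
# control signals. All names and shapes are illustrative assumptions.
import torch
import torch.nn.functional as F

def masked_consistency_step(model, decoder, tokens, text_emb,
                            control_signal, control_mask, mask_id,
                            mask_ratio=0.5, lambda_ctrl=1.0):
    """One training step on a batch of quantized motion-token sequences."""
    B, T = tokens.shape
    # Randomly mask a subset of motion tokens.
    mask = torch.rand(B, T, device=tokens.device) < mask_ratio
    inputs = tokens.masked_fill(mask, mask_id)

    # Predict the masked tokens conditioned on the text embedding and
    # the spatial control signal.
    logits = model(inputs, text_emb, control_signal)            # (B, T, V)
    recon_loss = F.cross_entropy(logits[mask], tokens[mask])

    # Decode the predicted distribution to continuous motion (assumed to
    # be differentiable, e.g. soft token embeddings through a VQ decoder)
    # and penalize mismatch between the controlled joint positions
    # extracted from it and the input control signal.
    motion = decoder(logits.softmax(dim=-1))                    # (B, T, J, 3)
    consistency_loss = F.l1_loss(motion[control_mask],
                                 control_signal[control_mask])

    return recon_loss + lambda_ctrl * consistency_loss
```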
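The inference-time logit editing and parallel iterative decoding can be sketched along similar lines. The gradient-based adjustment of the logits toward the control signal and the confidence-based re-masking schedule below are assumptions based on the abstract's description, not the paper's exact procedure.

```python
# Rough sketch of parallel, iterative masked decoding with inference-time
# logit editing. Names, schedules, and the edit rule are assumptions.
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, decoder, text_emb, control_signal, control_mask,
             seq_len, vocab_size, mask_id, num_iters=10,
             edit_steps=5, edit_lr=0.1):
    tokens = torch.full((1, seq_len), mask_id, dtype=torch.long)
    for it in range(num_iters):
        logits = model(tokens, text_emb, control_signal)        # (1, T, V)

        # Inference-time logit editing: nudge the predicted distribution
        # so that the decoded motion adheres to the control signal.
        edited = logits.detach().clone().requires_grad_(True)
        for _ in range(edit_steps):
            with torch.enable_grad():
                motion = decoder(edited.softmax(dim=-1))        # (1, T, J, 3)
                err = (motion[control_mask]
                       - control_signal[control_mask]).abs().mean()
                grad, = torch.autograd.grad(err, edited)
            edited = (edited - edit_lr * grad).detach().requires_grad_(True)

        # Parallel decoding: keep the most confident predictions and
        # re-mask the remaining positions for the next iteration.
        probs = edited.softmax(dim=-1)
        conf, pred = probs.max(dim=-1)
        keep = int(seq_len * (it + 1) / num_iters)
        idx = conf.topk(keep, dim=-1).indices
        tokens = torch.full_like(tokens, mask_id)
        tokens.scatter_(1, idx, pred.gather(1, idx))
    return decoder(F.one_hot(tokens, vocab_size).float())
```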
Related papers
- Uniformly Accelerated Motion Model for Inter Prediction [38.34487653360328]
In natural videos, there are usually multiple moving objects with variable velocity, resulting in complex motion fields that are difficult to represent compactly.
In Versatile Video Coding (VVC), existing inter prediction methods assume uniform speed motion between consecutive frames.
We introduce a uniformly accelerated motion model (UAMM) to exploit motion-related elements (velocity, acceleration) of moving objects between the video frames.
arXiv Detail & Related papers (2024-07-16T09:46:29Z)
- MotionLCM: Real-time Controllable Motion Generation via Latent Consistency Model [29.93359157128045]
This work introduces MotionLCM, extending controllable motion generation to a real-time level.
We first propose the motion latent consistency model (MotionLCM) for motion generation, building upon the latent diffusion model.
By adopting one-step (or few-step) inference, we further improve the runtime efficiency of the motion latent diffusion model for motion generation.
arXiv Detail & Related papers (2024-04-30T17:59:47Z)
- MMM: Generative Masked Motion Model [10.215003912084944]
MMM is a novel yet simple motion generation paradigm based on Masked Motion Model.
By attending to motion and text tokens in all directions, MMM captures inherent dependency among motion tokens and semantic mapping between motion and text tokens.
On a single mid-range GPU, MMM is two orders of magnitude faster than editable motion diffusion models.
arXiv Detail & Related papers (2023-12-06T16:35:59Z)
- Fine-grained Controllable Video Generation via Object Appearance and Context [74.23066823064575]
We propose fine-grained controllable video generation (FACTOR) to achieve detailed control.
FACTOR aims to control objects' appearances and context, including their location and category.
Our method achieves controllability of object appearances without finetuning, reducing the per-subject optimization effort for users.
arXiv Detail & Related papers (2023-12-05T17:47:33Z)
- EMDM: Efficient Motion Diffusion Model for Fast and High-Quality Motion Generation [57.539634387672656]
Current state-of-the-art generative diffusion models have produced impressive results but struggle to achieve fast generation without sacrificing quality.
We introduce Efficient Motion Diffusion Model (EMDM) for fast and high-quality human motion generation.
arXiv Detail & Related papers (2023-12-04T18:58:38Z)
- OmniControl: Control Any Joint at Any Time for Human Motion Generation [46.293854851116215]
We present a novel approach named OmniControl for incorporating flexible spatial control signals into a text-conditioned human motion generation model.
We propose analytic spatial guidance that ensures the generated motion can tightly conform to the input control signals.
At the same time, realism guidance is introduced to refine all the joints to generate more coherent motion.
arXiv Detail & Related papers (2023-10-12T17:59:38Z)
- MotionGPT: Finetuned LLMs Are General-Purpose Motion Generators [108.67006263044772]
This paper presents a Motion General-Purpose generaTor (MotionGPT) that can use multimodal control signals.
We first quantize multimodal control signals into discrete codes and then formulate them in a unified prompt instruction.
Our MotionGPT demonstrates a unified human motion generation model with multimodal control signals by tuning a mere 0.4% of LLM parameters.
arXiv Detail & Related papers (2023-06-19T12:58:17Z)
- Cocktail: Mixing Multi-Modality Controls for Text-Conditional Image Generation [79.8881514424969]
Text-conditional diffusion models are able to generate high-fidelity images with diverse contents.
However, textual descriptions are often ambiguous about the intended imagery.
We propose Cocktail, a pipeline to mix various modalities into one embedding.
arXiv Detail & Related papers (2023-06-01T17:55:32Z)