MaskControl: Spatio-Temporal Control for Masked Motion Synthesis
- URL: http://arxiv.org/abs/2410.10780v3
- Date: Fri, 25 Jul 2025 11:24:03 GMT
- Title: MaskControl: Spatio-Temporal Control for Masked Motion Synthesis
- Authors: Ekkasit Pinyoanuntapong, Muhammad Usama Saleem, Korrawe Karunratanakul, Pu Wang, Hongfei Xue, Chen Chen, Chuan Guo, Junli Cao, Jian Ren, Sergey Tulyakov
- Abstract summary: We propose MaskControl, the first approach to introduce controllability to the generative masked motion model. First, the \textit{Logits Regularizer} implicitly perturbs logits at training time to align the distribution of motion tokens with the controlled joint positions. Second, \textit{Logit Optimization} explicitly reshapes the token distribution at inference time, forcing the generated motion to accurately align with the controlled joint positions.
- Score: 38.16884934336603
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in motion diffusion models have enabled spatially controllable text-to-motion generation. However, these models struggle to achieve high-precision control while maintaining high-quality motion generation. To address these challenges, we propose MaskControl, the first approach to introduce controllability to the generative masked motion model. Our approach introduces two key innovations. First, the \textit{Logits Regularizer} implicitly perturbs logits at training time to align the distribution of motion tokens with the controlled joint positions, while regularizing the categorical token prediction to ensure high-fidelity generation. Second, \textit{Logit Optimization} explicitly optimizes the predicted logits at inference time, directly reshaping the token distribution so that the generated motion accurately aligns with the controlled joint positions. Moreover, we introduce \textit{Differentiable Expectation Sampling (DES)} to overcome the non-differentiable token sampling process encountered by both the Logits Regularizer and Logit Optimization. Extensive experiments demonstrate that MaskControl outperforms state-of-the-art methods, achieving superior motion quality (FID decreases by ~77\%) and higher control precision (average error 0.91 vs. 1.08). Additionally, MaskControl enables diverse applications, including any-joint-any-frame control, body-part timeline control, and zero-shot objective control. Video visualization can be found at https://www.ekkasit.com/ControlMM-page/
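The abstract names \textit{Differentiable Expectation Sampling (DES)} without detailing it. Below is a minimal sketch of one way a differentiable expectation over a motion-token codebook can stand in for hard categorical sampling, assuming a PyTorch setting with a VQ codebook; the function name, tensor shapes, and overall framing are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def expectation_sample(logits: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Differentiable surrogate for drawing discrete motion tokens.

    Instead of sampling a hard token index (which blocks gradients),
    return the softmax-probability-weighted average of the codebook
    embeddings, so a joint-position loss computed downstream can
    back-propagate into the logits.

    logits:   (batch, seq, num_codes) unnormalized token scores
    codebook: (num_codes, dim) motion VQ-VAE codebook
    returns:  (batch, seq, dim) expected token embedding
    """
    probs = F.softmax(logits, dim=-1)
    return probs @ codebook

# Toy usage: gradients from any loss on the expected embeddings
# (e.g. a controlled-joint position error) reach the logits.
logits = torch.randn(2, 16, 512, requires_grad=True)
codebook = torch.randn(512, 64)
expectation_sample(logits, codebook).sum().backward()
print(logits.grad.shape)  # torch.Size([2, 16, 512])
```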
Related papers
- RobustSplat: Decoupling Densification and Dynamics for Transient-Free 3DGS [79.15416002879239]
3D Gaussian Splatting has gained significant attention for its real-time, photo-realistic rendering in novel-view synthesis and 3D modeling. Existing methods struggle with accurately modeling scenes affected by transient objects, leading to artifacts in the rendered images. We propose RobustSplat, a robust solution based on two critical designs.
arXiv Detail & Related papers (2025-06-03T11:13:48Z) - Interactive Video Generation via Domain Adaptation [7.397099215417549]
Text-conditioned diffusion models have emerged as powerful tools for high-quality video generation. Recent training-free approaches introduce attention masking to guide trajectory, but this often degrades quality. We identify two key failure modes in these methods, both of which we interpret as domain gap problems.
arXiv Detail & Related papers (2025-05-30T06:19:47Z) - Towards Robust and Controllable Text-to-Motion via Masked Autoregressive Diffusion [33.9786226622757]
We propose a robust motion generation framework MoMADiff to generate 3D human motion from text descriptions. Our model supports flexible user-provided specification, enabling precise control over both spatial and temporal aspects of motion synthesis. Our method consistently outperforms state-of-the-art models in motion quality, instruction fidelity, and adherence.
arXiv Detail & Related papers (2025-05-16T09:06:15Z) - Enabling Versatile Controls for Video Diffusion Models [18.131652071161266]
VCtrl is a novel framework designed to enable fine-grained control over pre-trained video diffusion models.
Comprehensive experiments and human evaluations demonstrate VCtrl effectively enhances controllability and generation quality.
arXiv Detail & Related papers (2025-03-21T09:48:00Z) - Mojito: Motion Trajectory and Intensity Control for Video Generation [79.85687620761186]
This paper introduces Mojito, a diffusion model that incorporates both motion trajectory and intensity control for text-to-video generation.
Experiments demonstrate Mojito's effectiveness in achieving precise trajectory and intensity control with high computational efficiency.
arXiv Detail & Related papers (2024-12-12T05:26:43Z) - Omegance: A Single Parameter for Various Granularities in Diffusion-Based Synthesis [55.00448838152145]
We show that we only need a single parameter $\omega$ to effectively control granularity in diffusion-based synthesis. This simple approach does not require model retraining or architectural modifications and incurs negligible computational overhead. The method demonstrates impressive performance across various image and video synthesis tasks and is adaptable to advanced diffusion models.
arXiv Detail & Related papers (2024-11-26T08:23:16Z) - ControlAR: Controllable Image Generation with Autoregressive Models [40.74890550081335]
We introduce ControlAR, an efficient framework for integrating spatial controls into autoregressive image generation models. ControlAR exploits the conditional decoding method to generate the next image token conditioned on the per-token fusion between control and image tokens. Results indicate that ControlAR surpasses previous state-of-the-art controllable diffusion models.
arXiv Detail & Related papers (2024-10-03T17:28:07Z) - Uniformly Accelerated Motion Model for Inter Prediction [38.34487653360328]
In natural videos, there are usually multiple moving objects with variable velocity, resulting in complex motion fields that are difficult to represent compactly.
In Versatile Video Coding (VVC), existing inter prediction methods assume uniform speed motion between consecutive frames.
We introduce a uniformly accelerated motion model (UAMM) to exploit motion-related elements (velocity, acceleration) of moving objects between the video frames.
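The kinematics underlying a uniformly accelerated motion model is standard; a small illustrative sketch of extrapolating a block's displacement from velocity and acceleration follows (variable names and the per-component form are assumptions for illustration, not VVC syntax).

```python
def predict_displacement(v, a, dt):
    """Uniformly accelerated motion: displacement after dt frames given
    initial velocity v (pixels/frame) and acceleration a (pixels/frame^2),
    i.e. d = v*dt + 0.5*a*dt^2 applied to each vector component."""
    return tuple(vi * dt + 0.5 * ai * dt * dt for vi, ai in zip(v, a))

# A block moving at (2, -1) px/frame and accelerating by (0.5, 0) px/frame^2
# is predicted to displace by (12.0, -4.0) pixels after 4 frames.
print(predict_displacement((2.0, -1.0), (0.5, 0.0), 4))
```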
arXiv Detail & Related papers (2024-07-16T09:46:29Z) - MotionLCM: Real-time Controllable Motion Generation via Latent Consistency Model [29.93359157128045]
This work introduces MotionLCM, extending controllable motion generation to a real-time level.
We first propose the motion latent consistency model (MotionLCM) for motion generation, building upon the latent diffusion model.
By adopting one-step (or few-step) inference, we further improve the runtime efficiency of the motion latent diffusion model for motion generation.
arXiv Detail & Related papers (2024-04-30T17:59:47Z) - ECNet: Effective Controllable Text-to-Image Diffusion Models [31.21525123716149]
We introduce two innovative solutions for conditional text-to-image models.
Firstly, we propose a Spatial Guidance Injector (SGI), which enhances conditional detail by encoding text inputs with precise annotation information.
Secondly, to overcome the issue of limited conditional supervision, we introduce Diffusion Consistency Loss.
This encourages consistency between the latent code at each time step and the input signal, thereby enhancing the robustness and accuracy of the output.
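As described, the loss ties each intermediate denoising estimate back to the conditioning signal. One plausible shape of such a term, with all names and the MSE form being assumptions rather than ECNet's exact definition, is:

```python
import torch
import torch.nn.functional as F

def diffusion_consistency_loss(x0_pred: torch.Tensor, cond_latent: torch.Tensor) -> torch.Tensor:
    """Penalize the distance between the clean latent predicted at a given
    timestep and a latent encoding of the conditioning signal, so every
    denoising step, not just the final one, stays anchored to the input."""
    return F.mse_loss(x0_pred, cond_latent)
```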
arXiv Detail & Related papers (2024-03-27T10:09:38Z) - When ControlNet Meets Inexplicit Masks: A Case Study of ControlNet on its Contour-following Ability [93.15085958220024]
ControlNet excels at creating content that closely matches precise contours in user-provided masks.
When these masks contain noise, as frequently occurs with non-expert users, the output can include unwanted artifacts.
This paper first highlights, through in-depth analysis, the crucial role of controlling the impact of such inexplicit masks across diverse deterioration levels.
An advanced Shape-aware ControlNet consisting of a deterioration estimator and a shape-prior modulation block is devised.
arXiv Detail & Related papers (2024-03-01T11:45:29Z) - MMM: Generative Masked Motion Model [10.215003912084944]
MMM is a novel yet simple motion generation paradigm based on Masked Motion Model.
By attending to motion and text tokens in all directions, MMM captures inherent dependency among motion tokens and semantic mapping between motion and text tokens.
MMM is two orders of magnitude faster on a single mid-range GPU than editable motion diffusion models.
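For readers unfamiliar with generative masked models, the sketch below illustrates the MaskGIT-style iterative parallel decoding that MMM-type models build on; the `model` signature, the linear unmasking schedule, and all names are illustrative assumptions.

```python
import torch

@torch.no_grad()
def masked_decode(model, text_emb, seq_len, mask_id, steps=10):
    """Start from an all-[MASK] motion-token sequence; at each step predict
    every masked position in parallel, keep the most confident predictions,
    and re-mask the rest. `model(tokens, text_emb)` is assumed to return
    logits of shape (1, seq_len, num_codes)."""
    tokens = torch.full((1, seq_len), mask_id, dtype=torch.long)
    for step in range(steps):
        probs = model(tokens, text_emb).softmax(-1)
        conf, pred = probs.max(-1)                         # per-position confidence
        conf = torch.where(tokens == mask_id, conf, torch.ones_like(conf))
        keep = int(seq_len * (step + 1) / steps)           # simple linear schedule
        idx = conf.topk(keep, dim=-1).indices
        new_tokens = torch.full_like(tokens, mask_id)
        new_tokens.scatter_(1, idx, pred.gather(1, idx))
        tokens = torch.where(tokens == mask_id, new_tokens, tokens)  # keep fixed tokens
    return tokens
```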
arXiv Detail & Related papers (2023-12-06T16:35:59Z) - Fine-grained Controllable Video Generation via Object Appearance and Context [74.23066823064575]
We propose fine-grained controllable video generation (FACTOR) to achieve detailed control.
FACTOR aims to control objects' appearances and context, including their location and category.
Our method achieves controllability of object appearances without finetuning, which reduces the per-subject optimization efforts for the users.
arXiv Detail & Related papers (2023-12-05T17:47:33Z) - EMDM: Efficient Motion Diffusion Model for Fast and High-Quality Motion Generation [57.539634387672656]
Current state-of-the-art generative diffusion models have produced impressive results but struggle to achieve fast generation without sacrificing quality.
We introduce Efficient Motion Diffusion Model (EMDM) for fast and high-quality human motion generation.
arXiv Detail & Related papers (2023-12-04T18:58:38Z) - OmniControl: Control Any Joint at Any Time for Human Motion Generation [46.293854851116215]
We present a novel approach named OmniControl for incorporating flexible spatial control signals into a text-conditioned human motion generation model.
We propose analytic spatial guidance that ensures the generated motion can tightly conform to the input control signals.
At the same time, realism guidance is introduced to refine all the joints to generate more coherent motion.
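The summary mentions analytic spatial guidance; a rough sketch of the general pattern, a gradient step on the controlled-joint position error applied to the denoised motion, follows (names, shapes, and the step rule are assumptions, not OmniControl's exact formulation).

```python
import torch

def spatial_guidance_step(motion: torch.Tensor, target: torch.Tensor,
                          mask: torch.Tensor, step_size: float = 0.1) -> torch.Tensor:
    """Nudge a denoised motion toward the spatial control signal.

    motion, target: (frames, joints, 3) joint positions
    mask:           1 where a joint/frame is controlled, 0 elsewhere
    """
    motion = motion.detach().requires_grad_(True)
    err = ((motion - target) ** 2 * mask).sum()     # error only at controlled joints
    grad, = torch.autograd.grad(err, motion)
    return (motion - step_size * grad).detach()

# Toy usage: control the pelvis (joint 0) on every frame.
motion = torch.randn(60, 22, 3)
target = torch.zeros_like(motion)
mask = torch.zeros_like(motion); mask[:, 0] = 1.0
guided = spatial_guidance_step(motion, target, mask)
```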
arXiv Detail & Related papers (2023-10-12T17:59:38Z) - MotionGPT: Finetuned LLMs Are General-Purpose Motion Generators [108.67006263044772]
This paper presents a Motion General-Purpose generaTor (MotionGPT) that can use multimodal control signals.
We first quantize multimodal control signals into discrete codes and then formulate them in a unified prompt instruction.
Our MotionGPT demonstrates a unified human motion generation model with multimodal control signals by tuning a mere 0.4% of LLM parameters.
arXiv Detail & Related papers (2023-06-19T12:58:17Z) - Cocktail: Mixing Multi-Modality Controls for Text-Conditional Image Generation [79.8881514424969]
Text-conditional diffusion models are able to generate high-fidelity images with diverse contents.
However, linguistic representations frequently exhibit ambiguous descriptions of the envisioned objective imagery.
We propose Cocktail, a pipeline to mix various modalities into one embedding.
arXiv Detail & Related papers (2023-06-01T17:55:32Z)