MotionFlux: Efficient Text-Guided Motion Generation through Rectified Flow Matching and Preference Alignment
- URL: http://arxiv.org/abs/2508.19527v1
- Date: Wed, 27 Aug 2025 02:45:09 GMT
- Title: MotionFlux: Efficient Text-Guided Motion Generation through Rectified Flow Matching and Preference Alignment
- Authors: Zhiting Gao, Dan Song, Diqiong Jiang, Chao Xue, An-An Liu
- Abstract summary: Motion generation is essential for animating virtual characters and embodied agents. TAPO and MotionFLUX form a unified system that outperforms state-of-the-art approaches in both semantic consistency and motion quality.
- Score: 38.42799902378583
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Motion generation is essential for animating virtual characters and embodied agents. While recent text-driven methods have made significant strides, they often struggle with achieving precise alignment between linguistic descriptions and motion semantics, as well as with the inefficiencies of slow, multi-step inference. To address these issues, we introduce TMR++ Aligned Preference Optimization (TAPO), an innovative framework that aligns subtle motion variations with textual modifiers and incorporates iterative adjustments to reinforce semantic grounding. To further enable real-time synthesis, we propose MotionFLUX, a high-speed generation framework based on deterministic rectified flow matching. Unlike traditional diffusion models, which require hundreds of denoising steps, MotionFLUX constructs optimal transport paths between noise distributions and motion spaces, facilitating real-time synthesis. The linearized probability paths reduce the need for multi-step sampling typical of sequential methods, significantly accelerating inference time without sacrificing motion quality. Experimental results demonstrate that, together, TAPO and MotionFLUX form a unified system that outperforms state-of-the-art approaches in both semantic consistency and motion quality, while also accelerating generation speed. The code and pretrained models will be released.
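The abstract includes no code, but the rectified-flow idea it describes is compact enough to illustrate. The sketch below is a minimal, hypothetical PyTorch rendering of rectified flow matching, not the released MotionFLUX implementation: `velocity_net`, the conditioning input `cond`, and the tensor shapes are placeholder assumptions.

```python
import torch

# Minimal sketch of rectified flow matching (not the authors' code).
# A velocity network v(x_t, t, c) is trained to predict the constant
# velocity (x1 - x0) along the straight line between noise x0 and
# clean motion x1; sampling then integrates that field with Euler steps.

def rectified_flow_loss(velocity_net, x1, cond):
    """x1: clean motion batch (B, T, D); cond: text embedding (placeholder)."""
    x0 = torch.randn_like(x1)                             # noise endpoint
    t = torch.rand(x1.shape[0], 1, 1, device=x1.device)   # uniform time in [0, 1)
    xt = (1.0 - t) * x0 + t * x1                          # linear (OT-style) path
    target = x1 - x0                                      # constant velocity along the path
    return ((velocity_net(xt, t, cond) - target) ** 2).mean()

@torch.no_grad()
def sample(velocity_net, cond, shape, num_steps=4):
    """Few-step Euler integration along the learned near-straight paths."""
    x = torch.randn(shape)                                # start from pure noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((shape[0], 1, 1), i * dt)
        x = x + dt * velocity_net(x, t, cond)             # one Euler step
    return x                                              # approximate motion sample
```

Because the learned paths are near-straight, a handful of Euler steps (here `num_steps=4`) can stand in for the hundreds of denoising steps a standard diffusion sampler would need, which is the source of the claimed speedup.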
Related papers
- ReAlign: Text-to-Motion Generation via Step-Aware Reward-Guided Alignment [38.82543734940858]
Text-to-motion generation holds immense potential for applications in gaming, film, and robotics. There exists a misalignment between text and motion distributions in diffusion models, which leads to semantically inconsistent motions. We propose Reward-guided sampling Alignment (ReAlign) to address this limitation. Our approach significantly improves text-motion alignment and motion quality compared to existing state-of-the-art methods.
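The ReAlign abstract does not spell out its sampling rule; the fragment below is only a generic, classifier-guidance-style illustration of reward-guided denoising, assuming a step-aware reward model that scores intermediate samples against the text. `denoiser`, `reward_model`, and `scale` are hypothetical names.

```python
import torch

# Generic reward-guided denoising step (illustrative only; ReAlign's
# actual update rule may differ). The intermediate sample is nudged
# along the gradient of a text-motion reward before the next step.

def reward_guided_step(denoiser, reward_model, x, t, cond, scale=0.1):
    x = x.detach().requires_grad_(True)
    reward = reward_model(x, t, cond).sum()     # step-aware reward score
    grad = torch.autograd.grad(reward, x)[0]    # direction of higher reward
    with torch.no_grad():
        x_next = denoiser(x, t, cond)           # ordinary denoising update
    return x_next + scale * grad                # reward-aligned correction
```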
arXiv Detail & Related papers (2025-11-24T15:23:36Z)
- DEMO: Disentangled Motion Latent Flow Matching for Fine-Grained Controllable Talking Portrait Synthesis [15.304037069236536]
DEMO is a flow-matching generative framework for audio-driven talking-head video synthesis. It delivers disentangled, high-fidelity control of lip motion, head pose, and eye gaze.
arXiv Detail & Related papers (2025-10-12T15:10:33Z)
- MotionVerse: A Unified Multimodal Framework for Motion Comprehension, Generation and Editing [53.98607267063729]
MotionVerse is a framework to comprehend, generate, and edit human motion in both single-person and multi-person scenarios. We employ a motion tokenizer with residual quantization, which converts continuous motion sequences into multi-stream discrete tokens. We also introduce a Delay Parallel Modeling strategy, which temporally staggers the encoding of residual token streams.
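Residual quantization itself is standard enough to sketch. The toy encoder below (codebook count, sizes, and dimensions invented, not MotionVerse's tokenizer) shows how each stage quantizes the residual left by the previous one, so every stage emits its own discrete token stream.

```python
import torch

# Toy residual vector quantization (RVQ) encoder, not MotionVerse's
# tokenizer: stage k quantizes the residual left by stages 0..k-1,
# so a D-dim latent frame becomes one token index per codebook.

def rvq_encode(x, codebooks):
    """x: (N, D) latent motion frames; codebooks: list of (K, D) tensors."""
    residual, streams = x, []
    for cb in codebooks:
        idx = torch.cdist(residual, cb).argmin(dim=1)  # nearest code per frame
        streams.append(idx)                            # one discrete token stream
        residual = residual - cb[idx]                  # hand the remainder onward
    return streams                                     # multi-stream tokens, coarse to fine

codebooks = [torch.randn(512, 64) for _ in range(4)]   # 4 streams, invented sizes
tokens = rvq_encode(torch.randn(8, 64), codebooks)     # 4 index tensors of shape (8,)
```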
arXiv Detail & Related papers (2025-09-28T04:20:56Z)
- Streaming Generation of Co-Speech Gestures via Accelerated Rolling Diffusion [0.881371061335494]
We introduce Accelerated Rolling Diffusion, a novel framework for streaming gesture generation. RDLA restructures the noise schedule into a stepwise ladder, allowing multiple frames to be denoised simultaneously. This significantly improves sampling efficiency while maintaining motion consistency, achieving up to a 2x speedup.
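The "stepwise ladder" can be visualized with a toy loop. The sketch below is a rough rendering of the general rolling-diffusion idea, not RDLA itself; `denoiser`, the window size, and the noise levels are all invented for illustration.

```python
import torch

# Toy rolling-ladder loop (illustrative, not RDLA): a window of W frames
# sits at staggered noise levels; each call denoises the whole window one
# rung, the cleanest frame is emitted, and fresh noise enters at the top.

def rolling_ladder(denoiser, cond, frame_dim=64, window=4, n_frames=16):
    levels = torch.linspace(1.0, 1.0 / window, window)  # staggered noise rungs
    buf = torch.randn(window, frame_dim)                # window of noisy frames
    out = []
    for _ in range(n_frames):
        buf = denoiser(buf, levels, cond)               # all frames step down one rung
        out.append(buf[-1])                             # lowest-noise frame is done
        buf = torch.cat([torch.randn(1, frame_dim), buf[:-1]])  # roll the window
    return torch.stack(out)                             # streamed gesture frames
```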
arXiv Detail & Related papers (2025-03-13T15:54:45Z)
- Motion-Aware Generative Frame Interpolation [23.380470636851022]
Flow-based frame interpolation methods ensure motion stability through estimated intermediate flow but often introduce severe artifacts in complex motion regions. Recent generative approaches, boosted by large-scale pre-trained video generation models, show promise in handling intricate scenes. We propose Motion-aware Generative frame interpolation (MoG), which synergizes intermediate flow guidance with generative capacities to enhance fidelity.
arXiv Detail & Related papers (2025-01-07T11:03:43Z)
- iMoT: Inertial Motion Transformer for Inertial Navigation [0.5199807441687141]
iMoT is an innovative Transformer-based inertial odometry method. It retrieves cross-modal information from motion and rotation modalities for accurate positional estimation. iMoT significantly outperforms state-of-the-art methods in delivering superior robustness and accuracy in trajectory reconstruction.
arXiv Detail & Related papers (2024-12-13T22:52:47Z)
- Motion Flow Matching for Human Motion Synthesis and Editing [75.13665467944314]
We propose Motion Flow Matching, a novel generative model for human motion generation featuring efficient sampling and effectiveness in motion editing applications.
Our method reduces the sampling complexity from a thousand steps in previous diffusion models to just ten, while achieving comparable performance on text-to-motion and action-to-motion generation benchmarks.
arXiv Detail & Related papers (2023-12-14T12:57:35Z)
- DiffusionPhase: Motion Diffusion in Frequency Domain [69.811762407278]
We introduce a learning-based method for generating high-quality human motion sequences from text descriptions.
Existing techniques struggle with motion diversity and smooth transitions in generating arbitrary-length motion sequences.
We develop a network encoder that converts the motion space into a compact yet expressive parameterized phase space.
arXiv Detail & Related papers (2023-12-07T04:39:22Z)
- SemanticBoost: Elevating Motion Generation with Augmented Textual Cues [73.83255805408126]
Our framework comprises a Semantic Enhancement module and a Context-Attuned Motion Denoiser (CAMD).
The CAMD approach provides an all-encompassing solution for generating high-quality, semantically consistent motion sequences.
Our experimental results demonstrate that SemanticBoost, as a diffusion-based method, outperforms auto-regressive-based techniques.
arXiv Detail & Related papers (2023-10-31T09:58:11Z)
- MoFusion: A Framework for Denoising-Diffusion-based Motion Synthesis [73.52948992990191]
MoFusion is a new denoising-diffusion-based framework for high-quality conditional human motion synthesis.
We present ways to introduce well-known kinematic losses for motion plausibility within the motion diffusion framework.
We demonstrate the effectiveness of MoFusion compared to the state of the art on established benchmarks in the literature.
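The summary does not list MoFusion's exact loss terms, but two "well-known kinematic losses" commonly used for motion plausibility are velocity matching and bone-length preservation. The sketch below (joint layout and weighting invented, not necessarily the paper's terms) shows how such regularizers could sit alongside a denoising loss.

```python
import torch

# Two common kinematic regularizers (hypothetical rendering, not
# necessarily MoFusion's exact terms): match frame-to-frame velocities
# and keep predicted bone lengths consistent with the ground truth.

def kinematic_losses(pred, gt, bone_pairs):
    """pred, gt: (B, T, J, 3) joint positions; bone_pairs: list of (i, j) joints."""
    vel = lambda x: x[:, 1:] - x[:, :-1]                       # per-frame velocity
    vel_loss = (vel(pred) - vel(gt)).pow(2).mean()             # smooth, matched motion
    def bone_len(x):
        return torch.stack([(x[..., i, :] - x[..., j, :]).norm(dim=-1)
                            for i, j in bone_pairs], dim=-1)   # (B, T, n_bones)
    bone_loss = (bone_len(pred) - bone_len(gt)).pow(2).mean()  # near-rigid skeleton
    return vel_loss + bone_loss                                # added to the denoising loss
```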
arXiv Detail & Related papers (2022-12-08T18:59:48Z)