ReAlign: Text-to-Motion Generation via Step-Aware Reward-Guided Alignment
- URL: http://arxiv.org/abs/2511.19217v1
- Date: Mon, 24 Nov 2025 15:23:36 GMT
- Title: ReAlign: Text-to-Motion Generation via Step-Aware Reward-Guided Alignment
- Authors: Wanjiang Weng, Xiaofeng Tan, Junbo Wang, Guo-Sen Xie, Pan Zhou, Hongsong Wang
- Abstract summary: Text-to-motion generation holds immense potential for applications in gaming, film, and robotics. There exists a misalignment between text and motion distributions in diffusion models, which leads to semantically inconsistent motions. We propose Reward-guided sampling Alignment (ReAlign) to address this limitation. Our approach significantly improves text-motion alignment and motion quality compared to existing state-of-the-art methods.
- Score: 38.82543734940858
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-to-motion generation, which synthesizes 3D human motions from text inputs, holds immense potential for applications in gaming, film, and robotics. Recently, diffusion-based methods have been shown to generate more diverse and realistic motion. However, there exists a misalignment between text and motion distributions in diffusion models, which leads to semantically inconsistent or low-quality motions. To address this limitation, we propose Reward-guided sampling Alignment (ReAlign), comprising a step-aware reward model that assesses alignment quality during denoising sampling and a reward-guided strategy that directs the diffusion process toward an optimally aligned distribution. This reward model integrates step-aware tokens and combines a text-aligned module for semantic consistency with a motion-aligned module for realism, refining noisy motions at each timestep to balance probability density and alignment. Extensive experiments on both motion generation and retrieval tasks demonstrate that our approach significantly improves text-motion alignment and motion quality compared to existing state-of-the-art methods.
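The reward-guided sampling idea in the abstract can be illustrated with a toy one-dimensional sketch. This is not the authors' implementation: the standard-normal score stands in for a trained motion denoiser, the quadratic reward and its linear step-dependent weight are assumptions chosen only for illustration, and all function names (`prior_score`, `reward_grad`, `guided_sample`, `mean_sample`) are hypothetical.

```python
import math
import random


def prior_score(x):
    """Score of a standard normal; a stand-in for a trained denoiser (assumption)."""
    return -x


def reward_grad(x, t, num_steps, target):
    """Gradient of a toy step-aware reward r(x, t) = -w(t) * (x - target)**2 / 2.

    The weight w(t) rises toward 1 as t approaches 0, so guidance is weak on
    very noisy samples and strongest near the end of sampling (assumed schedule).
    """
    w = 1.0 - t / num_steps
    return -w * (x - target)


def guided_sample(target, guidance, num_steps=50, step_size=0.1, seed=0):
    """Langevin-style sampling whose drift adds a scaled reward gradient."""
    rng = random.Random(seed)
    x = rng.gauss(0.0, 1.0)
    for t in reversed(range(num_steps)):
        drift = prior_score(x) + guidance * reward_grad(x, t, num_steps, target)
        # No noise is injected on the final step (t == 0).
        noise = rng.gauss(0.0, 1.0) if t > 0 else 0.0
        x = x + step_size * drift + math.sqrt(2.0 * step_size) * noise
    return x


def mean_sample(target, guidance, n=500):
    """Average of n independent samples, one per seed."""
    return sum(guided_sample(target, guidance, seed=s) for s in range(n)) / n


if __name__ == "__main__":
    print("unguided mean:", mean_sample(target=2.0, guidance=0.0))
    print("guided mean:  ", mean_sample(target=2.0, guidance=5.0))
```

With `guidance=0.0` the samples stay centered on the prior, while a positive `guidance` pulls them toward `target` without collapsing onto it, which is the density-versus-alignment trade-off the abstract mentions.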
Related papers
- MotionFlux: Efficient Text-Guided Motion Generation through Rectified Flow Matching and Preference Alignment [38.42799902378583]
Motion generation is essential for animating virtual characters and embodied agents. TAPO and MotionFLUX form a unified system that outperforms state-of-the-art approaches in both semantic consistency and motion quality.
arXiv Detail & Related papers (2025-08-27T02:45:09Z)
- ReAlign: Bilingual Text-to-Motion Generation via Step-Aware Reward-Guided Alignment [48.894439350114396]
We propose a novel bilingual human motion dataset, BiHumanML3D, which establishes a crucial benchmark for bilingual text-to-motion generation models. We also propose a Bilingual Motion Diffusion model (BiMD), which leverages cross-lingual aligned representations to capture semantics. We show that our approach significantly improves text-motion alignment and motion quality compared to existing state-of-the-art methods.
arXiv Detail & Related papers (2025-05-08T06:19:18Z)
- CASIM: Composite Aware Semantic Injection for Text to Motion Generation [15.53049009014166]
We propose a composite-aware semantic injection mechanism that learns the dynamic correspondence between text and motion tokens. Experiments on HumanML3D and KIT benchmarks demonstrate that CASIM consistently improves motion quality, text-motion alignment, and retrieval scores.
arXiv Detail & Related papers (2025-02-04T07:22:07Z)
- Motion Flow Matching for Human Motion Synthesis and Editing [75.13665467944314]
We propose Motion Flow Matching, a novel generative model for human motion generation featuring efficient sampling and effectiveness in motion editing applications.
Our method reduces the sampling complexity from a thousand steps in previous diffusion models to just ten steps, while achieving comparable performance on text-to-motion and action-to-motion generation benchmarks.
arXiv Detail & Related papers (2023-12-14T12:57:35Z)
- DiffusionPhase: Motion Diffusion in Frequency Domain [69.811762407278]
We introduce a learning-based method for generating high-quality human motion sequences from text descriptions.
Existing techniques struggle with motion diversity and smooth transitions in generating arbitrary-length motion sequences.
We develop a network encoder that converts the motion space into a compact yet expressive parameterized phase space.
arXiv Detail & Related papers (2023-12-07T04:39:22Z)
- SemanticBoost: Elevating Motion Generation with Augmented Textual Cues [73.83255805408126]
Our framework comprises a Semantic Enhancement module and a Context-Attuned Motion Denoiser (CAMD).
The CAMD approach provides an all-encompassing solution for generating high-quality, semantically consistent motion sequences.
Our experimental results demonstrate that SemanticBoost, as a diffusion-based method, outperforms auto-regressive-based techniques.
arXiv Detail & Related papers (2023-10-31T09:58:11Z)
- MoFusion: A Framework for Denoising-Diffusion-based Motion Synthesis [73.52948992990191]
MoFusion is a new denoising-diffusion-based framework for high-quality conditional human motion synthesis.
We present ways to introduce well-known kinematic losses for motion plausibility within the motion diffusion framework.
We demonstrate the effectiveness of MoFusion compared to the state of the art on established benchmarks in the literature.
arXiv Detail & Related papers (2022-12-08T18:59:48Z)