Related papers: ReAlign: Bilingual Text-to-Motion Generation via Step-Aware Reward-Guided Alignment

URL: http://arxiv.org/abs/2505.04974v2
Date: Fri, 01 Aug 2025 11:56:05 GMT
Title: ReAlign: Bilingual Text-to-Motion Generation via Step-Aware Reward-Guided Alignment
Authors: Wanjiang Weng, Xiaofeng Tan, Hongsong Wang, Pan Zhou
Abstract summary: We propose a novel bilingual human motion dataset, BiHumanML3D, which establishes a crucial benchmark for bilingual text-to-motion generation models. We also propose a Bilingual Motion Diffusion model (BiMD), which leverages cross-lingual aligned representations to capture semantics. We show that our approach significantly improves text-motion alignment and motion quality compared to existing state-of-the-art methods.
Score: 48.894439350114396
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Bilingual text-to-motion generation, which synthesizes 3D human motions from bilingual text inputs, holds immense potential for cross-linguistic applications in gaming, film, and robotics. However, this task faces critical challenges: the absence of bilingual motion-language datasets and the misalignment between text and motion distributions in diffusion models, leading to semantically inconsistent or low-quality motions. To address these challenges, we propose BiHumanML3D, a novel bilingual human motion dataset, which establishes a crucial benchmark for bilingual text-to-motion generation models. Furthermore, we propose a Bilingual Motion Diffusion model (BiMD), which leverages cross-lingual aligned representations to capture semantics, thereby achieving a unified bilingual model. Building upon this, we propose the Reward-guided sampling Alignment (ReAlign) method, comprising a step-aware reward model to assess alignment quality during sampling and a reward-guided strategy that directs the diffusion process toward an optimally aligned distribution. This reward model integrates step-aware tokens and combines a text-aligned module for semantic consistency and a motion-aligned module for realism, refining noisy motions at each timestep to balance probability density and alignment. Experiments demonstrate that our approach significantly improves text-motion alignment and motion quality compared to existing state-of-the-art methods. Project page: https://wengwanjiang.github.io/ReAlign-page/.
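
The idea of reward-guided sampling described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the denoiser, the quadratic reward, and every name below (`toy_denoise_step`, `reward_grad`, `reward_guided_sample`, `guidance`) are assumptions made for illustration, and a fixed reward stands in for the step-aware reward model. The sketch only shows how a reward gradient applied at each sampling step can pull samples toward a target while the denoiser pulls them toward the prior.

```python
import numpy as np

def toy_denoise_step(x, t, num_steps, rng):
    """Hypothetical stand-in for a diffusion denoiser.

    Shrinks x toward the prior mean (0) and adds noise that decays with t.
    """
    noise_scale = t / num_steps
    return 0.9 * x + 0.1 * noise_scale * rng.standard_normal(x.shape)

def reward_grad(x, target):
    """Gradient of the toy reward r(x) = -0.5 * ||x - target||^2."""
    return target - x

def reward_guided_sample(target, num_steps=50, guidance=0.2, seed=0):
    """Run the toy sampler, nudging x along the reward gradient at every step."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(target.shape)  # start from pure noise
    for t in range(num_steps, 0, -1):
        x = toy_denoise_step(x, t, num_steps, rng)   # pull toward the prior
        x = x + guidance * reward_grad(x, target)    # pull toward high reward
    return x

# Hypothetical "aligned" target; in a real system the reward would come from
# a learned model scoring text-motion alignment, not from a known target.
target = np.array([1.0, -2.0, 0.5])
unguided = reward_guided_sample(target, guidance=0.0)
guided = reward_guided_sample(target, guidance=0.2)

print("distance without guidance:", np.linalg.norm(unguided - target))
print("distance with guidance:   ", np.linalg.norm(guided - target))
```

In this toy setting the guided sample settles between the prior mean and the target, which loosely mirrors the trade-off between probability density and alignment that the abstract mentions; the `guidance` weight controls where that balance lands.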

Related papers

DiMo: Discrete Diffusion Modeling for Motion Generation and Understanding [25.254783224309488]
We present DiMo, a discrete diffusion-style framework, which extends masked modeling to text-motion understanding and generation. Unlike GPT-style autoregressive approaches that tokenize motion and decode sequentially, DiMo performs iterative masked token refinement. Experiments on HumanML3D and KIT-ML show strong motion quality and competitive bidirectional understanding.
arXiv Detail & Related papers (2026-02-04T04:01:02Z)
MoLingo: Motion-Language Alignment for Text-to-Motion Generation [50.33970522600594]
MoLingo is a text-to-motion (T2M) model that generates realistic, lifelike human motion by denoising in a continuous latent space. We propose a semantic-aligned motion encoder trained with frame-level text labels so that latents with similar text meaning stay close. We also compare single-token conditioning with a multi-token cross-attention scheme and find that cross-attention gives better motion realism and text-motion alignment.
arXiv Detail & Related papers (2025-12-15T19:22:40Z)
ReAlign: Text-to-Motion Generation via Step-Aware Reward-Guided Alignment [38.82543734940858]
Text-to-motion generation holds immense potential for applications in gaming, film, and robotics. There exists a misalignment between text and motion distributions in diffusion models, which leads to semantically inconsistent motions. We propose Reward-guided sampling Alignment (ReAlign) to address this limitation. Our approach significantly improves text-motion alignment and motion quality compared to existing state-of-the-art methods.
arXiv Detail & Related papers (2025-11-24T15:23:36Z)
UniHM: Universal Human Motion Generation with Object Interactions in Indoor Scenes [26.71077287710599]
We propose UniHM, a unified motion language model that leverages diffusion-based generation for scene-aware human motion. UniHM is the first framework to support both Text-to-Motion and Text-to-Human-Object Interaction (HOI) in complex 3D scenes. Our approach introduces three key contributions: (1) a mixed-motion representation that fuses continuous 6DoF motion with discrete local motion tokens to improve motion realism; (2) a novel Look-Up-Free Quantization VAE that surpasses traditional VQ-VAEs in both reconstruction accuracy and generative performance; and (3) an enriched version of
arXiv Detail & Related papers (2025-05-19T07:02:12Z)
MoTe: Learning Motion-Text Diffusion Model for Multiple Generation Tasks [30.333659816277823]
We present MoTe, a unified multi-modal model that can handle diverse tasks by learning the marginal, conditional, and joint distributions of motion and text simultaneously. MoTe is composed of three components: Motion-Decoder (MED), Text-Decoder (TED), and Motion-Text Diffusion Model (MTDM).
arXiv Detail & Related papers (2024-11-29T15:48:24Z)
SemanticBoost: Elevating Motion Generation with Augmented Textual Cues [73.83255805408126]
Our framework comprises a Semantic Enhancement module and a Context-Attuned Motion Denoiser (CAMD). The CAMD approach provides an all-encompassing solution for generating high-quality, semantically consistent motion sequences. Our experimental results demonstrate that SemanticBoost, as a diffusion-based method, outperforms auto-regressive-based techniques.
arXiv Detail & Related papers (2023-10-31T09:58:11Z)
Fg-T2M: Fine-Grained Text-Driven Human Motion Generation via Diffusion Model [11.873294782380984]
We propose a fine-grained method for generating high-quality, conditional human motion sequences that supports precise text descriptions. Our approach consists of two key components: 1) a linguistics-structure assisted module that constructs accurate and complete language features to fully utilize text information; and 2) a context-aware progressive reasoning module that learns neighborhood and overall semantic linguistics features from shallow and deep graph neural networks to achieve multi-step inference.
arXiv Detail & Related papers (2023-09-12T14:43:47Z)
DiverseMotion: Towards Diverse Human Motion Generation via Discrete Diffusion [70.33381660741861]
We present DiverseMotion, a new approach for synthesizing high-quality human motions conditioned on textual descriptions. We show that DiverseMotion achieves state-of-the-art motion quality and competitive motion diversity.
arXiv Detail & Related papers (2023-09-04T05:43:48Z)
Text-to-Motion Retrieval: Towards Joint Understanding of Human Motion Data and Natural Language [4.86658723641864]
We propose a novel text-to-motion retrieval task, which aims at retrieving relevant motions based on a specified natural-language description. Inspired by the recent progress in text-to-image/video matching, we experiment with two widely-adopted metric-learning loss functions.
arXiv Detail & Related papers (2023-05-25T08:32:41Z)
ABINet++: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Spotting [121.11880210592497]
We argue that the limited capacity of language models comes from 1) implicit language modeling; 2) unidirectional feature representation; and 3) language model with noise input. We propose an autonomous, bidirectional and iterative ABINet++ for scene text spotting.
arXiv Detail & Related papers (2022-11-19T03:50:33Z)
TM2T: Stochastic and Tokenized Modeling for the Reciprocal Generation of 3D Human Motions and Texts [20.336481832461168]
Inspired by the strong ties between vision and language, our paper aims to explore the generation of 3D human full-body motions from texts. We propose the use of motion tokens, a discrete and compact motion representation. Our approach is flexible and could be used for both text2motion and motion2text tasks.
arXiv Detail & Related papers (2022-07-04T19:52:18Z)
TEMOS: Generating diverse human motions from textual descriptions [53.85978336198444]
We address the problem of generating diverse 3D human motions from textual descriptions. We propose TEMOS, a text-conditioned generative model leveraging variational autoencoder (VAE) training with human motion data. We show that the TEMOS framework can produce both skeleton-based animations, as in prior work, as well as more expressive SMPL body motions.
arXiv Detail & Related papers (2022-04-25T14:53:06Z)
Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation [56.41614987789537]
Text-based video segmentation aims to segment the target object in a video based on a describing sentence. We propose a method to fuse and align appearance, motion, and linguistic features to achieve accurate segmentation.
arXiv Detail & Related papers (2022-04-06T02:42:33Z)

This list is automatically generated from the titles and abstracts of the papers on this site.