Think Before You Move: Latent Motion Reasoning for Text-to-Motion Generation
- URL: http://arxiv.org/abs/2512.24100v1
- Date: Tue, 30 Dec 2025 09:17:44 GMT
- Title: Think Before You Move: Latent Motion Reasoning for Text-to-Motion Generation
- Authors: Yijie Qian, Juncheng Wang, Yuxiang Feng, Chao Xu, Wang Lu, Yang Liu, Baigui Sun, Yiqiang Chen, Yong Liu, Shujun Wang
- Abstract summary: We argue that the solution lies in an architectural shift towards Latent System 2 Reasoning. We propose Latent Motion Reasoning (LMR), which reformulates generation as a two-stage Think-then-Act decision process. We demonstrate LMR's versatility by implementing it for two representative baselines: T2M-GPT (discrete) and MotionStreamer (continuous).
- Score: 37.496002022338395
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current state-of-the-art paradigms predominantly treat Text-to-Motion (T2M) generation as a direct translation problem, mapping symbolic language directly to continuous poses. While effective for simple actions, this System 1 approach faces a fundamental theoretical bottleneck we identify as the Semantic-Kinematic Impedance Mismatch: the inherent difficulty of grounding semantically dense, discrete linguistic intent into kinematically dense, high-frequency motion data in a single shot. In this paper, we argue that the solution lies in an architectural shift towards Latent System 2 Reasoning. Drawing inspiration from Hierarchical Motor Control in cognitive science, we propose Latent Motion Reasoning (LMR), which reformulates generation as a two-stage Think-then-Act decision process. Central to LMR is a novel Dual-Granularity Tokenizer that disentangles motion into two distinct manifolds: a compressed, semantically rich Reasoning Latent for planning global topology, and a high-frequency Execution Latent for preserving physical fidelity. By forcing the model to autoregressively reason (plan the coarse trajectory) before it moves (instantiates the frames), we effectively bridge the ineffability gap between language and physics. We demonstrate LMR's versatility by implementing it for two representative baselines: T2M-GPT (discrete) and MotionStreamer (continuous). Extensive experiments show that LMR yields non-trivial improvements in both semantic alignment and physical plausibility, validating that the optimal substrate for motion planning is not natural language, but a learned, motion-aligned concept space. Code and demos can be found at https://chenhaoqcdyq.github.io/LMR/
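The abstract specifies the architecture only at a high level, so the following is a minimal sketch of the Think-then-Act decoding loop it describes: a single autoregressive backbone first emits coarse Reasoning Latent tokens conditioned on the text, then emits Execution Latent tokens conditioned on the committed plan. All module and parameter names (`ThinkThenActDecoder`, `reason_head`, `exec_head`, the token counts) are illustrative assumptions, not the paper's actual interface.

```python
# Hedged sketch of a two-stage "Think-then-Act" autoregressive decoder.
# Names and shapes are assumptions; the paper's implementation may differ.
import torch
import torch.nn as nn

class ThinkThenActDecoder(nn.Module):
    def __init__(self, vocab_reason, vocab_exec, d_model=512):
        super().__init__()
        self.embed_reason = nn.Embedding(vocab_reason, d_model)
        self.embed_exec = nn.Embedding(vocab_exec, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=6)
        self.reason_head = nn.Linear(d_model, vocab_reason)  # coarse-plan logits
        self.exec_head = nn.Linear(d_model, vocab_exec)      # frame-level logits

    @torch.no_grad()
    def generate(self, text_emb, n_reason=16, n_exec=64):
        """text_emb: (B, T_text, d_model) from a frozen text encoder."""
        seq = text_emb
        # Stage 1, "Think": autoregressively emit coarse Reasoning Latents.
        reason = []
        for _ in range(n_reason):
            h = self.backbone(seq)[:, -1]          # state at the last position
            tok = self.reason_head(h).argmax(-1)   # greedy; sampling also works
            reason.append(tok)
            seq = torch.cat([seq, self.embed_reason(tok)[:, None]], dim=1)
        # Stage 2, "Act": emit Execution Latents conditioned on the full plan.
        execution = []
        for _ in range(n_exec):
            h = self.backbone(seq)[:, -1]
            tok = self.exec_head(h).argmax(-1)
            execution.append(tok)
            seq = torch.cat([seq, self.embed_exec(tok)[:, None]], dim=1)
        return torch.stack(reason, dim=1), torch.stack(execution, dim=1)
```

The design point this illustrates is that every Execution token attends to an already-committed coarse plan, which is what lets the model "think before it moves".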
Related papers
- DiMo: Discrete Diffusion Modeling for Motion Generation and Understanding [25.254783224309488]
We present DiMo, a discrete diffusion-style framework, which extends masked modeling to text-motion understanding and generation. Unlike GPT-style autoregressive approaches that tokenize motion and decode sequentially, DiMo performs iterative masked token refinement. Experiments on HumanML3D and KIT-ML show strong motion quality and competitive bidirectional understanding.
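To make the contrast with sequential decoding concrete, here is a generic MaskGIT-style masked-refinement loop of the kind the summary describes; the `predictor` interface and the cosine re-masking schedule are assumptions, not DiMo's actual implementation.

```python
# Generic iterative masked-token refinement (not DiMo's code): start fully
# masked, commit the most confident predictions, re-mask the rest, repeat.
import math
import torch

def masked_refine(predictor, seq_len, mask_id, steps=10, device="cpu"):
    tokens = torch.full((1, seq_len), mask_id, dtype=torch.long, device=device)
    for step in range(steps):
        logits = predictor(tokens)                 # (1, seq_len, vocab)
        conf, pred = logits.softmax(-1).max(-1)    # per-position confidence
        still_masked = tokens.eq(mask_id)
        tokens = torch.where(still_masked, pred, tokens)
        # Cosine schedule: how many positions stay masked after this step.
        keep = math.floor(seq_len * math.cos(math.pi / 2 * (step + 1) / steps))
        if keep > 0:
            # Re-mask the least confident of the freshly committed positions.
            conf = conf.masked_fill(~still_masked, float("inf"))
            remask = conf[0].topk(keep, largest=False).indices
            tokens[0, remask] = mask_id
    return tokens
```

Given any `predictor(tokens) -> logits` (e.g. a bidirectional transformer), this fills all positions in `steps` passes rather than `seq_len` sequential steps.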
arXiv Detail & Related papers (2026-02-04T04:01:02Z)
- MoLingo: Motion-Language Alignment for Text-to-Motion Generation [50.33970522600594]
MoLingo is a text-to-motion (T2M) model that generates realistic, lifelike human motion by denoising in a continuous latent space. We propose a semantic-aligned motion encoder trained with frame-level text labels so that latents with similar text meaning stay close. We also compare single-token conditioning with a multi-token cross-attention scheme and find that cross-attention gives better motion realism and text-motion alignment.
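The summary does not say how the semantic-aligned encoder is trained; a common way to make "latents with similar text meaning stay close" is a symmetric InfoNCE objective between frame-level motion latents and their text-label embeddings, sketched below as an assumption rather than MoLingo's actual loss.

```python
# Hedged sketch: CLIP-style symmetric InfoNCE alignment between motion
# latents and text embeddings. MoLingo's real objective may differ.
import torch
import torch.nn.functional as F

def alignment_loss(motion_latents, text_embeds, temperature=0.07):
    # motion_latents, text_embeds: (N, d), one matched pair per labeled frame.
    m = F.normalize(motion_latents, dim=-1)
    t = F.normalize(text_embeds, dim=-1)
    logits = m @ t.T / temperature                  # (N, N) similarity matrix
    targets = torch.arange(len(m), device=m.device)
    # Pull matched motion/text pairs together, push mismatched pairs apart.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```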
arXiv Detail & Related papers (2025-12-15T19:22:40Z)
- ReAlign: Bilingual Text-to-Motion Generation via Step-Aware Reward-Guided Alignment [48.894439350114396]
We propose a novel bilingual human motion dataset, BiHumanML3D, which establishes a crucial benchmark for bilingual text-to-motion generation models. We also propose a Bilingual Motion Diffusion model (BiMD), which leverages cross-lingual aligned representations to capture semantics. We show that our approach significantly improves text-motion alignment and motion quality compared to existing state-of-the-art methods.
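The title mentions step-aware reward-guided alignment; one standard realization, shown purely as a hedged illustration (the `denoise_step` and `reward_model` callables are assumed, not ReAlign's API), is to ascend the gradient of a timestep-conditioned reward at each reverse-diffusion step.

```python
# Generic step-aware reward guidance during diffusion sampling: at every
# denoising step, nudge the sample toward higher reward. Illustrative only.
import torch

def guided_sample(denoise_step, reward_model, x_T, timesteps, scale=1.0):
    x = x_T
    for t in timesteps:                       # e.g. range(T - 1, -1, -1)
        x = denoise_step(x, t)                # one reverse-diffusion step
        x = x.detach().requires_grad_(True)
        r = reward_model(x, t).sum()          # reward conditioned on the step t
        (grad,) = torch.autograd.grad(r, x)
        x = (x + scale * grad).detach()       # ascend the reward gradient
    return x
```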
arXiv Detail & Related papers (2025-05-08T06:19:18Z)
- KETA: Kinematic-Phrases-Enhanced Text-to-Motion Generation via Fine-grained Alignment [5.287416596074742]
State-of-the-art T2M techniques mainly leverage diffusion models to generate motions with text prompts as guidance. We propose KETA, which decomposes the given text into several fine-grained sub-texts via a language model. Experiments demonstrate that KETA achieves up to 1.19x and 2.34x better R-precision and FID values on both backbones of the base motion diffusion model.
arXiv Detail & Related papers (2025-01-25T03:43:33Z)
- Bridging the Gap between Human Motion and Action Semantics via Kinematic Phrases [59.32509533292653]
Motion understanding aims to establish a reliable mapping between motion and action semantics.
We propose Kinematic Phrases (KP), which capture the objective kinematic facts of human motion with proper abstraction, interpretability, and generality.
Based on KP, we can unify a motion knowledge base and build a motion understanding system.
arXiv Detail & Related papers (2023-10-06T12:08:15Z)
- DiverseMotion: Towards Diverse Human Motion Generation via Discrete Diffusion [70.33381660741861]
We present DiverseMotion, a new approach for synthesizing high-quality human motions conditioned on textual descriptions.
We show that our DiverseMotion achieves the state-of-the-art motion quality and competitive motion diversity.
arXiv Detail & Related papers (2023-09-04T05:43:48Z)
- Priority-Centric Human Motion Generation in Discrete Latent Space [59.401128190423535]
We introduce a Priority-Centric Motion Discrete Diffusion Model (M2DM) for text-to-motion generation.
M2DM incorporates a global self-attention mechanism and a regularization term to counteract code collapse.
We also present a motion discrete diffusion model that employs an innovative noise schedule, determined by the significance of each motion token.
arXiv Detail & Related papers (2023-08-28T10:40:16Z)
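The M2DM summary says the noise schedule is determined by the significance of each motion token; one plausible reading, sketched below with assumed inputs, is a forward corruption that masks low-significance tokens earlier, so the most important tokens are the last to be destroyed (and hence the first to be reconstructed).

```python
# Illustrative priority-aware corruption for one discrete-diffusion timestep.
# The `significance` scores and scheduling rule are assumptions drawn from
# the summary, not M2DM's actual code.
import torch

def corrupt(tokens, significance, t, T, mask_id):
    # tokens: (B, L) codebook indices; significance: (B, L) priority scores;
    # t: current timestep; T: total steps. At t = T everything is masked.
    L = tokens.size(1)
    # Rank each token by significance (0 = least significant), normalized to [0, 1).
    ranks = significance.argsort(dim=1).argsort(dim=1).float() / L
    # Mask a token once the schedule's masked fraction t/T exceeds its rank,
    # so low-priority tokens are corrupted first as t grows.
    masked = ranks < (t / T)
    return torch.where(masked, torch.full_like(tokens, mask_id), tokens)
```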