MoLingo: Motion-Language Alignment for Text-to-Motion Generation
- URL: http://arxiv.org/abs/2512.13840v1
- Date: Mon, 15 Dec 2025 19:22:40 GMT
- Title: MoLingo: Motion-Language Alignment for Text-to-Motion Generation
- Authors: Yannan He, Garvita Tiwari, Xiaohan Zhang, Pankaj Bora, Tolga Birdal, Jan Eric Lenssen, Gerard Pons-Moll,
- Abstract summary: MoLingo is a text-to-motion (T2M) model that generates realistic, lifelike human motion by denoising in a continuous latent space. We propose a semantic-aligned motion encoder trained with frame-level text labels so that latents with similar text meaning stay close. We also compare single-token conditioning with a multi-token cross-attention scheme and find that cross-attention gives better motion realism and text-motion alignment.
- Score: 50.33970522600594
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce MoLingo, a text-to-motion (T2M) model that generates realistic, lifelike human motion by denoising in a continuous latent space. Recent works perform latent space diffusion, either on the whole latent at once or auto-regressively over multiple latents. In this paper, we study how to make diffusion on continuous motion latents work best. We focus on two questions: (1) how to build a semantically aligned latent space so diffusion becomes more effective, and (2) how to best inject text conditioning so the motion follows the description closely. We propose a semantic-aligned motion encoder trained with frame-level text labels so that latents with similar text meaning stay close, which makes the latent space more diffusion-friendly. We also compare single-token conditioning with a multi-token cross-attention scheme and find that cross-attention gives better motion realism and text-motion alignment. With semantically aligned latents, auto-regressive generation, and cross-attention text conditioning, our model sets a new state of the art in human motion generation on standard metrics and in a user study. We will release our code and models for further research and downstream usage.
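The abstract contrasts single-token conditioning with multi-token cross-attention, where each motion latent attends over all text tokens. A minimal NumPy sketch of that cross-attention step is below; the shapes, weight matrices, and function names are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(motion_latents, text_tokens, W_q, W_k, W_v):
    """Multi-token conditioning: every motion latent queries ALL text
    tokens, rather than a single pooled sentence embedding."""
    Q = motion_latents @ W_q                   # (T_motion, d)
    K = text_tokens @ W_k                      # (T_text, d)
    V = text_tokens @ W_v                      # (T_text, d)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])    # (T_motion, T_text)
    attn = softmax(scores, axis=-1)            # rows sum to 1
    return attn @ V                            # (T_motion, d)

rng = np.random.default_rng(0)
d = 16
motion = rng.standard_normal((8, d))           # 8 motion latents
text = rng.standard_normal((5, d))             # 5 text tokens
W_q, W_k, W_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
out = cross_attend(motion, text, W_q, W_k, W_v)
print(out.shape)                               # (8, 16)
```

Single-token conditioning would instead pool `text` to one vector and add or concatenate it to each latent, losing the per-word attention weights this scheme provides.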
Related papers
- Think Before You Move: Latent Motion Reasoning for Text-to-Motion Generation [37.496002022338395]
We argue that the solution lies in an architectural shift towards Latent System 2 Reasoning. We propose Latent Motion Reasoning (LMR), which reformulates generation as a two-stage Think-then-Act decision process. We demonstrate LMR's versatility by implementing it for two representative baselines: T2M-GPT (discrete) and MotionStreamer (continuous).
arXiv Detail & Related papers (2025-12-30T09:17:44Z) - SceneAdapt: Scene-aware Adaptation of Human Motion Diffusion [74.70024991949269]
We introduce SceneAdapt, a framework that injects scene awareness into text-conditioned motion models. The key idea is to use motion inbetweening, learnable without text, as a proxy task to bridge two distinct datasets. Results show that SceneAdapt effectively injects scene awareness into text-to-motion models.
arXiv Detail & Related papers (2025-10-14T23:42:10Z) - Compressed and Smooth Latent Space for Text Diffusion Modeling [71.87805084454187]
We introduce Cosmos, a novel approach to text generation that operates entirely in a compressed, smooth latent space tailored specifically for diffusion. We demonstrate that text representations can be compressed by $8\times$ while maintaining generation quality comparable to token-level diffusion models. We evaluate Cosmos on four diverse generative tasks including story generation, question generation, summarization, and detoxification, and compare it with various generative paradigms.
arXiv Detail & Related papers (2025-06-26T12:05:13Z) - ReAlign: Bilingual Text-to-Motion Generation via Step-Aware Reward-Guided Alignment [48.894439350114396]
We propose a novel bilingual human motion dataset, BiHumanML3D, which establishes a crucial benchmark for bilingual text-to-motion generation models. We also propose a Bilingual Motion Diffusion model (BiMD), which leverages cross-lingually aligned representations to capture semantics. We show that our approach significantly improves text-motion alignment and motion quality compared to existing state-of-the-art methods.
arXiv Detail & Related papers (2025-05-08T06:19:18Z) - FTMoMamba: Motion Generation with Frequency and Text State Space Models [53.60865359814126]
We propose a novel diffusion-based FTMoMamba framework equipped with a Frequency State Space Model and a Text State Space Model.
To learn fine-grained representation, FreqSSM decomposes sequences into low-frequency and high-frequency components.
To ensure the consistency between text and motion, TextSSM encodes text features at the sentence level.
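The FTMoMamba summary describes decomposing motion sequences into low- and high-frequency components. A minimal sketch of such a split using the real FFT along the time axis is below; the cutoff value and array shapes are illustrative assumptions, not the paper's actual FreqSSM design.

```python
import numpy as np

def freq_split(motion, cutoff=3):
    """Split a motion sequence of shape (frames, features) into low-
    and high-frequency parts along time; the parts sum to the input."""
    spec = np.fft.rfft(motion, axis=0)
    low_spec = spec.copy()
    low_spec[cutoff:] = 0            # keep only the slowest components
    low = np.fft.irfft(low_spec, n=motion.shape[0], axis=0)
    high = motion - low              # residual = fast detail
    return low, high

rng = np.random.default_rng(1)
motion = rng.standard_normal((60, 22))   # 60 frames, 22 joint features
low, high = freq_split(motion)
assert np.allclose(low + high, motion)
```

Intuitively, the low-frequency part captures slow global trajectories while the high-frequency residual carries fine-grained detail, which is the kind of separation the summary attributes to FreqSSM.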
arXiv Detail & Related papers (2024-11-26T15:48:12Z) - LEAD: Latent Realignment for Human Motion Diffusion [12.40712030002265]
Our goal is to generate realistic human motion from natural language.
For motion synthesis, we evaluate LEAD on HumanML3D and KIT-ML and show comparable performance to the state-of-the-art in terms of realism, diversity, and text-motion consistency.
For motion textual inversion, our method demonstrates improved capacity in capturing out-of-distribution characteristics in comparison to traditional VAEs.
arXiv Detail & Related papers (2024-10-18T14:43:05Z) - AttT2M: Text-Driven Human Motion Generation with Multi-Perspective Attention Mechanism [24.049207982022214]
We propose AttT2M, a two-stage method with a multi-perspective attention mechanism.
Our method outperforms the current state of the art in both qualitative and quantitative evaluation.
arXiv Detail & Related papers (2023-09-02T02:18:17Z) - Priority-Centric Human Motion Generation in Discrete Latent Space [59.401128190423535]
We introduce a Priority-Centric Motion Discrete Diffusion Model (M2DM) for text-to-motion generation.
M2DM incorporates a global self-attention mechanism and a regularization term to counteract code collapse.
We also present a motion discrete diffusion model that employs an innovative noise schedule, determined by the significance of each motion token.
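The M2DM summary mentions a noise schedule determined by the significance of each motion token. One plausible reading, sketched below as an assumption rather than the paper's actual mechanism, is a priority-aware masking schedule for discrete diffusion: less significant tokens are corrupted at earlier forward steps, so significant tokens survive longest.

```python
import numpy as np

def priority_mask_schedule(importance, num_steps):
    """Assign each token the forward-diffusion step at which it gets
    masked: least important tokens are corrupted first (hypothetical
    priority-centric schedule)."""
    order = np.argsort(importance)         # least important first
    steps = np.empty_like(order)
    # spread tokens evenly across the forward-process steps
    steps[order] = np.linspace(1, num_steps, len(order)).astype(int)
    return steps

def corrupt(tokens, steps, t, mask_id=-1):
    """Forward process at step t: mask every token scheduled <= t."""
    out = tokens.copy()
    out[steps <= t] = mask_id
    return out

tokens = np.array([12, 7, 31, 5])               # toy motion codes
importance = np.array([0.9, 0.1, 0.6, 0.3])     # hypothetical scores
steps = priority_mask_schedule(importance, num_steps=4)
# at step 2, only the two least important tokens are masked
print(corrupt(tokens, steps, t=2))              # [12 -1 31 -1]
```

The reverse (denoising) process would then recover the most significant tokens first, which matches the "priority-centric" framing in the title.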
arXiv Detail & Related papers (2023-08-28T10:40:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.