IRG-MotionLLM: Interleaving Motion Generation, Assessment and Refinement for Text-to-Motion Generation
- URL: http://arxiv.org/abs/2512.10730v1
- Date: Thu, 11 Dec 2025 15:16:06 GMT
- Title: IRG-MotionLLM: Interleaving Motion Generation, Assessment and Refinement for Text-to-Motion Generation
- Authors: Yuan-Ming Li, Qize Yang, Nan Lei, Shenghao Fu, Ling-An Zeng, Jian-Fang Hu, Xihan Wei, Wei-Shi Zheng
- Abstract summary: Assessment and refinement tasks act as crucial bridges to enable bidirectional knowledge flow between understanding and generation. We introduce IRG-MotionLLM, the first model that seamlessly interleaves motion generation, assessment, and refinement to improve generation performance.
- Score: 54.36300724708094
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in motion-aware large language models have shown remarkable promise for unifying motion understanding and generation tasks. However, these models typically treat understanding and generation separately, limiting the mutual benefits that could arise from interactive feedback between tasks. In this work, we reveal that motion assessment and refinement tasks act as crucial bridges to enable bidirectional knowledge flow between understanding and generation. Leveraging this insight, we propose Interleaved Reasoning for Motion Generation (IRMoGen), a novel paradigm that tightly couples motion generation with assessment and refinement through iterative text-motion dialogue. To realize this, we introduce IRG-MotionLLM, the first model that seamlessly interleaves motion generation, assessment, and refinement to improve generation performance. IRG-MotionLLM is developed progressively with a novel three-stage training scheme, initializing and subsequently enhancing native IRMoGen capabilities. To facilitate this development, we construct an automated data engine to synthesize interleaved reasoning annotations from existing text-motion datasets. Extensive experiments demonstrate that: (i) Assessment and refinement tasks significantly improve text-motion alignment; (ii) Interleaving motion generation, assessment, and refinement steps yields consistent performance gains across training stages; and (iii) IRG-MotionLLM clearly outperforms the baseline model and achieves advanced performance on standard text-to-motion generation benchmarks. Cross-evaluator testing further validates its effectiveness. Code & Data: https://github.com/HumanMLLM/IRG-MotionLLM/tree/main.
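The paper's exact dialogue protocol is not spelled out in this abstract, but the interleaved generate-assess-refine loop it describes can be sketched as follows. All names here (`generate`, `assess`, `refine`, `Assessment`) are hypothetical placeholders for the model's three capabilities, not IRG-MotionLLM's actual API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Assessment:
    aligned: bool   # does the motion match the text prompt?
    critique: str   # natural-language feedback that drives refinement

def interleaved_generation(
    prompt: str,
    generate: Callable[[str], list],            # text -> motion token sequence
    assess: Callable[[str, list], Assessment],  # (text, motion) -> assessment
    refine: Callable[[str, list, str], list],   # (text, motion, critique) -> motion
    max_rounds: int = 3,
) -> list:
    """Generate a motion, then alternate assessment and refinement
    until the assessor accepts it or the round budget runs out."""
    motion = generate(prompt)
    for _ in range(max_rounds):
        verdict = assess(prompt, motion)
        if verdict.aligned:
            break
        motion = refine(prompt, motion, verdict.critique)
    return motion
```

The point of the loop is the bidirectional knowledge flow the abstract describes: the assessor's critique (an understanding task) feeds directly back into the next generation step.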
Related papers
- MotionVerse: A Unified Multimodal Framework for Motion Comprehension, Generation and Editing [53.98607267063729]
MotionVerse is a framework to comprehend, generate, and edit human motion in both single-person and multi-person scenarios. We employ a motion tokenizer with residual quantization, which converts continuous motion sequences into multi-stream discrete tokens. We also introduce a Delay Parallel Modeling strategy, which temporally staggers the encoding of residual token streams.
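Temporal staggering of residual streams is best known from MusicGen-style delay patterns for audio codecs; whether MotionVerse uses this exact offset scheme is an assumption, and the sketch below only illustrates the staggering itself:

```python
import torch

PAD = -1  # hypothetical token id for positions not yet emitted

def apply_delay_pattern(tokens: torch.Tensor) -> torch.Tensor:
    """Stagger residual token streams so stream k is delayed by k steps.

    tokens: (num_streams, T) integer grid from a residual quantizer.
    Returns a (num_streams, T + num_streams - 1) grid in which stream k
    starts at column k, so a causal model can emit all streams in
    parallel while still seeing coarse codes before finer ones.
    """
    s, t = tokens.shape
    out = torch.full((s, t + s - 1), PAD, dtype=tokens.dtype)
    for k in range(s):
        out[k, k:k + t] = tokens[k]
    return out

# Example: 3 residual streams over 5 frames.
print(apply_delay_pattern(torch.arange(15).reshape(3, 5)))
```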
arXiv Detail & Related papers (2025-09-28T04:20:56Z)
- MOGO: Residual Quantized Hierarchical Causal Transformer for High-Quality and Real-Time 3D Human Motion Generation [3.6669020073583756]
MOGO is a novel autoregressive framework tailored for efficient and real-time 3D motion generation. MOGO comprises two key components: MoSA-VQ, a motion scale-adaptive residual vector quantization module, and RQHC-Transformer, a residual quantized hierarchical causal transformer. To enhance semantic fidelity, we introduce a text condition alignment mechanism that improves motion decoding under textual control.
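Residual vector quantization itself is a standard construction: quantize, subtract, and quantize the remainder with the next codebook. A minimal sketch with a generic nearest-neighbor lookup (codebook sizes and MOGO's scale-adaptive twist are not reproduced here):

```python
import torch

def residual_quantize(x: torch.Tensor, codebooks: list[torch.Tensor]):
    """Encode vectors with a stack of codebooks; each level quantizes
    the residual left over by the previous level.

    x: (N, D) feature vectors; codebooks: list of (K, D) tensors.
    Returns per-level token indices and the summed reconstruction.
    """
    residual = x
    recon = torch.zeros_like(x)
    indices = []
    for cb in codebooks:
        dist = torch.cdist(residual, cb)  # (N, K) distances to codewords
        idx = dist.argmin(dim=1)          # nearest codeword per vector
        quantized = cb[idx]
        indices.append(idx)
        recon = recon + quantized
        residual = residual - quantized
    return indices, recon

# Example: two levels of 8 codewords over 4-dim features.
idx, recon = residual_quantize(torch.randn(10, 4),
                               [torch.randn(8, 4), torch.randn(8, 4)])
```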
arXiv Detail & Related papers (2025-06-06T10:26:54Z)
- Multimodal Generative AI with Autoregressive LLMs for Human Motion Understanding and Generation: A Way Forward [8.470241117250243]
This paper focuses on the use of multimodal Generative Artificial Intelligence (GenAI) and autoregressive Large Language Models (LLMs) for human motion understanding and generation. It offers insights into emerging methods, architectures, and their potential to advance realistic and versatile motion synthesis. This research underscores the transformative potential of text-to-motion GenAI and LLM architectures in applications such as healthcare, humanoids, gaming, animation, and assistive technologies.
arXiv Detail & Related papers (2025-05-31T11:02:24Z)
- GENMO: A GENeralist Model for Human MOtion [64.16188966024542]
We present GENMO, a unified Generalist Model for Human Motion that bridges motion estimation and generation in a single framework. Our key insight is to reformulate motion estimation as constrained motion generation, where the output motion must precisely satisfy observed conditioning signals. Our novel architecture handles variable-length motions and mixed multimodal conditions (text, audio, video) at different time intervals, offering flexible control.
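One common way to realize "estimation as constrained generation" is imputation: run an iterative sampler but overwrite the observed entries after every step, so the output exactly satisfies the conditioning signal. This is a generic sketch of that idea, not GENMO's architecture; `denoise_step` stands in for one step of any iterative generator:

```python
import torch

def constrained_sample(denoise_step, x, observed, mask, num_steps=50):
    """Imputation loop: refine x step by step, re-imposing the observed
    values each time so the constraints hold exactly at the end.

    x: (T, D) initial noise; observed: (T, D) known values;
    mask: (T, D) bool, True where a value is observed;
    denoise_step: (x, step) -> x, one refinement step (placeholder).
    """
    for step in range(num_steps):
        x = denoise_step(x, step)
        x = torch.where(mask, observed, x)  # re-impose constraints
    return x
```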
arXiv Detail & Related papers (2025-05-02T17:59:55Z)
- ReCoM: Realistic Co-Speech Motion Generation with Recurrent Embedded Transformer [58.49950218437718]
We present ReCoM, an efficient framework for generating high-fidelity and generalizable human body motions synchronized with speech. The core innovation lies in the Recurrent Embedded Transformer (RET), which integrates Dynamic Embedding Regularization (DER) into a Vision Transformer (ViT) core architecture. To enhance model robustness, we incorporate the proposed DER strategy, which equips the model with dual capabilities of noise resistance and cross-domain generalization.
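The abstract does not define Dynamic Embedding Regularization; one generic embedding-regularization technique with the stated goal of noise resistance is to perturb embeddings during training, as in the illustrative (and explicitly speculative) sketch below:

```python
import torch
import torch.nn as nn

class NoisyEmbedding(nn.Module):
    """Embedding layer that adds Gaussian noise to its output during
    training - a common regularizer for noise robustness. This is a
    guess at the flavor of DER, not ReCoM's published module."""

    def __init__(self, num_tokens: int, dim: int, sigma: float = 0.1):
        super().__init__()
        self.embed = nn.Embedding(num_tokens, dim)
        self.sigma = sigma

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        e = self.embed(ids)
        if self.training:
            e = e + self.sigma * torch.randn_like(e)  # perturb only in training
        return e
```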
arXiv Detail & Related papers (2025-03-27T16:39:40Z)
- CASIM: Composite Aware Semantic Injection for Text to Motion Generation [15.53049009014166]
We propose a composite-aware semantic injection mechanism that learns the dynamic correspondence between text and motion tokens. Experiments on the HumanML3D and KIT benchmarks demonstrate that CASIM consistently improves motion quality, text-motion alignment, and retrieval scores.
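Learning a dynamic correspondence between text and motion tokens is naturally expressed as cross-attention with motion tokens as queries, so each frame can attend to the phrase it should realize. The layer below is a generic sketch of such semantic injection, not CASIM's exact module:

```python
import torch
import torch.nn as nn

class TextToMotionInjection(nn.Module):
    """Cross-attention injection: every motion token attends over all
    text tokens, yielding per-frame, token-level text conditioning."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, motion: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # motion: (B, T_m, dim) queries; text: (B, T_t, dim) keys/values
        injected, _ = self.attn(motion, text, text)
        return self.norm(motion + injected)  # residual + norm
```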
arXiv Detail & Related papers (2025-02-04T07:22:07Z)
- PackDiT: Joint Human Motion and Text Generation via Mutual Prompting [22.53146582495341]
PackDiT is the first diffusion-based generative model capable of performing various tasks simultaneously. We train PackDiT on the HumanML3D dataset, achieving state-of-the-art text-to-motion performance with an FID score of 0.106. Our experiments further demonstrate that diffusion models are effective for motion-to-text generation, achieving performance comparable to that of autoregressive models.
arXiv Detail & Related papers (2025-01-27T22:51:45Z)
- Learning Generalizable Human Motion Generator with Reinforcement Learning [95.62084727984808]
Text-driven human motion generation is a vital task in computer-aided content creation.
Existing methods often overfit specific motion expressions in the training data, hindering their ability to generalize.
We present InstructMotion, which incorporates the trial-and-error paradigm of reinforcement learning for generalizable human motion generation.
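"Trial and error" here suggests sampling motions, scoring them with a reward such as text-motion alignment, and reinforcing high-reward samples. The REINFORCE-style step below is a generic sketch of that recipe; `policy.sample` and `reward_fn` are hypothetical interfaces, not InstructMotion's published objective:

```python
import torch

def reinforce_step(policy, reward_fn, prompts, optimizer):
    """One generic policy-gradient update: sample a motion per prompt,
    score it, and weight the log-likelihood by the centered reward."""
    motions, log_probs = policy.sample(prompts)  # log_probs: (B,), summed over tokens
    rewards = reward_fn(prompts, motions)        # (B,) e.g. text-motion alignment
    baseline = rewards.mean()                    # simple variance reduction
    loss = -((rewards - baseline) * log_probs).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return rewards.mean().item()
```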
arXiv Detail & Related papers (2024-05-24T13:29:12Z)
- Motion Flow Matching for Human Motion Synthesis and Editing [75.13665467944314]
We propose Motion Flow Matching, a novel generative model for human motion generation featuring efficient sampling and effectiveness in motion editing applications.
Our method reduces the sampling complexity from a thousand steps in previous diffusion models to just ten, while achieving comparable performance on text-to-motion and action-to-motion generation benchmarks.
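The step reduction comes from flow matching's straight probability paths: the network regresses the constant velocity of a line between noise and data, and sampling just integrates that velocity field in a few Euler steps. The loss below is the standard conditional flow matching formulation; the motion network and its signature are left abstract:

```python
import torch

def flow_matching_loss(velocity_net, x1, text_emb):
    """Standard (rectified-flow style) conditional flow matching loss.

    x1: (B, T, D) ground-truth motions; text_emb: (B, E) condition.
    The straight path x_t = (1 - t) * x0 + t * x1 has velocity x1 - x0,
    which the network learns to predict at random times t.
    """
    x0 = torch.randn_like(x1)            # noise endpoint of the path
    t = torch.rand(x1.size(0), 1, 1)     # one random time per sample
    xt = (1 - t) * x0 + t * x1
    target = x1 - x0
    pred = velocity_net(xt, t.view(-1), text_emb)  # hypothetical signature
    return ((pred - target) ** 2).mean()
```

At inference, integrating dx/dt = v(x, t, c) from t = 0 to 1 with about ten Euler steps reproduces the claimed sampling budget.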
arXiv Detail & Related papers (2023-12-14T12:57:35Z)
- Interactive Character Control with Auto-Regressive Motion Diffusion Models [18.727066177880708]
We propose A-MDM (Auto-regressive Motion Diffusion Model) for real-time motion synthesis.
Our conditional diffusion model takes an initial pose as input and auto-regressively generates successive motion frames, each conditioned on the previous frame.
We introduce a suite of techniques for incorporating interactive controls into A-MDM, such as task-oriented sampling, in-painting, and hierarchical reinforcement learning.
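The auto-regressive structure reduces to a frame-by-frame rollout where each denoising pass is conditioned on the frame just produced. In the sketch, `sample_next_frame` is a placeholder for the conditional per-frame diffusion sampler:

```python
import torch

def rollout(sample_next_frame, init_pose: torch.Tensor, num_frames: int):
    """Auto-regressive synthesis: generate one frame at a time,
    conditioning each step on the previously generated frame.

    sample_next_frame: (prev_frame) -> next_frame, a placeholder for a
    conditional diffusion sampler run to completion per frame.
    """
    frames = [init_pose]
    for _ in range(num_frames - 1):
        frames.append(sample_next_frame(frames[-1]))
    return torch.stack(frames)  # (num_frames, pose_dim)
```

Interactive control then amounts to steering each per-frame sampling pass, for example by in-painting some joints or biasing samples toward a task objective.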
arXiv Detail & Related papers (2023-06-01T07:48:34Z)
- ReMoDiffuse: Retrieval-Augmented Motion Diffusion Model [33.64263969970544]
3D human motion generation is crucial for the creative industry.
Recent advances rely on generative models with domain knowledge for text-driven motion generation.
We propose ReMoDiffuse, a diffusion-model-based motion generation framework.
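Retrieval augmentation typically means embedding the prompt, fetching the most similar text-motion pairs from a database, and feeding them to the generator as extra context. The sketch covers only the retrieval step (cosine similarity over precomputed text embeddings); how ReMoDiffuse fuses the retrieved pairs into its diffusion model is not reproduced here:

```python
import torch

def retrieve(query_emb: torch.Tensor, db_embs: torch.Tensor, k: int = 4):
    """Return indices of the k database entries whose text embeddings
    are closest to the query prompt under cosine similarity.

    query_emb: (D,) prompt embedding; db_embs: (N, D) database embeddings.
    """
    q = query_emb / query_emb.norm()
    d = db_embs / db_embs.norm(dim=1, keepdim=True)
    scores = d @ q               # (N,) cosine similarities
    return scores.topk(k).indices
```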
arXiv Detail & Related papers (2023-04-03T16:29:00Z)