Related papers: MotionGPT3: Human Motion as a Second Modality

MotionGPT3: Human Motion as a Second Modality

URL: http://arxiv.org/abs/2506.24086v3
Date: Mon, 03 Nov 2025 02:15:34 GMT
Title: MotionGPT3: Human Motion as a Second Modality
Authors: Bingfan Zhu, Biao Jiang, Sunyi Wang, Shixiang Tang, Tao Chen, Linjie Luo, Youyi Zheng, Xin Chen,
Abstract summary: MotionGPT3 is a bimodal motion-language model for both understanding and generation.<n>A dual-stream Transformer with shared attention preserves modality-specific routes while enabling controlled, bidirectional information flow.<n>Experiments show that MotionGPT3 achieves 2x faster convergence in training loss and up to 4x faster convergence in validation.
Score: 28.616340011811843
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: With the rapid progress of large language models (LLMs), multimodal frameworks that unify understanding and generation have become promising, yet they face increasing complexity as the number of modalities and tasks grows. We observe that motion quantization introduces approximation errors that cap motion quality, and that unifying discrete text and continuous motion within a single-stream backbone amplifies cross-modal interference. Motivated by recent multi-branch Transformer designs that separate signals from different modalities, we propose MotionGPT3, a bimodal motion-language model for both understanding and generation. MotionGPT3 encodes raw motion into a continuous latent space using a variational autoencoder (VAE), thereby avoiding quantization-induced artifacts, while leveraging the semantic prior of pretrained language models. A dual-stream Transformer with shared attention preserves modality-specific routes while enabling controlled, bidirectional information flow, which reduces interference, stabilizing optimization, and empirically accelerates convergence without degrading fidelity. For multimodal joint training, a generate-then-align three-stage schedule further improves stability and limits cross-task interference. Experiments show that MotionGPT3 achieves 2x faster convergence in training loss and up to 4x faster convergence in validation, while maintaining state-of-the-art performance on standard motion understanding and motion generation benchmarks.

Related papers

T2M Mamba: Motion Periodicity-Saliency Coupling Approach for Stable Text-Driven Motion Generation [3.6564162676635363]
Text-to-motion generation has attracted increasing attention in fields such as avatar animation and humanoid robotic interaction.<n>Models treat motion periodicity and saliency as independent factors, overlooking their coupling and causing generation drift in long sequences.<n>We propose T2M Mamba to address these limitations by (i) proposing Periodicity-Saliency Aware Mamba.
arXiv Detail & Related papers (2026-02-01T17:42:53Z)
OmniMotion-X: Versatile Multimodal Whole-Body Motion Generation [52.579531290307926]
This paper introduces OmniMotion-X, a versatile framework for whole-body human motion generation.<n> OmniMotion-X efficiently supports diverse multimodal tasks, including text-to-motion, music-to-dance, speech-to-gesture.<n>To enable high-quality multimodal training, we construct OmniMoCap-X, the largest unified multimodal motion dataset to date.
arXiv Detail & Related papers (2025-10-22T17:25:33Z)
Unified Multi-Modal Interactive & Reactive 3D Motion Generation via Rectified Flow [17.95248351806955]
We introduce DualFlow, a framework for multi-modal two-person motion generation.<n>It synthesises motion on diverse inputs, including text, music, and prior motion sequences.<n>It produces temporally coherent and rhythmically synchronized motions, setting state-of-the-art in multi-modal human motion generation.
arXiv Detail & Related papers (2025-09-28T22:36:18Z)
MotionVerse: A Unified Multimodal Framework for Motion Comprehension, Generation and Editing [53.98607267063729]
MotionVerse is a framework to comprehend, generate, and edit human motion in both single-person and multi-person scenarios.<n>We employ a motion tokenizer with residual quantization, which converts continuous motion sequences into multi-stream discrete tokens.<n>We also introduce a textitDelay Parallel Modeling strategy, which temporally staggers the encoding of residual token streams.
arXiv Detail & Related papers (2025-09-28T04:20:56Z)
M2DAO-Talker: Harmonizing Multi-granular Motion Decoupling and Alternating Optimization for Talking-head Generation [65.48046909056468]
We reformulate talking head generation into a unified framework comprising video preprocessing, motion representation, and rendering reconstruction.<n>M2DAO-Talker achieves state-of-the-art performance, with the 2.43 dB PSNR improvement in generation quality and 0.64 gain in user-evaluated video realness.
arXiv Detail & Related papers (2025-07-11T04:48:12Z)
Motion-R1: Chain-of-Thought Reasoning and Reinforcement Learning for Human Motion Generation [31.484189825477877]
Motion-R1 is a unified motion-language modeling framework that integrates a Chain-of-Thought mechanism.<n>By explicitly decomposing complex textual instructions into logically structured action paths, Motion-R1 provides high-level semantic guidance for motion generation.<n>Experiments across multiple benchmark datasets demonstrate that Motion-R1 achieves competitive or superior performance compared to state-of-the-art methods.
arXiv Detail & Related papers (2025-06-12T05:21:43Z)
MOGO: Residual Quantized Hierarchical Causal Transformer for High-Quality and Real-Time 3D Human Motion Generation [3.6669020073583756]
MOGO is a novel autoregressive framework tailored for efficient and real-time 3D motion generation.<n>MoGO comprises two key components: MoSA-VQ, a motion scale-adaptive residual vector quantization module, and RQHC-Transformer, a residual quantized hierarchical causal transformer.<n>To enhance semantic fidelity, we introduce a text condition alignment mechanism that improves motion decoding under textual control.
arXiv Detail & Related papers (2025-06-06T10:26:54Z)
Follow-Your-Motion: Video Motion Transfer via Efficient Spatial-Temporal Decoupled Finetuning [50.4776422843776]
Follow-Your-Motion is an efficient two-stage video motion transfer framework.<n>We propose a spatial-temporal decoupled LoRA to decouple the attention architecture for spatial appearance and temporal motion processing.<n>During the second training stage, we design the sparse motion sampling and adaptive RoPE to accelerate the tuning speed.
arXiv Detail & Related papers (2025-06-05T16:18:32Z)
GENMO: A GENeralist Model for Human MOtion [64.16188966024542]
We present GENMO, a unified Generalist Model for Human Motion that bridges motion estimation and generation in a single framework.<n>Our key insight is to reformulate motion estimation as constrained motion generation, where the output motion must precisely satisfy observed conditioning signals.<n>Our novel architecture handles variable-length motions and mixed multimodal conditions (text, audio, video) at different time intervals, offering flexible control.
arXiv Detail & Related papers (2025-05-02T17:59:55Z)
ReCoM: Realistic Co-Speech Motion Generation with Recurrent Embedded Transformer [58.49950218437718]
We present ReCoM, an efficient framework for generating high-fidelity and generalizable human body motions synchronized with speech.<n>The core innovation lies in the Recurrent Embedded Transformer (RET), which integrates Dynamic Embedding Regularization (DER) into a Vision Transformer (ViT) core architecture.<n>To enhance model robustness, we incorporate the proposed DER strategy, which equips the model with dual capabilities of noise resistance and cross-domain generalization.
arXiv Detail & Related papers (2025-03-27T16:39:40Z)
Motion-Aware Generative Frame Interpolation [23.380470636851022]
Flow-based frame methods ensure motion stability through estimated intermediate flow but often introduce severe artifacts in complex motion regions.<n>Recent generative approaches, boosted by large-scale pre-trained video generation models, show promise in handling intricate scenes.<n>We propose Motion-aware Generative frame (MoG) that synergizes intermediate flow guidance with generative capacities to enhance fidelity.
arXiv Detail & Related papers (2025-01-07T11:03:43Z)
Two-in-One: Unified Multi-Person Interactive Motion Generation by Latent Diffusion Transformer [24.166147954731652]
Multi-person interactive motion generation is a critical yet under-explored domain in computer character animation.<n>Current research often employs separate module branches for individual motions, leading to a loss of interaction information.<n>We propose a novel, unified approach that models multi-person motions and their interactions within a single latent space.
arXiv Detail & Related papers (2024-12-21T15:35:50Z)
M2D2M: Multi-Motion Generation from Text with Discrete Diffusion Models [18.125860678409804]
We introduce the Multi-Motion Discrete Diffusion Models (M2D2M), a novel approach for human motion generation from text descriptions. M2D2M adeptly addresses the challenge of generating multi-motion sequences, ensuring seamless transitions of motions and coherence across a series of actions.
arXiv Detail & Related papers (2024-07-19T17:57:33Z)
DiverseMotion: Towards Diverse Human Motion Generation via Discrete Diffusion [70.33381660741861]
We present DiverseMotion, a new approach for synthesizing high-quality human motions conditioned on textual descriptions. We show that our DiverseMotion achieves the state-of-the-art motion quality and competitive motion diversity.
arXiv Detail & Related papers (2023-09-04T05:43:48Z)
Priority-Centric Human Motion Generation in Discrete Latent Space [59.401128190423535]
We introduce a Priority-Centric Motion Discrete Diffusion Model (M2DM) for text-to-motion generation. M2DM incorporates a global self-attention mechanism and a regularization term to counteract code collapse. We also present a motion discrete diffusion model that employs an innovative noise schedule, determined by the significance of each motion token.
arXiv Detail & Related papers (2023-08-28T10:40:16Z)
Persistent-Transient Duality: A Multi-mechanism Approach for Modeling Human-Object Interaction [58.67761673662716]
Humans are highly adaptable, swiftly switching between different modes to handle different tasks, situations and contexts. In Human-object interaction (HOI) activities, these modes can be attributed to two mechanisms: (1) the large-scale consistent plan for the whole activity and (2) the small-scale children interactive actions that start and end along the timeline. This work proposes to model two concurrent mechanisms that jointly control human motion.
arXiv Detail & Related papers (2023-07-24T12:21:33Z)
InterGen: Diffusion-based Multi-human Motion Generation under Complex Interactions [49.097973114627344]
We present InterGen, an effective diffusion-based approach that incorporates human-to-human interactions into the motion diffusion process. We first contribute a multimodal dataset, named InterHuman. It consists of about 107M frames for diverse two-person interactions, with accurate skeletal motions and 23,337 natural language descriptions. We propose a novel representation for motion input in our interaction diffusion model, which explicitly formulates the global relations between the two performers in the world frame.
arXiv Detail & Related papers (2023-04-12T08:12:29Z)
Modeling Continuous Motion for 3D Point Cloud Object Tracking [54.48716096286417]
This paper presents a novel approach that views each tracklet as a continuous stream. At each timestamp, only the current frame is fed into the network to interact with multi-frame historical features stored in a memory bank. To enhance the utilization of multi-frame features for robust tracking, a contrastive sequence enhancement strategy is proposed.
arXiv Detail & Related papers (2023-03-14T02:58:27Z)
Executing your Commands via Motion Diffusion in Latent Space [51.64652463205012]
We propose a Motion Latent-based Diffusion model (MLD) to produce vivid motion sequences conforming to the given conditional inputs. Our MLD achieves significant improvements over the state-of-the-art methods among extensive human motion generation tasks.
arXiv Detail & Related papers (2022-12-08T03:07:00Z)

This list is automatically generated from the titles and abstracts of the papers in this site.