MOGO: Residual Quantized Hierarchical Causal Transformer for High-Quality and Real-Time 3D Human Motion Generation
- URL: http://arxiv.org/abs/2506.05952v2
- Date: Thu, 07 Aug 2025 09:28:34 GMT
- Title: MOGO: Residual Quantized Hierarchical Causal Transformer for High-Quality and Real-Time 3D Human Motion Generation
- Authors: Dongjie Fu, Tengjiao Sun, Pengcheng Fang, Xiaohao Cai, Hansung Kim
- Abstract summary: MOGO is a novel autoregressive framework tailored for efficient and real-time 3D motion generation. MOGO comprises two key components: MoSA-VQ, a motion scale-adaptive residual vector quantization module, and RQHC-Transformer, a residual quantized hierarchical causal transformer. To enhance semantic fidelity, we introduce a text condition alignment mechanism that improves motion decoding under textual control.
- Score: 3.6669020073583756
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent advances in transformer-based text-to-motion generation have led to impressive progress in synthesizing high-quality human motion. Nevertheless, jointly achieving high fidelity, streaming capability, real-time responsiveness, and scalability remains a fundamental challenge. In this paper, we propose MOGO (Motion Generation with One-pass), a novel autoregressive framework tailored for efficient and real-time 3D motion generation. MOGO comprises two key components: (1) MoSA-VQ, a motion scale-adaptive residual vector quantization module that hierarchically discretizes motion sequences with learnable scaling to produce compact yet expressive representations; and (2) RQHC-Transformer, a residual quantized hierarchical causal transformer that generates multi-layer motion tokens in a single forward pass, significantly reducing inference latency. To enhance semantic fidelity, we further introduce a text condition alignment mechanism that improves motion decoding under textual control. Extensive experiments on benchmark datasets including HumanML3D, KIT-ML, and CMP demonstrate that MOGO achieves competitive or superior generation quality compared to state-of-the-art transformer-based methods, while offering substantial improvements in real-time performance, streaming generation, and generalization under zero-shot settings.
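As a rough, hedged illustration of the residual quantization idea behind MoSA-VQ, the sketch below discretizes a latent motion sequence level by level, each level coding the residual left by the previous one, so every frame becomes a stack of tokens of the kind RQHC-Transformer would then generate in a single pass. The learnable per-level scale, class names, codebook size, and dimensions are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ResidualVQ(nn.Module):
    """Minimal residual vector quantizer: each level quantizes the residual
    left by the previous level, so every frame ends up as a stack of tokens."""
    def __init__(self, num_levels=4, codebook_size=512, dim=256):
        super().__init__()
        self.codebooks = nn.ModuleList(
            [nn.Embedding(codebook_size, dim) for _ in range(num_levels)]
        )
        # Hypothetical learnable per-level scaling, standing in for the
        # "scale-adaptive" part of MoSA-VQ mentioned in the abstract.
        self.scales = nn.Parameter(torch.ones(num_levels))

    def forward(self, z):                              # z: (batch, frames, dim)
        residual, quantized, tokens = z, torch.zeros_like(z), []
        for level, codebook in enumerate(self.codebooks):
            scaled = residual * self.scales[level]
            # Nearest-codeword lookup for this level.
            dists = (scaled.unsqueeze(-2) - codebook.weight).pow(2).sum(-1)
            idx = dists.argmin(dim=-1)                 # (batch, frames)
            q = codebook(idx) / self.scales[level]
            quantized = quantized + q
            residual = residual - q                    # pass the residual down
            tokens.append(idx)
        return quantized, torch.stack(tokens, dim=-1)  # tokens: (batch, frames, levels)

vq = ResidualVQ()
latent = torch.randn(2, 64, 256)                       # toy encoded motion latents
recon, token_ids = vq(latent)
print(recon.shape, token_ids.shape)                    # (2, 64, 256) (2, 64, 4)
```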
Related papers
- ReMoMask: Retrieval-Augmented Masked Motion Generation [8.471755159366221]
Text-to-Motion (T2M) generation aims to synthesize realistic and semantically aligned human motion sequences from natural language descriptions.
We propose ReMoMask, a unified framework integrating three key innovations.
A Bidirectional Momentum Text-Motion Model decouples negative sample scale from batch size via momentum queues, substantially improving cross-modal retrieval precision.
A Semantic Spatio-temporal Attention mechanism enforces biomechanical constraints during part-level fusion to eliminate asynchronous artifacts.
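The momentum-queue idea, which decouples the number of negatives from the batch size, can be sketched MoCo-style as below. The queue length, embedding dimension, and loss form are assumptions rather than ReMoMask's actual implementation.

```python
import torch
import torch.nn.functional as F

class MomentumQueue:
    """Fixed-size FIFO of negative embeddings: the number of negatives is set
    by the queue length rather than the batch size (a hedged sketch of the
    momentum-queue idea, not ReMoMask's code)."""
    def __init__(self, dim=512, size=8192):
        self.queue = F.normalize(torch.randn(size, dim), dim=-1)
        self.ptr = 0

    @torch.no_grad()
    def enqueue(self, keys):                   # keys: (B, dim), momentum-encoded
        b = keys.size(0)
        idx = (self.ptr + torch.arange(b)) % self.queue.size(0)
        self.queue[idx] = F.normalize(keys, dim=-1)
        self.ptr = (self.ptr + b) % self.queue.size(0)

def contrastive_loss(text_emb, motion_emb, queue, temperature=0.07):
    """InfoNCE with in-batch positives and queued negatives."""
    text_emb = F.normalize(text_emb, dim=-1)
    motion_emb = F.normalize(motion_emb, dim=-1)
    pos = (text_emb * motion_emb).sum(-1, keepdim=True)        # (B, 1)
    neg = text_emb @ queue.queue.t()                           # (B, queue size)
    logits = torch.cat([pos, neg], dim=1) / temperature
    labels = torch.zeros(text_emb.size(0), dtype=torch.long)   # positive is column 0
    return F.cross_entropy(logits, labels)

queue = MomentumQueue()
loss = contrastive_loss(torch.randn(16, 512), torch.randn(16, 512), queue)
queue.enqueue(torch.randn(16, 512))            # keys from the momentum encoder
```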
arXiv Detail & Related papers (2025-08-04T16:56:35Z)
- M2DAO-Talker: Harmonizing Multi-granular Motion Decoupling and Alternating Optimization for Talking-head Generation [65.08520614570288]
We reformulate talking head generation into a unified framework comprising video preprocessing, motion representation, and rendering reconstruction.
M2DAO-Talker achieves state-of-the-art performance, with a 2.43 dB PSNR improvement in generation quality and a 0.64 gain in user-evaluated video realness.
arXiv Detail & Related papers (2025-07-11T04:48:12Z)
- Absolute Coordinates Make Motion Generation Easy [8.153961351540834]
State-of-the-art text-to-motion generation models rely on the kinematic-aware, local-relative motion representation popularized by HumanML3D.
We propose a radically simplified and long-abandoned alternative for text-to-motion generation: absolute joint coordinates in global space.
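A minimal sketch of the representation this paper advocates, assuming a (frames, joints, 3) layout of global positions; the conversion shown is only an illustration of "absolute joint coordinates in global space", not the paper's exact feature layout.

```python
import torch

def to_absolute(root_xyz, joints_rel_to_root):
    """Toy view of motion stored as absolute joint coordinates in global space,
    obtained here by adding the global root trajectory to root-relative joint
    offsets. Shapes and the (T, J, 3) layout are illustrative assumptions.
    root_xyz:           (T, 3)    global root positions
    joints_rel_to_root: (T, J, 3) joint positions relative to the root"""
    return root_xyz[:, None, :] + joints_rel_to_root   # (T, J, 3), global space

abs_motion = to_absolute(torch.zeros(120, 3), torch.randn(120, 22, 3))
print(abs_motion.shape)                                # torch.Size([120, 22, 3])
```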
arXiv Detail & Related papers (2025-05-26T00:36:00Z)
- FlowMotion: Target-Predictive Conditional Flow Matching for Jitter-Reduced Text-Driven Human Motion Generation [0.6554326244334868]
FlowMotion incorporates a training objective focused on more accurately predicting the target motion in 3D human motion generation.
FlowMotion achieves state-of-the-art jitter performance, with the lowest jitter on the KIT dataset and the second-lowest on HumanML3D.
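A hedged sketch of a target-predictive flow-matching objective, in which the network regresses the clean motion rather than a velocity field (the velocity can be recovered as (x1_pred - x_t) / (1 - t) at sampling time). The toy network, shapes, and names are assumptions, not FlowMotion's code.

```python
import torch
import torch.nn as nn

class ToyMotionNet(nn.Module):
    """Stand-in network mapping (noisy motion, time, text embedding) -> motion."""
    def __init__(self, dim=64, cond_dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1 + cond_dim, 256),
                                 nn.SiLU(), nn.Linear(256, dim))

    def forward(self, x_t, t, cond):                   # x_t: (B, T, D)
        t = t[:, None, None].expand(-1, x_t.size(1), 1)
        c = cond[:, None, :].expand(-1, x_t.size(1), -1)
        return self.net(torch.cat([x_t, t, c], dim=-1))

def target_predictive_fm_loss(model, x1, cond):
    """Flow-matching objective where the model regresses the clean target x1."""
    x0 = torch.randn_like(x1)                          # noise endpoint of the path
    t = torch.rand(x1.size(0))                         # per-sample time in [0, 1)
    x_t = (1 - t)[:, None, None] * x0 + t[:, None, None] * x1   # linear path
    x1_pred = model(x_t, t, cond)                      # predict the target motion
    return ((x1_pred - x1) ** 2).mean()

model = ToyMotionNet()
loss = target_predictive_fm_loss(model, torch.randn(4, 60, 64), torch.randn(4, 32))
print(loss.item())
```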
arXiv Detail & Related papers (2025-04-02T03:55:21Z)
- ReCoM: Realistic Co-Speech Motion Generation with Recurrent Embedded Transformer [58.49950218437718]
We present ReCoM, an efficient framework for generating high-fidelity and generalizable human body motions synchronized with speech.
The core innovation lies in the Recurrent Embedded Transformer (RET), which integrates Dynamic Embedding Regularization (DER) into a Vision Transformer (ViT) core architecture.
To enhance model robustness, we incorporate the proposed DER strategy, which gives the model both noise resistance and cross-domain generalization.
arXiv Detail & Related papers (2025-03-27T16:39:40Z)
- Text-guided 3D Human Motion Generation with Keyframe-based Parallel Skip Transformer [62.29951737214263]
Existing algorithms directly generate the full sequence, which is computationally expensive and prone to errors.
We propose KeyMotion, which generates plausible human motion sequences corresponding to the input text.
We use a Variational Autoencoder (VAE) with Kullback-Leibler regularization to project the keyframes into a latent space.
For the reverse diffusion, we propose a novel Parallel Skip Transformer that performs cross-modal attention between the design latents and text condition.
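A minimal sketch of a VAE with a Kullback-Leibler regularization term of the kind mentioned above; the architecture and dimensions are assumptions for illustration, not KeyMotion's.

```python
import torch
import torch.nn as nn

class KeyframeVAE(nn.Module):
    """Tiny VAE: encode a pose into (mean, log-variance), sample with the
    reparameterization trick, decode, and return a KL term for regularization."""
    def __init__(self, in_dim=22 * 3, latent_dim=128):
        super().__init__()
        self.enc = nn.Linear(in_dim, 2 * latent_dim)   # outputs mean and log-var
        self.dec = nn.Linear(latent_dim, in_dim)

    def forward(self, x):                              # x: (B, in_dim) keyframe pose
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization
        recon = self.dec(z)
        kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(-1).mean()
        return recon, kl

vae = KeyframeVAE()
pose = torch.randn(8, 66)
recon, kl = vae(pose)
loss = ((recon - pose) ** 2).mean() + 1e-3 * kl        # reconstruction + weighted KL
```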
arXiv Detail & Related papers (2024-05-24T11:12:37Z)
- BAMM: Bidirectional Autoregressive Motion Model [14.668729995275807]
Bidirectional Autoregressive Motion Model (BAMM) is a novel text-to-motion generation framework.
BAMM consists of two key components: a motion tokenizer that transforms 3D human motion into discrete tokens in latent space, and a masked self-attention transformer that autoregressively predicts randomly masked tokens.
This enables BAMM to simultaneously achieve high-quality motion generation, enhanced usability, and built-in motion editability.
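The masked-prediction side of the scheme described above can be sketched as below: motion tokens are randomly replaced with a [MASK] id and a transformer is trained to recover them. This shows only the masked-token component, not BAMM's full autoregressive decoding; vocabulary size, depth, and masking ratio are assumptions.

```python
import torch
import torch.nn as nn

class MaskedMotionTransformer(nn.Module):
    """Randomly mask motion tokens and train a transformer to predict them."""
    def __init__(self, vocab=512, dim=256):
        super().__init__()
        self.mask_id = vocab                           # last embedding row is [MASK]
        self.embed = nn.Embedding(vocab + 1, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens, mask_prob=0.4):          # tokens: (B, T) codebook ids
        mask = torch.rand(tokens.shape, device=tokens.device) < mask_prob
        corrupted = tokens.masked_fill(mask, self.mask_id)
        logits = self.head(self.encoder(self.embed(corrupted)))
        # Cross-entropy only on the masked positions.
        return nn.functional.cross_entropy(logits[mask], tokens[mask])

model = MaskedMotionTransformer()
loss = model(torch.randint(0, 512, (2, 64)))
```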
arXiv Detail & Related papers (2024-03-28T14:04:17Z)
- Motion Flow Matching for Human Motion Synthesis and Editing [75.13665467944314]
We propose Motion Flow Matching, a novel generative model for human motion generation featuring efficient sampling and effectiveness in motion editing applications.
Our method reduces the sampling complexity from thousands of steps in previous diffusion models to just ten steps, while achieving comparable performance on text-to-motion and action-to-motion generation benchmarks.
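Few-step sampling of a flow-matching model amounts to integrating the learned velocity field over a handful of steps; the ten-step Euler loop below is a generic illustration with a toy velocity field standing in for a trained model, not Motion Flow Matching's code.

```python
import torch

@torch.no_grad()
def sample_flow(velocity_model, cond, shape, steps=10):
    """Integrate the learned velocity field from noise (t=0) to data (t=1)
    in `steps` uniform Euler steps."""
    x = torch.randn(shape)                             # start from Gaussian noise
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0],), i * dt)
        x = x + dt * velocity_model(x, t, cond)        # one Euler step
    return x

# Toy velocity field (an assumption) just to make the sketch executable:
toy_velocity = lambda x, t, cond: -x
motion = sample_flow(toy_velocity, cond=None, shape=(2, 60, 64))
print(motion.shape)                                    # torch.Size([2, 60, 64])
```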
arXiv Detail & Related papers (2023-12-14T12:57:35Z)
- Interactive Character Control with Auto-Regressive Motion Diffusion Models [18.727066177880708]
We propose A-MDM (Auto-regressive Motion Diffusion Model) for real-time motion synthesis.
Our conditional diffusion model takes an initial pose as input and auto-regressively generates successive motion frames conditioned on the previous frame.
We introduce a suite of techniques for incorporating interactive controls into A-MDM, such as task-oriented sampling, in-painting, and hierarchical reinforcement learning.
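The per-frame loop described above can be sketched generically: each new frame is denoised from noise while conditioning on the previous frame. The denoiser interface, step counts, and the toy stand-in model are assumptions, not A-MDM's implementation.

```python
import torch

@torch.no_grad()
def rollout(denoiser, first_pose, num_frames=60, diffusion_steps=8):
    """Auto-regressive diffusion rollout: generate one frame at a time, each
    produced by a short denoising loop conditioned on the previous frame."""
    frames = [first_pose]                               # first_pose: (B, D)
    for _ in range(num_frames - 1):
        x = torch.randn_like(first_pose)                # start next frame from noise
        for step in reversed(range(diffusion_steps)):
            t = torch.full((first_pose.size(0),), step, dtype=torch.long)
            # Denoiser returns a cleaner frame given noise level and previous frame.
            x = denoiser(x, t, prev=frames[-1])
        frames.append(x)
    return torch.stack(frames, dim=1)                   # (B, num_frames, D)

toy_denoiser = lambda x, t, prev: 0.5 * x + 0.5 * prev  # stand-in for a trained model
motion = rollout(toy_denoiser, first_pose=torch.zeros(1, 66))
print(motion.shape)                                     # torch.Size([1, 60, 66])
```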
arXiv Detail & Related papers (2023-06-01T07:48:34Z)
- ReMoDiffuse: Retrieval-Augmented Motion Diffusion Model [33.64263969970544]
3D human motion generation is crucial for the creative industry.
Recent advances rely on generative models with domain knowledge for text-driven motion generation.
We propose ReMoDiffuse, a diffusion-model-based motion generation framework.
arXiv Detail & Related papers (2023-04-03T16:29:00Z)
- DynaST: Dynamic Sparse Transformer for Exemplar-Guided Image Generation [56.514462874501675]
We propose a dynamic sparse attention-based Transformer model to achieve fine-level matching with favorable efficiency.
The heart of our approach is a novel dynamic-attention unit, dedicated to covering the variation in the optimal number of tokens each position should focus on.
Experiments on three applications, pose-guided person image generation, edge-based face synthesis, and undistorted image style transfer, demonstrate that DynaST achieves superior performance in local details.
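DynaST's dynamic-attention unit learns how many tokens each position should attend to; the sketch below shows the simpler fixed top-k form of sparse attention to illustrate the mechanism, with all shapes chosen arbitrarily and no claim to match DynaST's implementation.

```python
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, top_k=8):
    """Each query attends only to its top-k most similar keys; the remaining
    scores are masked to -inf before the softmax."""
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5   # (B, Tq, Tk)
    kept, idx = scores.topk(top_k, dim=-1)                 # keep k best keys per query
    sparse = torch.full_like(scores, float('-inf'))
    sparse.scatter_(-1, idx, kept)                         # mask out the rest
    attn = F.softmax(sparse, dim=-1)
    return attn @ v

q = k = v = torch.randn(2, 100, 64)
out = topk_sparse_attention(q, k, v)
print(out.shape)                                           # torch.Size([2, 100, 64])
```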
arXiv Detail & Related papers (2022-07-13T11:12:03Z)
- Unsupervised Motion Representation Learning with Capsule Autoencoders [54.81628825371412]
Motion Capsule Autoencoder (MCAE) models motion in a two-level hierarchy.
MCAE is evaluated on a novel Trajectory20 motion dataset and various real-world skeleton-based human action datasets.
arXiv Detail & Related papers (2021-10-01T16:52:03Z)