MoSa: Motion Generation with Scalable Autoregressive Modeling
- URL: http://arxiv.org/abs/2511.01200v1
- Date: Mon, 03 Nov 2025 03:47:58 GMT
- Title: MoSa: Motion Generation with Scalable Autoregressive Modeling
- Authors: Mengyuan Liu, Sheng Yan, Yong Wang, Yingjie Li, Gui-Bin Bian, Hong Liu
- Abstract summary: We introduce MoSa, a novel hierarchical motion generation framework for text-driven 3D human motion generation. MoSa achieves state-of-the-art generation quality and efficiency, outperforming prior methods in both fidelity and speed. MoSa generalizes well to downstream tasks such as motion editing, requiring no additional fine-tuning.
- Score: 41.75643989500953
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce MoSa, a novel hierarchical motion generation framework for text-driven 3D human motion generation that enhances the Vector Quantization-guided Generative Transformers (VQ-GT) paradigm through a coarse-to-fine scalable generation process. In MoSa, we propose a Multi-scale Token Preservation Strategy (MTPS) integrated into a hierarchical residual vector quantization variational autoencoder (RQ-VAE). MTPS employs interpolation at each hierarchical quantization to effectively retain coarse-to-fine multi-scale tokens. With this, the generative transformer supports Scalable Autoregressive (SAR) modeling, which predicts scale tokens, unlike traditional methods that predict only one token at each step. Consequently, MoSa requires only 10 inference steps, matching the number of RQ-VAE quantization layers. To address potential reconstruction degradation from frequent interpolation, we propose CAQ-VAE, a lightweight yet expressive convolution-attention hybrid VQ-VAE. CAQ-VAE enhances residual block design and incorporates attention mechanisms to better capture global dependencies. Extensive experiments show that MoSa achieves state-of-the-art generation quality and efficiency, outperforming prior methods in both fidelity and speed. On the Motion-X dataset, MoSa achieves an FID of 0.06 (versus MoMask's 0.20) while reducing inference time by 27 percent. Moreover, MoSa generalizes well to downstream tasks such as motion editing, requiring no additional fine-tuning. The code is available at https://mosa-web.github.io/MoSa-web
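To make the coarse-to-fine quantization concrete, the sketch below shows one plausible reading of multi-scale residual quantization with interpolation in the spirit of MTPS: each level quantizes the residual at a coarser temporal scale and interpolates the result back to full length before the next (finer) level refines what remains. The scale schedule, codebook sizes, and helper names are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of multi-scale residual quantization with interpolation,
# loosely following the description of MoSa's MTPS (not the authors' code).
# The scale schedule and codebook sizes below are assumptions.
import torch
import torch.nn.functional as F


def quantize(z, codebook):
    """Nearest-neighbor lookup: map each frame feature to its closest code."""
    dists = torch.cdist(z, codebook)            # (t, K)
    idx = dists.argmin(dim=-1)                  # (t,)
    return codebook[idx], idx


def multi_scale_rq(latent, codebooks, scales):
    """Coarse-to-fine residual quantization over temporal scales."""
    T, _ = latent.shape
    residual = latent.clone()
    recon = torch.zeros_like(latent)
    scale_tokens = []
    for codebook, t in zip(codebooks, scales):
        # Downsample the residual to the current (coarser) temporal scale.
        coarse = F.interpolate(residual.T.unsqueeze(0), size=t,
                               mode="linear", align_corners=False)[0].T
        q, idx = quantize(coarse, codebook)
        # Interpolate the quantized residual back to full length so the
        # next (finer) level can refine what remains.
        up = F.interpolate(q.T.unsqueeze(0), size=T,
                           mode="linear", align_corners=False)[0].T
        recon = recon + up
        residual = residual - up
        scale_tokens.append(idx)                # one token map per scale
    return recon, scale_tokens


if __name__ == "__main__":
    T, D, K = 64, 32, 512                        # illustrative sizes
    scales = [4, 8, 16, 32, 64]                  # assumed coarse-to-fine schedule
    codebooks = [torch.randn(K, D) for _ in scales]
    recon, tokens = multi_scale_rq(torch.randn(T, D), codebooks, scales)
    print([t.numel() for t in tokens], recon.shape)
```

In this reading, the generative transformer predicts all tokens of one scale per step, from coarsest to finest, which is why the number of inference steps matches the number of quantization levels.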
Related papers
- LSGQuant: Layer-Sensitivity Guided Quantization for One-Step Diffusion Real-World Video Super-Resolution [52.627063566555194]
We introduce LSGQuant, a layer-sensitivity guided quantization approach for one-step diffusion-based real-world VSR.
Our method incorporates a Dynamic Range Adaptive Quantizer (DRAQ) to fit video token activations.
Our method achieves performance nearly matching the full-precision original model and significantly exceeds existing quantization techniques.
arXiv Detail & Related papers (2026-02-03T06:53:19Z) - MoVE: Mixture of Value Embeddings -- A New Axis for Scaling Parametric Memory in Autoregressive Models [0.9222161299777548]
We introduce MoVE (Mixture of Value Embeddings), a mechanism that breaks the rigid structural coupling of model capacity to computational cost.
MoVE decouples memory from compute by introducing a global bank of learnable value embeddings shared across all attention layers.
We validate MoVE through strictly controlled experiments on two representative applications of autoregressive modeling: text generation and image generation.
arXiv Detail & Related papers (2026-01-30T12:07:23Z) - Continuous Autoregressive Language Models [56.49239051750678]
We introduce Continuous Autoregressive Language Models (CALM).
CALM uses a high-fidelity autoencoder to compress a chunk of K tokens into a single continuous vector.
We develop a comprehensive likelihood-free framework that enables robust training, evaluation, and controllable sampling.
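As a rough illustration of the chunk-compression idea described above (an assumption-laden sketch, not the CALM authors' code), the snippet below shows an autoencoder that maps K discrete tokens to a single continuous vector and decodes it back to per-token logits, so an autoregressive model can predict one vector per K-token chunk; all sizes and layer choices are illustrative.

```python
# Sketch of a chunk autoencoder: K tokens -> one continuous vector -> K logits.
# Vocabulary size, chunk size k, and hidden widths are assumed values.
import torch
import torch.nn as nn


class ChunkAutoencoder(nn.Module):
    def __init__(self, vocab=32000, d_tok=256, k=4, d_latent=512):
        super().__init__()
        self.k, self.vocab = k, vocab
        self.embed = nn.Embedding(vocab, d_tok)
        self.encoder = nn.Sequential(nn.Linear(k * d_tok, d_latent), nn.GELU(),
                                     nn.Linear(d_latent, d_latent))
        self.decoder = nn.Sequential(nn.Linear(d_latent, d_latent), nn.GELU(),
                                     nn.Linear(d_latent, k * vocab))

    def encode(self, token_ids):                 # (B, k) token ids -> (B, d_latent)
        return self.encoder(self.embed(token_ids).flatten(1))

    def decode_logits(self, z):                  # (B, d_latent) -> (B, k, vocab)
        return self.decoder(z).view(-1, self.k, self.vocab)


if __name__ == "__main__":
    ae = ChunkAutoencoder()
    ids = torch.randint(0, 32000, (2, 4))
    z = ae.encode(ids)                           # one continuous vector per chunk
    print(z.shape, ae.decode_logits(z).shape)    # (2, 512) (2, 4, 32000)
```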
arXiv Detail & Related papers (2025-10-31T17:58:11Z) - ReMoMask: Retrieval-Augmented Masked Motion Generation [13.42229775333369]
Text-to-Motion (T2M) generation aims to synthesize realistic and semantically aligned human motion sequences from natural language descriptions.
We propose ReMoMask, a unified framework integrating three key innovations.
A Bidirectional Momentum Text-Motion Model decouples negative sample scale from batch size via momentum queues, substantially improving cross-modal retrieval precision.
A Semantic Spatio-temporal Attention mechanism enforces biomechanical constraints during part-level fusion to eliminate asynchronous artifacts.
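For intuition on the momentum-queue idea mentioned above, the sketch below shows a generic MoCo-style momentum queue applied to text-motion contrastive retrieval: negatives are drawn from a FIFO queue filled by a slowly updated momentum encoder, so their number is decoupled from the batch size. This is an illustrative sketch with assumed sizes and temperature, not the ReMoMask authors' code.

```python
# Momentum queue for contrastive text-motion retrieval (illustrative sketch).
import torch
import torch.nn.functional as F


class MomentumQueue:
    def __init__(self, dim=256, size=8192):
        self.feats = F.normalize(torch.randn(size, dim), dim=-1)
        self.ptr = 0

    @torch.no_grad()
    def enqueue(self, keys):                     # keys: (B, dim), L2-normalized
        idx = (self.ptr + torch.arange(keys.shape[0])) % self.feats.shape[0]
        self.feats[idx] = keys
        self.ptr = int((self.ptr + keys.shape[0]) % self.feats.shape[0])


def contrastive_loss(text_emb, motion_emb, queue, temperature=0.07):
    """InfoNCE where negatives come from the queue, not from the batch."""
    text_emb = F.normalize(text_emb, dim=-1)
    motion_emb = F.normalize(motion_emb, dim=-1)
    pos = (text_emb * motion_emb).sum(-1, keepdim=True)        # (B, 1) matched pairs
    neg = text_emb @ queue.feats.T                              # (B, Q) queued negatives
    logits = torch.cat([pos, neg], dim=1) / temperature
    labels = torch.zeros(text_emb.shape[0], dtype=torch.long)   # positive is index 0
    return F.cross_entropy(logits, labels)


@torch.no_grad()
def momentum_update(online, momentum_model, m=0.999):
    """EMA update of the momentum (key) encoder from the online encoder."""
    for p_o, p_m in zip(online.parameters(), momentum_model.parameters()):
        p_m.mul_(m).add_(p_o, alpha=1 - m)
```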
arXiv Detail & Related papers (2025-08-04T16:56:35Z) - MOGO: Residual Quantized Hierarchical Causal Transformer for High-Quality and Real-Time 3D Human Motion Generation [3.6669020073583756]
MOGO is a novel autoregressive framework tailored for efficient and real-time 3D motion generation.
MOGO comprises two key components: MoSA-VQ, a motion scale-adaptive residual vector quantization module, and RQHC-Transformer, a residual quantized hierarchical causal transformer.
To enhance semantic fidelity, we introduce a text condition alignment mechanism that improves motion decoding under textual control.
arXiv Detail & Related papers (2025-06-06T10:26:54Z) - Plug-and-Play Context Feature Reuse for Efficient Masked Generation [36.563229330549284]
Masked generative models (MGMs) have emerged as a powerful framework for image synthesis.
We introduce ReCAP (Reused Context-Aware Prediction), a plug-and-play module that accelerates inference in MGMs.
arXiv Detail & Related papers (2025-05-25T10:57:35Z) - Mixture-of-Modules: Reinventing Transformers as Dynamic Assemblies of Modules [96.21649779507831]
We propose a novel architecture dubbed mixture-of-modules (MoM).
MoM is motivated by an intuition that any layer, regardless of its position, can be used to compute a token.
We show that MoM provides not only a unified framework for Transformers but also a flexible and learnable approach for reducing redundancy.
arXiv Detail & Related papers (2024-07-09T08:50:18Z) - STMT: A Spatial-Temporal Mesh Transformer for MoCap-Based Action Recognition [50.064502884594376]
We study the problem of human action recognition using motion capture (MoCap) sequences.
We propose a novel Spatial-Temporal Mesh Transformer (STMT) to directly model the mesh sequences.
The proposed method achieves state-of-the-art performance compared to skeleton-based and point-cloud-based models.
arXiv Detail & Related papers (2023-03-31T16:19:27Z) - Progressive Multi-stage Interactive Training in Mobile Network for Fine-grained Recognition [8.727216421226814]
We propose a Progressive Multi-Stage Interactive training method with a Recursive Mosaic Generator (RMG-PMSI).
First, we propose a Recursive Mosaic Generator (RMG) that generates images with different granularities in different phases.
Then, the features of different stages pass through a Multi-Stage Interaction (MSI) module, which strengthens and complements the corresponding features of different stages.
Experiments on three prestigious fine-grained benchmarks show that RMG-PMSI can significantly improve the performance with good robustness and transferability.
arXiv Detail & Related papers (2021-12-08T10:50:03Z) - Unsupervised Motion Representation Learning with Capsule Autoencoders [54.81628825371412]
Motion Capsule Autoencoder (MCAE) models motion in a two-level hierarchy.
MCAE is evaluated on a novel Trajectory20 motion dataset and various real-world skeleton-based human action datasets.
arXiv Detail & Related papers (2021-10-01T16:52:03Z)