ReMoMask: Retrieval-Augmented Masked Motion Generation
- URL: http://arxiv.org/abs/2508.02605v1
- Date: Mon, 04 Aug 2025 16:56:35 GMT
- Title: ReMoMask: Retrieval-Augmented Masked Motion Generation
- Authors: Zhengdao Li, Siheng Wang, Zeyu Zhang, Hao Tang
- Abstract summary: Text-to-Motion (T2M) generation aims to synthesize realistic and semantically aligned human motion sequences from natural language descriptions. We propose ReMoMask, a unified framework integrating three key innovations. A Bidirectional Momentum Text-Motion Model decouples negative sample scale from batch size via momentum queues, substantially improving cross-modal retrieval precision. A Semantic Spatio-temporal Attention mechanism enforces biomechanical constraints during part-level fusion to eliminate asynchronous artifacts.
- Score: 8.471755159366221
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Text-to-Motion (T2M) generation aims to synthesize realistic and semantically aligned human motion sequences from natural language descriptions. However, current approaches face dual challenges: Generative models (e.g., diffusion models) suffer from limited diversity, error accumulation, and physical implausibility, while Retrieval-Augmented Generation (RAG) methods exhibit diffusion inertia, partial-mode collapse, and asynchronous artifacts. To address these limitations, we propose ReMoMask, a unified framework integrating three key innovations: 1) A Bidirectional Momentum Text-Motion Model decouples negative sample scale from batch size via momentum queues, substantially improving cross-modal retrieval precision; 2) A Semantic Spatio-temporal Attention mechanism enforces biomechanical constraints during part-level fusion to eliminate asynchronous artifacts; 3) RAG-Classifier-Free Guidance incorporates minor unconditional generation to enhance generalization. Built upon MoMask's RVQ-VAE, ReMoMask efficiently generates temporally coherent motions in minimal steps. Extensive experiments on standard benchmarks demonstrate the state-of-the-art performance of ReMoMask, achieving a 3.88% and 10.97% improvement in FID scores on HumanML3D and KIT-ML, respectively, compared to the previous SOTA method RAG-T2M. Code: https://github.com/AIGeeksGroup/ReMoMask. Website: https://aigeeksgroup.github.io/ReMoMask.
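The momentum-queue retrieval idea in innovation (1) follows the general MoCo-style contrastive setup: keys produced by slowly updated momentum encoders are pushed into FIFO queues, so the pool of negatives is decoupled from the batch size. The sketch below illustrates only that general mechanism; the class name, encoder interfaces, dimensions, and hyperparameters are illustrative assumptions, not ReMoMask's actual BMM implementation.
```python
# Minimal sketch (NOT the authors' code) of MoCo-style bidirectional
# text-motion contrastive retrieval with momentum encoders and key queues.
import copy
import torch
import torch.nn.functional as F

class MomentumTextMotionRetrieval(torch.nn.Module):
    def __init__(self, text_enc, motion_enc, dim=512, queue_size=8192,
                 momentum=0.999, temperature=0.07):
        super().__init__()
        self.text_enc, self.motion_enc = text_enc, motion_enc    # online encoders
        self.text_enc_m = copy.deepcopy(text_enc)                # momentum (key) encoders
        self.motion_enc_m = copy.deepcopy(motion_enc)
        self.m, self.t = momentum, temperature
        # queues of past keys: the negative pool no longer depends on batch size
        self.register_buffer("text_queue", F.normalize(torch.randn(queue_size, dim), dim=1))
        self.register_buffer("motion_queue", F.normalize(torch.randn(queue_size, dim), dim=1))

    @torch.no_grad()
    def _momentum_update(self):
        for online, target in ((self.text_enc, self.text_enc_m),
                               (self.motion_enc, self.motion_enc_m)):
            for p, p_m in zip(online.parameters(), target.parameters()):
                p_m.data.mul_(self.m).add_(p.data, alpha=1.0 - self.m)

    def _info_nce(self, q, k_pos, queue):
        pos = (q * k_pos).sum(dim=1, keepdim=True)     # (B, 1) positive similarity
        neg = q @ queue.t()                            # (B, K) similarities to queued keys
        logits = torch.cat([pos, neg], dim=1) / self.t
        labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
        return F.cross_entropy(logits, labels)         # the positive is class 0

    def forward(self, text_batch, motion_batch):
        q_t = F.normalize(self.text_enc(text_batch), dim=1)       # online queries
        q_m = F.normalize(self.motion_enc(motion_batch), dim=1)
        with torch.no_grad():
            self._momentum_update()
            k_t = F.normalize(self.text_enc_m(text_batch), dim=1)   # momentum keys
            k_m = F.normalize(self.motion_enc_m(motion_batch), dim=1)
        loss = (self._info_nce(q_t, k_m, self.motion_queue)    # text -> motion
                + self._info_nce(q_m, k_t, self.text_queue))   # motion -> text
        # simplified FIFO update: newest keys in, oldest keys out
        self.text_queue = torch.cat([k_t, self.text_queue])[: self.text_queue.size(0)]
        self.motion_queue = torch.cat([k_m, self.motion_queue])[: self.motion_queue.size(0)]
        return loss
```
The key property is visible in `_info_nce`: the number of negatives equals the queue length, not the batch size, which is what "decouples negative sample scale from batch size" refers to.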
Related papers
- M2DAO-Talker: Harmonizing Multi-granular Motion Decoupling and Alternating Optimization for Talking-head Generation [65.08520614570288]
We reformulate talking head generation into a unified framework comprising video preprocessing, motion representation, and rendering reconstruction. M2DAO-Talker achieves state-of-the-art performance, with a 2.43 dB PSNR improvement in generation quality and a 0.64 gain in user-evaluated video realness.
arXiv Detail & Related papers (2025-07-11T04:48:12Z) - MOGO: Residual Quantized Hierarchical Causal Transformer for High-Quality and Real-Time 3D Human Motion Generation [3.6669020073583756]
MOGO is a novel autoregressive framework tailored for efficient and real-time 3D motion generation. MOGO comprises two key components: MoSA-VQ, a motion scale-adaptive residual vector quantization module, and RQHC-Transformer, a residual quantized hierarchical causal transformer. To enhance semantic fidelity, we introduce a text condition alignment mechanism that improves motion decoding under textual control.
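Both MOGO's MoSA-VQ module and the RVQ-VAE that ReMoMask builds on rely on residual vector quantization, where each codebook level quantizes whatever residual the previous levels left behind. Below is a toy NumPy sketch of that general technique only; the codebook sizes and feature dimension are arbitrary assumptions, not either paper's configuration.
```python
# Toy residual vector quantization (RVQ) encoder, for illustration only.
import numpy as np

def rvq_encode(x, codebooks):
    """Each codebook level quantizes the residual left by previous levels."""
    recon, codes = np.zeros_like(x), []
    for cb in codebooks:                                          # cb: (K, D)
        residual = x - recon
        idx = int(np.argmin(((residual - cb) ** 2).sum(axis=1)))  # nearest code
        codes.append(idx)
        recon = recon + cb[idx]
    return codes, recon

# toy usage: 4 levels of 256 codes over 64-d features (arbitrary numbers)
rng = np.random.default_rng(0)
codebooks = [0.5 ** level * rng.normal(size=(256, 64)) for level in range(4)]
x = rng.normal(size=64)
codes, recon = rvq_encode(x, codebooks)
print(codes, float(np.linalg.norm(x - recon)))   # residual error shrinks with depth
```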
arXiv Detail & Related papers (2025-06-06T10:26:54Z) - Towards Robust and Controllable Text-to-Motion via Masked Autoregressive Diffusion [33.9786226622757]
We propose MoMADiff, a robust motion generation framework for generating 3D human motion from text descriptions. Our model supports flexible user-provided specification, enabling precise control over both spatial and temporal aspects of motion synthesis. Our method consistently outperforms state-of-the-art models in motion quality, instruction fidelity, and adherence.
arXiv Detail & Related papers (2025-05-16T09:06:15Z) - ReCoM: Realistic Co-Speech Motion Generation with Recurrent Embedded Transformer [58.49950218437718]
We present ReCoM, an efficient framework for generating high-fidelity and generalizable human body motions synchronized with speech. The core innovation lies in the Recurrent Embedded Transformer (RET), which integrates Dynamic Embedding Regularization (DER) into a Vision Transformer (ViT) core architecture. To enhance model robustness, we incorporate the proposed DER strategy, which equips the model with dual capabilities of noise resistance and cross-domain generalization.
arXiv Detail & Related papers (2025-03-27T16:39:40Z) - Reinforced Model Merging [53.84354455400038]
We present an innovative framework termed Reinforced Model Merging (RMM), which encompasses an environment and agent tailored for merging tasks. By utilizing data subsets during the evaluation process, we address the bottleneck in the reward feedback phase, thereby accelerating RMM by up to 100 times.
arXiv Detail & Related papers (2025-03-27T08:52:41Z) - Motion Anything: Any to Motion Generation [24.769413146731264]
Motion Anything is a multimodal motion generation framework. Our model adaptively encodes multimodal conditions, including text and music, improving controllability. The new Text-Music-Dance dataset consists of 2,153 pairs of text, music, and dance, making it twice the size of AIST++.
arXiv Detail & Related papers (2025-03-10T06:04:31Z) - Text-guided 3D Human Motion Generation with Keyframe-based Parallel Skip Transformer [62.29951737214263]
Existing algorithms directly generate the full sequence, which is expensive and prone to errors.
We propose KeyMotion, which generates plausible human motion sequences corresponding to input text.
We use a Variational Autoencoder (VAE) with Kullback-Leibler regularization to project the motion into a latent space.
For the reverse diffusion, we propose a novel Parallel Skip Transformer that performs cross-modal attention between the design latents and text condition.
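The Kullback-Leibler regularization mentioned above is the standard VAE objective: reconstruct the input while keeping the approximate posterior close to a unit Gaussian, so the diffusion can operate in a well-behaved latent space. A generic sketch follows (not KeyMotion's actual model; the encoder/decoder interfaces and the beta weight are placeholder assumptions).
```python
# Generic KL-regularized VAE objective, for illustration only.
import torch
import torch.nn.functional as F

def vae_loss(x, encoder, decoder, beta=1e-4):
    mu, logvar = encoder(x)                                   # parameters of q(z|x)
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization trick
    x_hat = decoder(z)
    recon = F.mse_loss(x_hat, x)                              # reconstruction term
    # KL( N(mu, sigma^2) || N(0, I) ): the Kullback-Leibler regularization
    kl = -0.5 * torch.mean(1.0 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl
```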
arXiv Detail & Related papers (2024-05-24T11:12:37Z) - BAMM: Bidirectional Autoregressive Motion Model [14.668729995275807]
Bidirectional Autoregressive Motion Model (BAMM) is a novel text-to-motion generation framework.
BAMM consists of two key components: a motion tokenizer that transforms 3D human motion into discrete tokens in latent space, and a masked self-attention transformer that autoregressively predicts randomly masked tokens.
This design enables BAMM to simultaneously achieve high-quality motion generation with enhanced usability and built-in motion editability.
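Masked motion transformers of this kind (BAMM, and the MoMask backbone ReMoMask builds on) typically generate by starting from an all-masked token sequence and iteratively committing high-confidence predictions while re-masking low-confidence ones, which is how coherent motion emerges in a handful of steps. Below is a MaskGIT-style sketch of that decoding loop; the transformer interface and the cosine schedule are illustrative assumptions, not either paper's exact procedure.
```python
# Iterative masked-token decoding sketch (MaskGIT-style), for illustration only.
import math
import torch

def masked_decode(transformer, length, mask_id, steps=10, device="cpu"):
    tokens = torch.full((1, length), mask_id, dtype=torch.long, device=device)
    for step in range(steps):
        logits = transformer(tokens)                      # assumed to return (1, L, vocab)
        conf, pred = logits.softmax(dim=-1).max(dim=-1)   # per-position confidence
        still_masked = tokens.eq(mask_id)
        # never re-mask positions committed in earlier steps
        conf = torch.where(still_masked, conf, torch.full_like(conf, float("inf")))
        tokens = torch.where(still_masked, pred, tokens)  # commit current predictions
        # cosine schedule: the fraction of positions kept masked shrinks each step
        num_masked = int(length * math.cos(math.pi / 2 * (step + 1) / steps))
        if num_masked > 0:
            remask = conf.topk(num_masked, dim=-1, largest=False).indices
            tokens[0, remask[0]] = mask_id                # re-mask least confident slots
    return tokens
```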
arXiv Detail & Related papers (2024-03-28T14:04:17Z) - FineMoGen: Fine-Grained Spatio-Temporal Motion Generation and Editing [56.29102849106382]
FineMoGen is a diffusion-based motion generation and editing framework.
It can synthesize fine-grained motions, with spatio-temporal composition according to the user instructions.
FineMoGen further enables zero-shot motion editing capabilities with the aid of modern large language models.
arXiv Detail & Related papers (2023-12-22T16:56:02Z) - DiverseMotion: Towards Diverse Human Motion Generation via Discrete Diffusion [70.33381660741861]
We present DiverseMotion, a new approach for synthesizing high-quality human motions conditioned on textual descriptions.
We show that our DiverseMotion achieves state-of-the-art motion quality and competitive motion diversity.
arXiv Detail & Related papers (2023-09-04T05:43:48Z) - ReMoDiffuse: Retrieval-Augmented Motion Diffusion Model [33.64263969970544]
3D human motion generation is crucial for the creative industry.
Recent advances rely on generative models with domain knowledge for text-driven motion generation.
We propose ReMoDiffuse, a diffusion-model-based motion generation framework.
arXiv Detail & Related papers (2023-04-03T16:29:00Z)