T2M Mamba: Motion Periodicity-Saliency Coupling Approach for Stable Text-Driven Motion Generation
- URL: http://arxiv.org/abs/2602.01352v1
- Date: Sun, 01 Feb 2026 17:42:53 GMT
- Title: T2M Mamba: Motion Periodicity-Saliency Coupling Approach for Stable Text-Driven Motion Generation
- Authors: Xingzu Zhan, Chen Xie, Honghang Chen, Yixun Lin, Xiaochun Mai
- Abstract summary: Text-to-motion generation has attracted increasing attention in fields such as avatar animation and humanoid robotic interaction. Existing models treat motion periodicity and saliency as independent factors, overlooking their coupling and causing generation drift in long sequences. We propose T2M Mamba to address these limitations, built around a Periodicity-Saliency Aware Mamba.
- Score: 3.6564162676635363
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-to-motion generation, which converts natural-language motion descriptions into coherent 3D human motion sequences, has attracted increasing attention in fields such as avatar animation and humanoid robotic interaction. Though existing models have achieved high fidelity, they still suffer from two core limitations: (i) they treat motion periodicity and keyframe saliency as independent factors, overlooking their coupling and causing generation drift in long sequences; (ii) they are fragile to semantically equivalent paraphrases, where minor synonym substitutions distort textual embeddings, and the distortion propagates through the decoder, producing unstable or erroneous motions. In this work, we propose T2M Mamba to address these limitations by (i) introducing Periodicity-Saliency Aware Mamba, which uses novel algorithms for keyframe weight estimation via enhanced Density Peaks Clustering and motion periodicity estimation via FFT-accelerated autocorrelation to capture coupled dynamics with minimal computational overhead, and (ii) constructing a Periodic Differential Cross-modal Alignment Module (PDCAM) for robust alignment of textual and motion embeddings. Extensive experiments on the HumanML3D and KIT-ML datasets confirm the effectiveness of our approach, which achieves an FID of 0.068 and consistent gains on all other metrics.
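The abstract does not spell out the implementation, but FFT-accelerated autocorrelation is a standard way to estimate a dominant motion period via the Wiener-Khinchin theorem (the autocorrelation is the inverse FFT of the power spectrum). The sketch below is a minimal, hypothetical illustration of that step on per-frame pose features; the function name, feature layout, and the minimum-lag heuristic are our own assumptions rather than the paper's code.

```python
import numpy as np

def estimate_period(motion, min_lag=5):
    """Estimate the dominant period (in frames) of a motion feature sequence.

    motion: array of shape (T, D) -- per-frame pose features.
    Uses the Wiener-Khinchin theorem: autocorrelation = IFFT of the power
    spectrum, giving O(T log T) cost instead of the naive O(T^2).
    """
    x = motion - motion.mean(axis=0, keepdims=True)   # remove the static offset
    T = x.shape[0]
    n = 2 * T                                          # zero-pad to avoid circular wrap-around
    spec = np.fft.rfft(x, n=n, axis=0)
    power = (spec * np.conj(spec)).real
    acf = np.fft.irfft(power, n=n, axis=0)[:T]         # per-dimension autocorrelation
    acf = acf.sum(axis=1)                              # aggregate over feature dimensions
    acf /= acf[0] + 1e-8                               # normalize so lag 0 == 1
    lag = min_lag + int(np.argmax(acf[min_lag:]))      # strongest repetition beyond a minimum lag
    return lag, acf

if __name__ == "__main__":
    t = np.arange(200)[:, None]
    walk = np.sin(2 * np.pi * t / 25.0) + 0.05 * np.random.randn(200, 1)  # ~25-frame gait cycle
    period, _ = estimate_period(walk)
    print("estimated period:", period)   # expected to land close to 25
```

On this toy "gait" signal the estimated lag should sit near the 25-frame cycle; a saliency-aware variant would additionally weight frames by keyframe importance, which is the coupling the paper targets.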
Related papers
- BiMotion: B-spline Motion for Text-guided Dynamic 3D Character Generation [31.077229364298443]
Text-guided dynamic 3D character generation has advanced rapidly, yet producing high-quality motion that faithfully reflects rich textual descriptions remains challenging. Existing methods tend to generate limited sub-actions or incoherent motion due to fixed-length temporal inputs and discrete frame-wise representations that fail to capture rich motion semantics. We address these limitations by representing motion with continuous, differentiable B-spline curves, enabling more effective motion generation without modifying the capabilities of the underlying generative model.
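The summary does not give BiMotion's exact formulation; the following is only a minimal sketch of the general idea of representing a sampled joint trajectory with a continuous, differentiable cubic B-spline (here via SciPy), so that motion can be resampled at any temporal resolution and differentiated analytically. The toy trajectory and shapes are illustrative assumptions.

```python
import numpy as np
from scipy.interpolate import make_interp_spline

# Hypothetical example: a discrete per-frame trajectory for one joint (T frames, 3 coordinates).
T = 60
frames = np.linspace(0.0, 1.0, T)
trajectory = np.stack([np.sin(2 * np.pi * frames),
                       np.cos(2 * np.pi * frames),
                       0.1 * frames], axis=1)

# Fit a cubic (k=3) B-spline per coordinate; the curve is continuous and differentiable,
# so motion can be queried at any time rather than at fixed frame indices.
spline = make_interp_spline(frames, trajectory, k=3)

dense_t = np.linspace(0.0, 1.0, 600)        # 10x temporal upsampling
positions = spline(dense_t)                 # (600, 3) positions along the motion curve
velocities = spline.derivative()(dense_t)   # analytic first derivative of the curve
print(positions.shape, velocities.shape)
```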
arXiv Detail & Related papers (2026-02-21T15:40:37Z) - Temporal Consistency-Aware Text-to-Motion Generation [41.71400323450202]
We propose TCA-T2M, a framework for temporal consistency-aware T2M generation. Our approach introduces a temporal consistency-aware spatial VQ-VAE for cross-sequence temporal alignment. Experiments on the HumanML3D and KIT-ML benchmarks demonstrate that TCA-T2M achieves state-of-the-art performance.
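Details of the temporal consistency-aware spatial VQ-VAE are not given here; the sketch below only illustrates the generic vector-quantization step such a motion tokenizer builds on: a nearest-neighbour codebook lookup with a straight-through gradient. Shapes, codebook size, and the function name are illustrative assumptions.

```python
import torch

def vector_quantize(z, codebook):
    """Nearest-neighbour codebook lookup with a straight-through gradient.

    z:        (B, T, D) continuous latents from a motion encoder.
    codebook: (K, D) learnable code vectors.
    Returns quantized latents and the chosen token indices.
    """
    flat = z.reshape(-1, z.shape[-1])                       # (B*T, D)
    d = torch.cdist(flat, codebook)                         # pairwise L2 distances to codes
    idx = d.argmin(dim=1)                                   # nearest code per latent
    q = codebook[idx].reshape(z.shape)                      # (B, T, D) quantized latents
    q = z + (q - z).detach()                                # straight-through estimator
    return q, idx.reshape(z.shape[:-1])

z = torch.randn(2, 196, 64)              # toy batch: 2 sequences, 196 frames, 64-d latents
codebook = torch.randn(512, 64)          # 512-entry codebook
q, tokens = vector_quantize(z, codebook)
print(q.shape, tokens.shape)             # torch.Size([2, 196, 64]) torch.Size([2, 196])
```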
arXiv Detail & Related papers (2026-02-20T08:17:01Z) - Unified Multi-Modal Interactive & Reactive 3D Motion Generation via Rectified Flow [17.95248351806955]
We introduce DualFlow, a framework for multi-modal two-person motion generation. It synthesises motion from diverse inputs, including text, music, and prior motion sequences, and produces temporally coherent, rhythmically synchronized motions, setting a new state of the art in multi-modal human motion generation.
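DualFlow's architecture is not described beyond its use of rectified flow; as a hedged sketch, the snippet below shows the generic rectified-flow recipe applied to a flattened motion vector: regress the straight-line velocity between noise and data, then sample by Euler integration. The toy network, dimensions, and step count are assumptions for illustration only, and conditioning on text or music is omitted.

```python
import torch
import torch.nn as nn

class VelocityField(nn.Module):
    """Toy velocity-field network; a real model would condition on text/music embeddings."""
    def __init__(self, dim=63):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 256), nn.SiLU(), nn.Linear(256, dim))

    def forward(self, x_t, t):
        return self.net(torch.cat([x_t, t.expand(x_t.shape[0], 1)], dim=-1))

def rectified_flow_loss(model, x1):
    """Regress the straight-line velocity x1 - x0 along the linear interpolation path."""
    x0 = torch.randn_like(x1)                 # noise sample
    t = torch.rand(x1.shape[0], 1)            # random time in [0, 1]
    x_t = (1 - t) * x0 + t * x1               # interpolate between noise and data
    return ((model(x_t, t) - (x1 - x0)) ** 2).mean()

@torch.no_grad()
def sample(model, n, dim=63, steps=50):
    """Integrate dx/dt = v(x, t) from noise (t=0) to data (t=1) with Euler steps."""
    x = torch.randn(n, dim)
    for i in range(steps):
        t = torch.full((1, 1), i / steps)
        x = x + model(x, t) / steps
    return x

model = VelocityField()
loss = rectified_flow_loss(model, torch.randn(16, 63))   # one toy training step's loss
motions = sample(model, n=4)                             # 4 sampled motion vectors
print(loss.item(), motions.shape)
```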
arXiv Detail & Related papers (2025-09-28T22:36:18Z) - MotionVerse: A Unified Multimodal Framework for Motion Comprehension, Generation and Editing [53.98607267063729]
MotionVerse is a framework to comprehend, generate, and edit human motion in both single-person and multi-person scenarios. We employ a motion tokenizer with residual quantization, which converts continuous motion sequences into multi-stream discrete tokens. We also introduce a Delay Parallel Modeling strategy, which temporally staggers the encoding of residual token streams.
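The summary describes the tokenizer only at a high level; the sketch below illustrates plain residual quantization, where each codebook encodes the residual left by the previous one, yielding one discrete token stream per codebook. The number of streams, codebook size, and shapes are illustrative, and the Delay Parallel Modeling strategy (staggering how these streams are decoded) is not shown.

```python
import torch

def residual_quantize(z, codebooks):
    """Quantize latents with a stack of codebooks, each coding the previous residual.

    z:         (T, D) continuous motion latents.
    codebooks: list of (K, D) tensors, one per residual stream.
    Returns one discrete token stream per codebook plus the summed reconstruction.
    """
    residual = z
    recon = torch.zeros_like(z)
    streams = []
    for cb in codebooks:
        idx = torch.cdist(residual, cb).argmin(dim=1)   # nearest code for the current residual
        q = cb[idx]
        streams.append(idx)                             # one discrete token stream
        recon = recon + q
        residual = residual - q                         # what is left for the next codebook
    return streams, recon

z = torch.randn(196, 64)
codebooks = [torch.randn(512, 64) for _ in range(4)]    # 4 residual streams
streams, recon = residual_quantize(z, codebooks)
print(len(streams), streams[0].shape, recon.shape)      # 4 torch.Size([196]) torch.Size([196, 64])
```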
arXiv Detail & Related papers (2025-09-28T04:20:56Z) - WaMo: Wavelet-Enhanced Multi-Frequency Trajectory Analysis for Fine-Grained Text-Motion Retrieval [7.349030413222046]
Text-Motion Retrieval aims to retrieve 3D motion sequences semantically relevant to text descriptions. We propose WaMo, a novel wavelet-based multi-frequency feature extraction framework. WaMo captures part-specific and time-varying motion details across multiple resolutions on body joints.
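WaMo's specific wavelet design is not detailed in the summary; as a minimal illustration, the snippet below applies a multi-level Haar decomposition to a single joint trajectory, separating slow, coarse movement from fast, fine detail. The wavelet choice, level count, and toy signal are assumptions.

```python
import numpy as np

def haar_decompose(signal, levels=3):
    """Multi-level Haar wavelet decomposition of a 1-D joint trajectory.

    Returns [detail_level1, ..., detail_levelN, approximation]: detail bands hold
    fast, high-frequency motion; the final approximation holds the slow, coarse trend.
    Assumes len(signal) is divisible by 2**levels.
    """
    bands = []
    approx = np.asarray(signal, dtype=float)
    for _ in range(levels):
        even, odd = approx[0::2], approx[1::2]
        detail = (even - odd) / np.sqrt(2.0)      # high-frequency band at this scale
        approx = (even + odd) / np.sqrt(2.0)      # low-frequency band, passed to the next level
        bands.append(detail)
    bands.append(approx)
    return bands

t = np.arange(64)
trajectory = np.sin(2 * np.pi * t / 32.0) + 0.1 * np.sin(2 * np.pi * t / 4.0)  # slow gait + fast jitter
for i, band in enumerate(haar_decompose(trajectory)):
    print(f"band {i}: length {len(band)}, energy {np.sum(band ** 2):.2f}")
```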
arXiv Detail & Related papers (2025-08-05T11:44:26Z) - M2DAO-Talker: Harmonizing Multi-granular Motion Decoupling and Alternating Optimization for Talking-head Generation [65.48046909056468]
We reformulate talking-head generation into a unified framework comprising video preprocessing, motion representation, and rendering reconstruction. M2DAO-Talker achieves state-of-the-art performance, with a 2.43 dB PSNR improvement in generation quality and a 0.64 gain in user-evaluated video realness.
arXiv Detail & Related papers (2025-07-11T04:48:12Z) - MotionGPT3: Human Motion as a Second Modality [28.616340011811843]
MotionGPT3 is a bimodal motion-language model for both understanding and generation. A dual-stream Transformer with shared attention preserves modality-specific routes while enabling controlled, bidirectional information flow. Experiments show that MotionGPT3 achieves 2x faster convergence in training loss and up to 4x faster convergence in validation.
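How the shared attention is wired is not specified here; the block below is one plausible, simplified reading: text and motion tokens keep modality-specific feed-forward routes but attend jointly over the concatenated sequence, letting information flow in both directions. Dimensions, depth, and the residual wiring are illustrative guesses rather than MotionGPT3's actual layer.

```python
import torch
import torch.nn as nn

class SharedAttentionBlock(nn.Module):
    """Dual-stream block: one attention over both streams, modality-specific FFN routes."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn_text = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.ffn_motion = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, text_tokens, motion_tokens):
        n_text = text_tokens.shape[1]
        joint = torch.cat([text_tokens, motion_tokens], dim=1)   # shared attention over both streams
        mixed, _ = self.attn(joint, joint, joint)
        text_out, motion_out = mixed[:, :n_text], mixed[:, n_text:]
        # modality-specific feed-forward routes with residual connections
        return text_tokens + self.ffn_text(text_out), motion_tokens + self.ffn_motion(motion_out)

block = SharedAttentionBlock()
text = torch.randn(2, 12, 256)     # 12 text tokens per sequence
motion = torch.randn(2, 48, 256)   # 48 motion tokens per sequence
t_out, m_out = block(text, motion)
print(t_out.shape, m_out.shape)
```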
arXiv Detail & Related papers (2025-06-30T17:42:22Z) - ReCoM: Realistic Co-Speech Motion Generation with Recurrent Embedded Transformer [58.49950218437718]
We present ReCoM, an efficient framework for generating high-fidelity and generalizable human body motions synchronized with speech. The core innovation lies in the Recurrent Embedded Transformer (RET), which integrates Dynamic Embedding Regularization (DER) into a Vision Transformer (ViT) core architecture. To enhance model robustness, we incorporate the proposed DER strategy, which equips the model with dual capabilities of noise resistance and cross-domain generalization.
arXiv Detail & Related papers (2025-03-27T16:39:40Z) - FTMoMamba: Motion Generation with Frequency and Text State Space Models [53.60865359814126]
We propose a novel diffusion-based FTMoMamba framework equipped with a Frequency State Space Model and a Text State Space Model.
To learn fine-grained representation, FreqSSM decomposes sequences into low-frequency and high-frequency components.
To ensure consistency between text and motion, TextSSM encodes text features at the sentence level.
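FreqSSM's decomposition is not specified beyond splitting low- and high-frequency components; the snippet below shows a minimal FFT-based version of that split along the temporal axis of a motion feature sequence, with the cutoff ratio and feature dimension chosen only for illustration.

```python
import torch

def frequency_split(motion, keep_ratio=0.25):
    """Split a motion sequence into low- and high-frequency components along time.

    motion: (T, D) pose features. The lowest `keep_ratio` fraction of temporal
    frequencies forms the low-frequency part (overall posture/trend); the rest
    forms the high-frequency part (fine, fast variation). low + high == motion.
    """
    spec = torch.fft.rfft(motion, dim=0)                 # (T//2 + 1, D) temporal spectrum
    cutoff = max(1, int(spec.shape[0] * keep_ratio))
    low_spec = torch.zeros_like(spec)
    low_spec[:cutoff] = spec[:cutoff]                    # keep only the slow frequencies
    low = torch.fft.irfft(low_spec, n=motion.shape[0], dim=0)
    high = motion - low
    return low, high

motion = torch.randn(196, 263)                           # toy pose-feature sequence
low, high = frequency_split(motion)
print(low.shape, high.shape)                             # both (196, 263)
```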
arXiv Detail & Related papers (2024-11-26T15:48:12Z) - BAMM: Bidirectional Autoregressive Motion Model [14.668729995275807]
Bidirectional Autoregressive Motion Model (BAMM) is a novel text-to-motion generation framework.
BAMM consists of two key components: a motion tokenizer that transforms 3D human motion into discrete tokens in latent space, and a masked self-attention transformer that autoregressively predicts randomly masked tokens.
This design enables BAMM to achieve high-quality motion generation together with enhanced usability and built-in motion editability.
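The summary names the two components but not their exact configuration; the sketch below illustrates the masked-prediction idea on discrete motion tokens: randomly replace a subset of tokens with a mask id and train a Transformer encoder to recover them. Vocabulary size, depth, mask ratio, and class names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MaskedMotionTransformer(nn.Module):
    """Predict randomly masked discrete motion tokens with a Transformer encoder."""
    def __init__(self, vocab=512, dim=256, heads=4, layers=4, max_len=256):
        super().__init__()
        self.mask_id = vocab                                   # extra id reserved for [MASK]
        self.embed = nn.Embedding(vocab + 1, dim)
        self.pos = nn.Parameter(torch.zeros(1, max_len, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens, mask_ratio=0.5):
        masked = tokens.clone()
        mask = torch.rand_like(tokens, dtype=torch.float) < mask_ratio
        masked[mask] = self.mask_id                            # hide a random subset of tokens
        h = self.encoder(self.embed(masked) + self.pos[:, :tokens.shape[1]])
        logits = self.head(h)                                  # (B, T, vocab)
        loss = nn.functional.cross_entropy(logits[mask], tokens[mask])  # score only masked positions
        return loss, logits

tokens = torch.randint(0, 512, (2, 96))    # toy batch of tokenized motion sequences
loss, _ = MaskedMotionTransformer()(tokens)
print(loss.item())
```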
arXiv Detail & Related papers (2024-03-28T14:04:17Z) - DiverseMotion: Towards Diverse Human Motion Generation via Discrete Diffusion [70.33381660741861]
We present DiverseMotion, a new approach for synthesizing high-quality human motions conditioned on textual descriptions.
We show that DiverseMotion achieves state-of-the-art motion quality and competitive motion diversity.
arXiv Detail & Related papers (2023-09-04T05:43:48Z) - Modeling Continuous Motion for 3D Point Cloud Object Tracking [54.48716096286417]
This paper presents a novel approach that views each tracklet as a continuous stream.
At each timestamp, only the current frame is fed into the network to interact with multi-frame historical features stored in a memory bank.
To enhance the utilization of multi-frame features for robust tracking, a contrastive sequence enhancement strategy is proposed.
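The paper's exact interaction mechanism is not given in this summary; the sketch below shows one simple reading of a per-frame memory bank: current-frame features act as queries into a FIFO buffer of historical features via cross-attention, and the (detached) enhanced features are pushed back into the bank. The contrastive sequence enhancement strategy is not modelled here; all names and sizes are illustrative.

```python
import torch
import torch.nn as nn

class MemoryBankTracker(nn.Module):
    """Current-frame features attend to a FIFO bank of historical frame features."""
    def __init__(self, dim=128, heads=4, capacity=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.capacity = capacity
        self.memory = []                                      # list of (1, N, dim) past features

    def forward(self, current):
        """current: (1, N, dim) features of the current frame only."""
        if self.memory:
            bank = torch.cat(self.memory, dim=1)              # (1, N * len(memory), dim)
            fused, _ = self.cross_attn(current, bank, bank)   # query = now, key/value = history
            current = current + fused
        self.memory.append(current.detach())                  # store the enhanced features
        if len(self.memory) > self.capacity:
            self.memory.pop(0)                                # keep a fixed-size sliding window
        return current

tracker = MemoryBankTracker()
for t in range(5):
    feats = torch.randn(1, 64, 128)      # toy per-frame point features
    out = tracker(feats)
print(out.shape, len(tracker.memory))    # torch.Size([1, 64, 128]) 5
```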
arXiv Detail & Related papers (2023-03-14T02:58:27Z)