Reconstruction-Anchored Diffusion Model for Text-to-Motion Generation
- URL: http://arxiv.org/abs/2601.14788v1
- Date: Wed, 21 Jan 2026 09:11:45 GMT
- Title: Reconstruction-Anchored Diffusion Model for Text-to-Motion Generation
- Authors: Yifei Liu, Changxing Ding, Ling Guo, Huaiguang Jiang, Qiong Cao,
- Abstract summary: Diffusion models have seen widespread adoption for text-driven human motion generation and related tasks. Current motion diffusion models face two major limitations: a representational gap caused by pre-trained text encoders that lack motion-specific information, and error propagation during the iterative denoising process. This paper introduces the Reconstruction-Anchored Diffusion Model (RAM) to address these challenges.
- Score: 34.87535133080741
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diffusion models have seen widespread adoption for text-driven human motion generation and related tasks due to their impressive generative capabilities and flexibility. However, current motion diffusion models face two major limitations: a representational gap caused by pre-trained text encoders that lack motion-specific information, and error propagation during the iterative denoising process. This paper introduces the Reconstruction-Anchored Diffusion Model (RAM) to address these challenges. First, RAM leverages a motion latent space as intermediate supervision for text-to-motion generation. To this end, RAM co-trains a motion reconstruction branch with two key objective functions: self-regularization to enhance the discrimination of the motion space and motion-centric latent alignment to enable accurate mapping from text to the motion latent space. Second, we propose Reconstructive Error Guidance (REG), a testing-stage guidance mechanism that exploits the diffusion model's inherent self-correction ability to mitigate error propagation. At each denoising step, REG uses the motion reconstruction branch to reconstruct the previous estimate, reproducing the prior error patterns. By amplifying the residual between the current prediction and the reconstructed estimate, REG highlights the improvements in the current prediction. Extensive experiments demonstrate that RAM achieves significant improvements and state-of-the-art performance. Our code will be released.
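To make the REG mechanism concrete, the sketch below shows how it could slot into a sampling loop. It is a minimal PyTorch-style illustration assembled only from the description in the abstract; the function names (denoiser, recon_branch), the tensor arguments, and the guidance weight w_reg are assumptions, not the authors' released implementation.

```python
import torch

def reg_guided_estimate(denoiser, recon_branch, x_t, t, text_emb,
                        prev_x0_hat, w_reg=1.0):
    """One denoising step with Reconstructive Error Guidance (sketch).

    The co-trained motion reconstruction branch re-encodes and decodes the
    previous clean-motion estimate, reproducing its error pattern; the
    residual between the current prediction and that reconstruction is then
    amplified so that genuine improvements stand out. All names here are
    hypothetical, inferred from the abstract only.
    """
    # Current clean-motion estimate predicted by the diffusion model.
    x0_hat = denoiser(x_t, t, text_emb)

    # Reconstruct the previous step's estimate to reproduce its errors.
    with torch.no_grad():
        prev_recon = recon_branch(prev_x0_hat)

    # Amplify the residual between the new prediction and the reconstructed
    # previous estimate; w_reg = 0 recovers the unguided prediction.
    return x0_hat + w_reg * (x0_hat - prev_recon)
```

In a full sampler, this guided estimate would presumably replace the raw prediction before the usual DDPM/DDIM update that produces the next noisy sample.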
Related papers
- Causal Motion Diffusion Models for Autoregressive Motion Generation [19.61051102039212]
Causal Motion Diffusion Models (CMDM) is a unified framework for autoregressive motion generation. CMDM builds upon a Motion-Language-Aligned Causal VAE (MAC-VAE), which encodes motion sequences into temporally causal latent representations. Experiments on HumanML3D and SnapMoGen demonstrate CMDM outperforms existing diffusion and autoregressive models in both semantic fidelity and temporal smoothness.
arXiv Detail & Related papers (2026-02-26T03:58:25Z)
- IRG-MotionLLM: Interleaving Motion Generation, Assessment and Refinement for Text-to-Motion Generation [54.36300724708094]
Assessment and refinement tasks act as crucial bridges to enable bidirectional knowledge flow between understanding and generation. We introduce IRG-MotionLLM, the first model that seamlessly interleaves motion generation, assessment, and refinement to improve generation performance.
arXiv Detail & Related papers (2025-12-11T15:16:06Z)
- Real-Time Motion-Controllable Autoregressive Video Diffusion [79.32730467857535]
We propose AR-Drag, the first RL-enhanced few-step AR video diffusion model for real-time image-to-video generation with diverse motion control. We first fine-tune a base I2V model to support basic motion control, then further improve it via reinforcement learning with a trajectory-based reward model. Our design preserves the Markov property through a Self-Rollout learning mechanism and accelerates training by selectively reducing the denoising steps.
arXiv Detail & Related papers (2025-10-09T12:17:11Z)
- FlowMo: Variance-Based Flow Guidance for Coherent Motion in Video Generation [51.110607281391154]
FlowMo is a training-free guidance method for enhancing motion coherence in text-to-video models. It estimates motion coherence by measuring the patch-wise variance across the temporal dimension and guides the model to reduce this variance dynamically during sampling (a rough sketch of this kind of guidance appears after this list).
arXiv Detail & Related papers (2025-06-01T19:55:33Z)
- Multimodal LLM-Guided Semantic Correction in Text-to-Image Diffusion [52.315729095824906]
MLLM Semantic-Corrected Ping-Pong-Ahead Diffusion (PPAD) is a novel framework that introduces a Multimodal Large Language Model (MLLM) as a semantic observer during inference. It performs real-time analysis on intermediate generations, identifies latent semantic inconsistencies, and translates feedback into controllable signals that actively guide the remaining denoising steps. Extensive experiments demonstrate PPAD's significant improvements.
arXiv Detail & Related papers (2025-05-26T14:42:35Z)
- Absolute Coordinates Make Motion Generation Easy [8.153961351540834]
State-of-the-art text-to-motion generation models rely on the kinematic-aware, local-relative motion representation popularized by HumanML3D. We propose a radically simplified and long-abandoned alternative for text-to-motion generation: absolute joint coordinates in global space.
arXiv Detail & Related papers (2025-05-26T00:36:00Z)
- Generative Pre-trained Autoregressive Diffusion Transformer [74.25668109048418]
GPDiT is a Generative Pre-trained Autoregressive Diffusion Transformer that unifies the strengths of diffusion and autoregressive modeling for long-range video synthesis. It autoregressively predicts future latent frames using a diffusion loss, enabling natural modeling of motion dynamics.
arXiv Detail & Related papers (2025-05-12T08:32:39Z)
- ReCoM: Realistic Co-Speech Motion Generation with Recurrent Embedded Transformer [58.49950218437718]
We present ReCoM, an efficient framework for generating high-fidelity and generalizable human body motions synchronized with speech. The core innovation lies in the Recurrent Embedded Transformer (RET), which integrates Dynamic Embedding Regularization (DER) into a Vision Transformer (ViT) core architecture. To enhance model robustness, we incorporate the proposed DER strategy, which equips the model with dual capabilities of noise resistance and cross-domain generalization.
arXiv Detail & Related papers (2025-03-27T16:39:40Z)
- MotionStreamer: Streaming Motion Generation via Diffusion-based Autoregressive Model in Causal Latent Space [40.60429652169086]
Text-conditioned streaming motion generation requires us to predict the next-step human pose based on variable-length historical motions and incoming texts. Existing methods struggle to achieve streaming motion generation; e.g., diffusion models are constrained by pre-defined motion lengths. We propose MotionStreamer, a novel framework that incorporates a continuous causal latent space into a probabilistic autoregressive model.
arXiv Detail & Related papers (2025-03-19T17:32:24Z)
- Frequency-Guided Diffusion Model with Perturbation Training for Skeleton-Based Video Anomaly Detection [43.49146665908238]
Video anomaly detection (VAD) is a vital yet complex open-set task in computer vision. We introduce a novel frequency-guided diffusion model with perturbation training. We employ the 2D Discrete Cosine Transform (DCT) to separate high-frequency (local) and low-frequency (global) motion components.
arXiv Detail & Related papers (2024-12-04T05:43:53Z)
- BoDiffusion: Diffusing Sparse Observations for Full-Body Human Motion Synthesis [14.331548412833513]
Mixed reality applications require tracking the user's full-body motion to enable an immersive experience.
We propose BoDiffusion -- a generative diffusion model for motion synthesis to tackle this under-constrained reconstruction problem.
We present a time and space conditioning scheme that allows BoDiffusion to leverage sparse tracking inputs while generating smooth and realistic full-body motion sequences.
arXiv Detail & Related papers (2023-04-21T16:39:05Z)
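For the FlowMo entry above, the idea of measuring patch-wise variance over time and steering sampling to reduce it can be sketched in a few lines. This is a rough, training-free guidance sketch under assumed tensor shapes (video latents of shape T x C x H x W) and hypothetical hyperparameters; it is not FlowMo's actual implementation.

```python
import torch

def variance_guidance_step(latents, coherence_weight=0.1, patch_size=8):
    """Nudge video latents toward lower patch-wise temporal variance (sketch).

    `latents` is assumed to be a (T, C, H, W) tensor of predicted video
    latents with H and W divisible by `patch_size`; both hyperparameters
    are illustrative choices, not values from the paper.
    """
    latents = latents.detach().requires_grad_(True)
    t, c, h, w = latents.shape

    # Split the spatial dimensions into non-overlapping patches:
    # (T, C, H/p, p, W/p, p).
    patches = latents.reshape(t, c, h // patch_size, patch_size,
                              w // patch_size, patch_size)

    # Variance across the temporal dimension at every patch location,
    # averaged into a single motion-incoherence score.
    temporal_var = patches.var(dim=0).mean()

    # One gradient step on the latents that lowers that score.
    grad, = torch.autograd.grad(temporal_var, latents)
    return (latents - coherence_weight * grad).detach()
```

A correction like this would typically be applied to the intermediate prediction once per sampling step, before the scheduler produces the next latent.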