MotionStreamer: Streaming Motion Generation via Diffusion-based Autoregressive Model in Causal Latent Space
- URL: http://arxiv.org/abs/2503.15451v2
- Date: Wed, 16 Apr 2025 12:35:53 GMT
- Title: MotionStreamer: Streaming Motion Generation via Diffusion-based Autoregressive Model in Causal Latent Space
- Authors: Lixing Xiao, Shunlin Lu, Huaijin Pi, Ke Fan, Liang Pan, Yueer Zhou, Ziyong Feng, Xiaowei Zhou, Sida Peng, Jingbo Wang,
- Abstract summary: Text-conditioned streaming motion generation requires us to predict the next-step human pose based on variable-length historical motions and incoming texts. Existing methods struggle to achieve streaming motion generation, e.g., diffusion models are constrained by pre-defined motion lengths. We propose MotionStreamer, a novel framework that incorporates a continuous causal latent space into a probabilistic autoregressive model.
- Score: 40.60429652169086
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper addresses the challenge of text-conditioned streaming motion generation, which requires us to predict the next-step human pose based on variable-length historical motions and incoming texts. Existing methods struggle to achieve streaming motion generation, e.g., diffusion models are constrained by pre-defined motion lengths, while GPT-based methods suffer from delayed response and error accumulation problem due to discretized non-causal tokenization. To solve these problems, we propose MotionStreamer, a novel framework that incorporates a continuous causal latent space into a probabilistic autoregressive model. The continuous latents mitigate information loss caused by discretization and effectively reduce error accumulation during long-term autoregressive generation. In addition, by establishing temporal causal dependencies between current and historical motion latents, our model fully utilizes the available information to achieve accurate online motion decoding. Experiments show that our method outperforms existing approaches while offering more applications, including multi-round generation, long-term generation, and dynamic motion composition. Project Page: https://zju3dv.github.io/MotionStreamer/
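As a concrete illustration of the pipeline the abstract describes (autoregressive prediction of the next continuous motion latent, conditioned on text and a variable-length causal history, with a diffusion-style head), here is a minimal PyTorch sketch. The module sizes, the schematic denoising loop, and the omitted causal motion decoder are assumptions for illustration, not the authors' implementation:

```python
import torch
import torch.nn as nn

class NextLatentDiffusionHead(nn.Module):
    """Schematic denoising head: refines a noisy candidate next latent,
    conditioned on the autoregressive context vector (not a faithful DDPM)."""
    def __init__(self, latent_dim=256, ctx_dim=512, steps=50):
        super().__init__()
        self.latent_dim, self.steps = latent_dim, steps
        self.net = nn.Sequential(
            nn.Linear(latent_dim + ctx_dim + 1, 512), nn.SiLU(),
            nn.Linear(512, latent_dim),
        )

    @torch.no_grad()
    def sample(self, ctx):
        z = torch.randn(ctx.size(0), self.latent_dim)          # start from Gaussian noise
        for s in reversed(range(self.steps)):                   # crude iterative refinement
            t = torch.full((ctx.size(0), 1), s / self.steps)
            z = z - self.net(torch.cat([z, ctx, t], dim=-1)) / self.steps
        return z

class StreamingMotionAR(nn.Module):
    """Causal transformer over [text feature, past motion latents]."""
    def __init__(self, latent_dim=256, ctx_dim=512):
        super().__init__()
        self.embed = nn.Linear(latent_dim, ctx_dim)
        layer = nn.TransformerEncoderLayer(ctx_dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.head = NextLatentDiffusionHead(latent_dim, ctx_dim)

    @torch.no_grad()
    def step(self, text_feat, past_latents):
        # text_feat: (B, 1, ctx_dim); past_latents: (B, T, latent_dim), T may vary
        seq = torch.cat([text_feat, self.embed(past_latents)], dim=1)
        T = seq.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        ctx = self.backbone(seq, mask=causal)[:, -1]            # last position sees all history
        return self.head.sample(ctx)                            # next continuous motion latent

model = StreamingMotionAR()
text = torch.randn(1, 1, 512)          # placeholder text embedding
history = torch.randn(1, 4, 256)       # variable-length latent history
print(model.step(text, history).shape) # torch.Size([1, 256])
```

In a streaming setting, each predicted latent would be appended to the history and decoded to poses online by a causal motion decoder, which is omitted here.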
Related papers
- FlowLoss: Dynamic Flow-Conditioned Loss Strategy for Video Diffusion Models [9.469635938429647]
Video Diffusion Models (VDMs) can generate high-quality videos, but often struggle with producing temporally coherent motion.
We propose FlowLoss, which directly compares flow fields extracted from generated and ground-truth videos.
Our findings offer practical insights for incorporating motion-based supervision into noise-conditioned generative models.
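A minimal sketch of the stated idea (comparing flow fields extracted from generated and ground-truth videos); the frame pairing, the L1 distance, and the pluggable `flow_fn` estimator (e.g. a pretrained RAFT model) are assumptions, not the paper's implementation:

```python
import torch

def flow_conditioned_loss(gen_video, gt_video, flow_fn):
    """gen_video, gt_video: (B, T, C, H, W) covering the same frame range.

    Computes flow between consecutive frames of each video and penalizes the
    distance between the two flow fields, so the generated video is pushed to
    reproduce the ground-truth motion rather than just per-frame appearance.
    """
    losses = []
    for t in range(gen_video.size(1) - 1):
        flow_gen = flow_fn(gen_video[:, t], gen_video[:, t + 1])  # (B, 2, H, W)
        flow_gt = flow_fn(gt_video[:, t], gt_video[:, t + 1])
        losses.append((flow_gen - flow_gt).abs().mean())
    return torch.stack(losses).mean()

# Trivial stand-in flow estimator (frame difference copied to 2 channels),
# only so the snippet runs end to end.
def dummy_flow(a, b):
    diff = (b - a).mean(dim=1, keepdim=True)
    return torch.cat([diff, diff], dim=1)

gen = torch.rand(2, 5, 3, 64, 64)
gt = torch.rand(2, 5, 3, 64, 64)
print(flow_conditioned_loss(gen, gt, dummy_flow))
```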
arXiv Detail & Related papers (2025-04-20T08:22:29Z)
- Dynamical Diffusion: Learning Temporal Dynamics with Diffusion Models [71.63194926457119]
We introduce Dynamical Diffusion (DyDiff), a theoretically sound framework that incorporates temporally aware forward and reverse processes.
Experiments across scientific spatiotemporal forecasting, video prediction, and time series forecasting demonstrate that Dynamical Diffusion consistently improves performance in temporal predictive tasks.
arXiv Detail & Related papers (2025-03-02T16:10:32Z)
- ACDiT: Interpolating Autoregressive Conditional Modeling and Diffusion Transformer [95.80384464922147]
ACDiT is a blockwise Conditional Diffusion Transformer.
It offers a flexible interpolation between token-wise autoregression and full-sequence diffusion.
We show that ACDiT performs best among all autoregressive baselines on image and video generation tasks.
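A toy sketch of that interpolation: each block is denoised jointly while conditioning on the already-generated prefix, so block_size=1 reduces to token-wise autoregression and block_size equal to the sequence length reduces to full-sequence diffusion. The tiny MLP denoiser and the mean-pooled prefix summary are assumptions, not ACDiT's architecture:

```python
import torch
import torch.nn as nn

dim = 64
denoiser = nn.Sequential(nn.Linear(dim * 2 + 1, 128), nn.SiLU(), nn.Linear(128, dim))

@torch.no_grad()
def generate(seq_len=16, block_size=4, n_denoise=20):
    prefix_summary = torch.zeros(1, dim)              # summary of the clean prefix
    out = []
    for _ in range(seq_len // block_size):
        x = torch.randn(block_size, dim)              # noisy block
        for s in reversed(range(n_denoise)):          # denoise the whole block jointly
            t = torch.full((block_size, 1), s / n_denoise)
            cond = prefix_summary.expand(block_size, -1)
            x = x - denoiser(torch.cat([x, cond, t], dim=-1)) / n_denoise
        out.append(x)
        prefix_summary = torch.cat(out).mean(dim=0, keepdim=True)
    return torch.cat(out)                             # (seq_len, dim)

print(generate().shape)   # torch.Size([16, 64])
```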
arXiv Detail & Related papers (2024-12-10T18:13:20Z)
- DisCoRD: Discrete Tokens to Continuous Motion via Rectified Flow Decoding [29.643549839940025]
We introduce DisCoRD: Discrete Tokens to Continuous Motion via Rectified Flow Decoding, a novel method to decode discrete motion tokens in the continuous, raw motion space.
Our core idea is to frame token decoding as a conditional generation task, ensuring that DisCoRD captures fine-grained dynamics and smoother, more natural motions.
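A hedged sketch of that framing: a rectified-flow decoder, conditioned on the discrete motion tokens, integrates a learned velocity field from noise to continuous motion. The network shape, the 263-dimensional motion features, and the Euler integrator are assumptions, not DisCoRD's code:

```python
import torch
import torch.nn as nn

class RectifiedFlowDecoder(nn.Module):
    def __init__(self, vocab_size=512, motion_dim=263, hidden=512):
        super().__init__()
        self.motion_dim = motion_dim
        self.token_emb = nn.Embedding(vocab_size, hidden)
        self.vel = nn.Sequential(
            nn.Linear(motion_dim + hidden + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, motion_dim),
        )

    @torch.no_grad()
    def decode(self, tokens, n_steps=10):
        # tokens: (B, T) discrete motion codes -> (B, T, motion_dim) continuous motion
        cond = self.token_emb(tokens)                               # per-token conditioning
        x = torch.randn(*tokens.shape, self.motion_dim)             # start from noise
        dt = 1.0 / n_steps
        for i in range(n_steps):
            t = torch.full((*tokens.shape, 1), i * dt)
            x = x + dt * self.vel(torch.cat([x, cond, t], dim=-1))  # Euler step along learned velocity
        return x

decoder = RectifiedFlowDecoder()
tokens = torch.randint(0, 512, (1, 8))   # e.g. codes from a VQ motion tokenizer (assumption)
print(decoder.decode(tokens).shape)      # torch.Size([1, 8, 263])
```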
arXiv Detail & Related papers (2024-11-29T07:54:56Z)
- DartControl: A Diffusion-Based Autoregressive Motion Model for Real-Time Text-Driven Motion Control [12.465927271402442]
Text-conditioned human motion generation allows for user interaction through natural language.
DartControl is a Diffusion-based Autoregressive motion primitive model for Real-time Text-driven motion control.
Our model effectively learns a compact motion primitive space jointly conditioned on motion history and text inputs.
arXiv Detail & Related papers (2024-10-07T17:58:22Z)
- Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency [15.841490425454344]
We propose an end-to-end audio-only conditioned video diffusion model named Loopy.
Specifically, we designed an inter- and intra-clip temporal module and an audio-to-latents module, enabling the model to leverage long-term motion information.
arXiv Detail & Related papers (2024-09-04T11:55:14Z)
- Lagrangian Motion Fields for Long-term Motion Generation [32.548139921363756]
We introduce the concept of Lagrangian Motion Fields, specifically designed for long-term motion generation.
By treating each joint as a Lagrangian particle with uniform velocity over short intervals, our approach condenses motion representations into a series of "supermotions".
Our solution is versatile and lightweight, eliminating the need for neural network preprocessing.
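A small sketch of the stated representation: each joint is treated as a particle with a uniform velocity over a short interval, so a motion collapses into per-interval start poses and velocities ("supermotions") and can be re-expanded by linear extrapolation. The interval length and the reconstruction rule are assumptions, not the paper's exact formulation:

```python
import numpy as np

def to_supermotions(joints, interval=4):
    """joints: (T, J, 3) joint positions -> (starts, velocities), one pair per interval."""
    T = (joints.shape[0] // interval) * interval
    joints = joints[:T].reshape(-1, interval, *joints.shape[1:])  # (N, interval, J, 3)
    starts = joints[:, 0]                                         # (N, J, 3)
    vels = (joints[:, -1] - joints[:, 0]) / (interval - 1)        # uniform velocity per interval
    return starts, vels

def from_supermotions(starts, vels, interval=4):
    """Reconstruct frames by advancing each joint linearly with its constant velocity."""
    steps = np.arange(interval)[None, :, None, None]              # (1, interval, 1, 1)
    frames = starts[:, None] + steps * vels[:, None]              # (N, interval, J, 3)
    return frames.reshape(-1, *starts.shape[1:])

motion = np.cumsum(np.random.randn(32, 22, 3) * 0.01, axis=0)     # toy trajectory, 22 joints
starts, vels = to_supermotions(motion)
recon = from_supermotions(starts, vels)
print(starts.shape, vels.shape, recon.shape)   # (8, 22, 3) (8, 22, 3) (32, 22, 3)
```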
arXiv Detail & Related papers (2024-09-03T01:38:06Z)
- RecMoDiffuse: Recurrent Flow Diffusion for Human Motion Generation [5.535590461577558]
RecMoDiffuse is a new recurrent diffusion formulation for temporal modelling.
We demonstrate the effectiveness of RecMoDiffuse in the temporal modelling of human motion.
arXiv Detail & Related papers (2024-06-11T11:25:37Z)
- Motion Flow Matching for Human Motion Synthesis and Editing [75.13665467944314]
We propose Motion Flow Matching, a novel generative model for human motion generation featuring efficient sampling and effectiveness in motion editing applications.
Our method reduces the sampling complexity from a thousand steps in previous diffusion models to just ten steps, while achieving comparable performance in text-to-motion and action-to-motion generation benchmarks.
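A minimal sketch of why flow matching needs so few steps: sampling integrates a learned velocity field from noise to data with a handful of Euler steps instead of a long denoising chain. The toy MLP velocity field and the text conditioning are assumptions, not the paper's model:

```python
import torch
import torch.nn as nn

motion_dim, text_dim = 263, 512
velocity_field = nn.Sequential(
    nn.Linear(motion_dim + text_dim + 1, 512), nn.SiLU(),
    nn.Linear(512, motion_dim),
)

@torch.no_grad()
def sample_motion(text_emb, n_frames=60, n_steps=10):
    """Integrate dx/dt = v(x, t, text) from t=0 (noise) to t=1 (motion)."""
    x = torch.randn(n_frames, motion_dim)
    cond = text_emb.expand(n_frames, -1)
    dt = 1.0 / n_steps
    for i in range(n_steps):                      # only ~10 network evaluations
        t = torch.full((n_frames, 1), i * dt)
        x = x + dt * velocity_field(torch.cat([x, cond, t], dim=-1))
    return x                                      # (n_frames, motion_dim)

motion = sample_motion(torch.randn(1, text_dim))
print(motion.shape)   # torch.Size([60, 263])
```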
arXiv Detail & Related papers (2023-12-14T12:57:35Z)
- EMDM: Efficient Motion Diffusion Model for Fast and High-Quality Motion Generation [57.539634387672656]
Current state-of-the-art generative diffusion models have produced impressive results but struggle to achieve fast generation without sacrificing quality.
We introduce Efficient Motion Diffusion Model (EMDM) for fast and high-quality human motion generation.
arXiv Detail & Related papers (2023-12-04T18:58:38Z)
- Synthesizing Long-Term Human Motions with Diffusion Models via Coherent Sampling [74.62570964142063]
Text-to-motion generation has gained increasing attention, but most existing methods are limited to generating short-term motions.
We propose a novel approach that utilizes a past-conditioned diffusion model with two optional coherent sampling methods.
Our proposed method is capable of generating compositional and coherent long-term 3D human motions controlled by a user-instructed long text stream.
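A rough sketch of that recipe under stated assumptions: one short segment is generated per text instruction by a past-conditioned denoising loop, each segment conditioned on the tail frames of the previous one, so a streamed list of texts yields one coherent long motion. Segment length, overlap, and the toy denoiser are placeholders, not the paper's sampling methods:

```python
import torch
import torch.nn as nn

motion_dim, ctx_frames, seg_len = 263, 8, 40
denoiser = nn.Sequential(nn.Linear(motion_dim * 2 + 512 + 1, 512), nn.SiLU(),
                         nn.Linear(512, motion_dim))

@torch.no_grad()
def generate_segment(text_emb, past_tail, n_denoise=25):
    past = past_tail.mean(dim=0, keepdim=True).expand(seg_len, -1)  # condition on recent history
    x = torch.randn(seg_len, motion_dim)
    for s in reversed(range(n_denoise)):
        t = torch.full((seg_len, 1), s / n_denoise)
        x = x - denoiser(torch.cat([x, past, text_emb.expand(seg_len, -1), t], -1)) / n_denoise
    return x

@torch.no_grad()
def generate_long(text_stream):
    motion = torch.zeros(ctx_frames, motion_dim)        # neutral start
    for text_emb in text_stream:                        # one segment per instruction
        seg = generate_segment(text_emb, motion[-ctx_frames:])
        motion = torch.cat([motion, seg])
    return motion

long_motion = generate_long([torch.randn(1, 512) for _ in range(3)])
print(long_motion.shape)   # torch.Size([128, 263])
```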
arXiv Detail & Related papers (2023-08-03T16:18:32Z)
- LaMD: Latent Motion Diffusion for Image-Conditional Video Generation [63.34574080016687]
The latent motion diffusion (LaMD) framework consists of a motion-decomposed video autoencoder and a diffusion-based motion generator.
LaMD generates high-quality videos on various benchmark datasets, including BAIR, Landscape, NATOPS, MUG and CATER-GEN.
arXiv Detail & Related papers (2023-04-23T10:32:32Z)