Interactive Character Control with Auto-Regressive Motion Diffusion Models
- URL: http://arxiv.org/abs/2306.00416v4
- Date: Fri, 16 Aug 2024 01:07:21 GMT
- Title: Interactive Character Control with Auto-Regressive Motion Diffusion Models
- Authors: Yi Shi, Jingbo Wang, Xuekun Jiang, Bingkun Lin, Bo Dai, Xue Bin Peng,
- Abstract summary: We propose A-MDM (Auto-regressive Motion Diffusion Model) for real-time motion synthesis.
Our conditional diffusion model takes an initial pose as input, and auto-regressively generates successive motion frames conditioned on previous frame.
We introduce a suite of techniques for incorporating interactive controls into A-MDM, such as task-oriented sampling, in-painting, and hierarchical reinforcement learning.
- Score: 18.727066177880708
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Real-time character control is an essential component for interactive experiences, with a broad range of applications, including physics simulations, video games, and virtual reality. The success of diffusion models for image synthesis has led to the use of these models for motion synthesis. However, the majority of these motion diffusion models are primarily designed for offline applications, where space-time models are used to synthesize an entire sequence of frames simultaneously with a pre-specified length. To enable real-time motion synthesis with diffusion model that allows time-varying controls, we propose A-MDM (Auto-regressive Motion Diffusion Model). Our conditional diffusion model takes an initial pose as input, and auto-regressively generates successive motion frames conditioned on the previous frame. Despite its streamlined network architecture, which uses simple MLPs, our framework is capable of generating diverse, long-horizon, and high-fidelity motion sequences. Furthermore, we introduce a suite of techniques for incorporating interactive controls into A-MDM, such as task-oriented sampling, in-painting, and hierarchical reinforcement learning. These techniques enable a pre-trained A-MDM to be efficiently adapted for a variety of new downstream tasks. We conduct a comprehensive suite of experiments to demonstrate the effectiveness of A-MDM, and compare its performance against state-of-the-art auto-regressive methods.
Related papers
- Large Body Language Models [1.9797215742507548]
We introduce Large Body Language Models (LBLMs) and present LBLM-AVA, a novel LBLM architecture that combines a Transformer-XL large language model with a parallelized diffusion model to generate human-like gestures from multimodal inputs (text, audio, and video)
LBLM-AVA achieves state-of-the-art performance in generating lifelike and contextually appropriate gestures with a 30% reduction in Freche's Gesture Distance (FGD) and a 25% improvement in Freche's Inception Distance compared to existing approaches.
arXiv Detail & Related papers (2024-10-21T21:48:24Z) - Generalizable Implicit Motion Modeling for Video Frame Interpolation [51.966062283735596]
Motion is critical in flow-based Video Frame Interpolation (VFI)
General Implicit Motion Modeling (IMM) is a novel and effective approach to motion modeling VFI.
Our GIMM can be smoothly integrated with existing flow-based VFI works without further modifications.
arXiv Detail & Related papers (2024-07-11T17:13:15Z) - Shape Conditioned Human Motion Generation with Diffusion Model [0.0]
We propose a Shape-conditioned Motion Diffusion model (SMD), which enables the generation of motion sequences directly in mesh format.
We also propose a Spectral-Temporal Autoencoder (STAE) to leverage cross-temporal dependencies within the spectral domain.
arXiv Detail & Related papers (2024-05-10T19:06:41Z) - MambaTalk: Efficient Holistic Gesture Synthesis with Selective State Space Models [22.044020889631188]
We introduce MambaTalk, enhancing gesture diversity and rhythm through multimodal integration.
Our method matches or exceeds the performance of state-of-the-art models.
arXiv Detail & Related papers (2024-03-14T15:10:54Z) - Motion Flow Matching for Human Motion Synthesis and Editing [75.13665467944314]
We propose emphMotion Flow Matching, a novel generative model for human motion generation featuring efficient sampling and effectiveness in motion editing applications.
Our method reduces the sampling complexity from thousand steps in previous diffusion models to just ten steps, while achieving comparable performance in text-to-motion and action-to-motion generation benchmarks.
arXiv Detail & Related papers (2023-12-14T12:57:35Z) - AAMDM: Accelerated Auto-regressive Motion Diffusion Model [10.94879097495769]
This paper introduces the Accelerated Auto-regressive Motion Diffusion Model (AAMDM)
AAMDM is a novel motion synthesis framework designed to achieve quality, diversity, and efficiency all together.
We show that AAMDM outperforms existing methods in motion quality, diversity, and runtime efficiency.
arXiv Detail & Related papers (2023-12-02T23:52:21Z) - TrackDiffusion: Tracklet-Conditioned Video Generation via Diffusion Models [75.20168902300166]
We propose TrackDiffusion, a novel video generation framework affording fine-grained trajectory-conditioned motion control.
A pivotal component of TrackDiffusion is the instance enhancer, which explicitly ensures inter-frame consistency of multiple objects.
generated video sequences by our TrackDiffusion can be used as training data for visual perception models.
arXiv Detail & Related papers (2023-12-01T15:24:38Z) - Motion-Conditioned Diffusion Model for Controllable Video Synthesis [75.367816656045]
We introduce MCDiff, a conditional diffusion model that generates a video from a starting image frame and a set of strokes.
We show that MCDiff achieves the state-the-art visual quality in stroke-guided controllable video synthesis.
arXiv Detail & Related papers (2023-04-27T17:59:32Z) - Controllable Motion Synthesis and Reconstruction with Autoregressive
Diffusion Models [18.50942770933098]
MoDiff is an autoregressive probabilistic diffusion model over motion sequences conditioned on control contexts of other modalities.
Our model integrates a cross-modal Transformer encoder and a Transformer-based decoder, which are found effective in capturing temporal correlations in motion and control modalities.
arXiv Detail & Related papers (2023-04-03T08:17:08Z) - Executing your Commands via Motion Diffusion in Latent Space [51.64652463205012]
We propose a Motion Latent-based Diffusion model (MLD) to produce vivid motion sequences conforming to the given conditional inputs.
Our MLD achieves significant improvements over the state-of-the-art methods among extensive human motion generation tasks.
arXiv Detail & Related papers (2022-12-08T03:07:00Z) - Conditional Temporal Variational AutoEncoder for Action Video Prediction [66.63038712306606]
ACT-VAE predicts pose sequences for an action clips from a single input image.
When connected with a plug-and-play Pose-to-Image (P2I) network, ACT-VAE can synthesize image sequences.
arXiv Detail & Related papers (2021-08-12T10:59:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.