Listen, Denoise, Action! Audio-Driven Motion Synthesis with Diffusion Models
- URL: http://arxiv.org/abs/2211.09707v2
- Date: Tue, 16 May 2023 17:59:58 GMT
- Title: Listen, Denoise, Action! Audio-Driven Motion Synthesis with Diffusion Models
- Authors: Simon Alexanderson, Rajmund Nagy, Jonas Beskow, Gustav Eje Henter
- Abstract summary: We show that diffusion models are an excellent fit for synthesising human motion that co-occurs with audio.
We adapt the DiffWave architecture to model 3D pose sequences, putting Conformers in place of dilated convolutions for improved modelling power.
Experiments on gesture and dance generation confirm that the proposed method achieves top-of-the-line motion quality.
- Score: 22.000197530493445
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Diffusion models have experienced a surge of interest as highly expressive
yet efficiently trainable probabilistic models. We show that these models are
an excellent fit for synthesising human motion that co-occurs with audio, e.g.,
dancing and co-speech gesticulation, since motion is complex and highly
ambiguous given audio, calling for a probabilistic description. Specifically,
we adapt the DiffWave architecture to model 3D pose sequences, putting
Conformers in place of dilated convolutions for improved modelling power. We
also demonstrate control over motion style, using classifier-free guidance to
adjust the strength of the stylistic expression. Experiments on gesture and
dance generation confirm that the proposed method achieves top-of-the-line
motion quality, with distinctive styles whose expression can be made more or
less pronounced. We also synthesise path-driven locomotion using the same model
architecture. Finally, we generalise the guidance procedure to obtain
product-of-expert ensembles of diffusion models and demonstrate how these may
be used for, e.g., style interpolation, a contribution we believe is of
independent interest. See
https://www.speech.kth.se/research/listen-denoise-action/ for video examples,
data, and code.
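The guidance procedure described in the abstract can be illustrated with a short sketch. Below is a minimal, hypothetical Python/PyTorch illustration of classifier-free guidance generalised to a weighted product-of-experts over several style conditions; the `denoiser` signature, argument names, tensor shapes, and exact weighting are assumptions made for illustration, not the authors' released implementation (see the project page above for the official code).

```python
import torch

def guided_eps(denoiser, x_t, t, audio, styles, weights, guidance_scale=1.0):
    """Classifier-free guidance generalised to a weighted product of style experts (sketch)."""
    # Unconditional prediction: the style condition is dropped (style=None),
    # mirroring the condition dropout used when training for classifier-free guidance.
    eps_uncond = denoiser(x_t, t, audio, style=None)
    # One conditional prediction per style "expert".
    eps_styles = [denoiser(x_t, t, audio, style=s) for s in styles]
    # Each expert pulls the sample towards its own style; the weights interpolate
    # between styles, and guidance_scale makes the stylistic expression more or
    # less pronounced.
    eps = eps_uncond.clone()
    for w, eps_c in zip(weights, eps_styles):
        eps = eps + guidance_scale * w * (eps_c - eps_uncond)
    return eps

# Toy usage with a dummy denoiser; all shapes and arguments are placeholders.
if __name__ == "__main__":
    dummy = lambda x_t, t, audio, style: (
        x_t * 0.1 if style is None else x_t * 0.1 + 0.01 * style
    )
    x_t = torch.randn(1, 120, 65)     # (batch, frames, pose features)
    audio = torch.randn(1, 120, 80)   # e.g. per-frame audio features
    eps = guided_eps(dummy, x_t, t=500, audio=audio,
                     styles=[torch.tensor(0.0), torch.tensor(1.0)],
                     weights=[0.5, 0.5], guidance_scale=2.0)
    print(eps.shape)
```

With a single style and weights=[1.0], this reduces to ordinary classifier-free guidance, where a guidance_scale above 1 strengthens the stylistic expression; splitting the weights between two styles gives a simple form of style interpolation in the spirit of the product-of-experts ensembles mentioned above.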
Related papers
- From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations [107.88375243135579]
Given speech audio, we output multiple possibilities of gestural motion for an individual, including face, body, and hands.
We visualize the generated motion using highly photorealistic avatars that can express crucial nuances in gestures.
Experiments show our model generates appropriate and diverse gestures, outperforming both diffusion- and VQ-only methods.
arXiv Detail & Related papers (2024-01-03T18:55:16Z)
- FaceTalk: Audio-Driven Motion Diffusion for Neural Parametric Head Models [85.16273912625022]
We introduce FaceTalk, a novel generative approach designed for synthesizing high-fidelity 3D motion sequences of talking human heads from an audio signal.
To the best of our knowledge, this is the first work to propose a generative approach for realistic and high-quality motion synthesis of human heads.
arXiv Detail & Related papers (2023-12-13T19:01:07Z)
- SpeechAct: Towards Generating Whole-body Motion from Speech [33.10601371020488]
This paper addresses the problem of generating whole-body motion from speech.
We present a novel hybrid point representation to achieve accurate and continuous motion generation.
We also propose a contrastive motion learning method to encourage the model to produce more distinctive representations.
arXiv Detail & Related papers (2023-11-29T07:57:30Z)
- Motion-Conditioned Diffusion Model for Controllable Video Synthesis [75.367816656045]
We introduce MCDiff, a conditional diffusion model that generates a video from a starting image frame and a set of strokes.
We show that MCDiff achieves state-of-the-art visual quality in stroke-guided controllable video synthesis.
arXiv Detail & Related papers (2023-04-27T17:59:32Z)
- DiffMotion: Speech-Driven Gesture Synthesis Using Denoising Diffusion Model [3.8084817124151726]
This paper presents DiffMotion, a novel speech-driven gesture synthesis architecture based on diffusion models.
The model comprises an autoregressive temporal encoder and a denoising diffusion probabilistic module.
Compared with baselines, objective and subjective evaluations confirm that our approach can produce natural and diverse gesticulation.
arXiv Detail & Related papers (2023-01-24T14:44:03Z)
- Generating Holistic 3D Human Motion from Speech [97.11392166257791]
We build a high-quality dataset of 3D holistic body meshes with synchronous speech.
We then define a novel speech-to-motion generation framework in which the face, body, and hands are modeled separately.
arXiv Detail & Related papers (2022-12-08T17:25:19Z)
- Diffusion Motion: Generate Text-Guided 3D Human Motion by Diffusion Model [7.381316531478522]
We propose a simple and novel method for generating 3D human motion from complex natural language sentences.
We use the Denoising Diffusion Probabilistic Model to generate diverse motion results under the guidance of texts.
Our experiments demonstrate that our model achieves competitive quantitative results on the HumanML3D test set and can generate more visually natural and diverse examples.
arXiv Detail & Related papers (2022-10-22T00:41:17Z)
- Denoising Diffusion Probabilistic Models for Styled Walking Synthesis [9.789705536694665]
We propose a framework using the denoising diffusion probabilistic model (DDPM) to synthesize styled human motions.
Experimental results show that our system can generate high-quality and diverse walking motions.
arXiv Detail & Related papers (2022-09-29T14:45:33Z)
- Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion [89.01668641930206]
We present a framework for modeling interactional communication in dyadic conversations.
We autoregressively output multiple possibilities of corresponding listener motion.
Our method organically captures the multimodal and non-deterministic nature of nonverbal dyadic interactions.
arXiv Detail & Related papers (2022-04-18T17:58:04Z)
- Freeform Body Motion Generation from Speech [53.50388964591343]
Body motion generation from speech is inherently difficult due to the non-deterministic mapping from speech to body motions.
We introduce a novel freeform motion generation model (FreeMo) equipped with a two-stream architecture.
Experiments demonstrate superior performance over several baselines.
arXiv Detail & Related papers (2022-03-04T13:03:22Z)