Listen, Denoise, Action! Audio-Driven Motion Synthesis with Diffusion Models
- URL: http://arxiv.org/abs/2211.09707v2
- Date: Tue, 16 May 2023 17:59:58 GMT
- Title: Listen, Denoise, Action! Audio-Driven Motion Synthesis with Diffusion Models
- Authors: Simon Alexanderson, Rajmund Nagy, Jonas Beskow, Gustav Eje Henter
- Abstract summary: We show that diffusion models are an excellent fit for synthesising human motion that co-occurs with audio.
We adapt the DiffWave architecture to model 3D pose sequences, putting Conformers in place of dilated convolutions for improved modelling power.
Experiments on gesture and dance generation confirm that the proposed method achieves top-of-the-line motion quality.
- Score: 22.000197530493445
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Diffusion models have experienced a surge of interest as highly expressive
yet efficiently trainable probabilistic models. We show that these models are
an excellent fit for synthesising human motion that co-occurs with audio, e.g.,
dancing and co-speech gesticulation, since motion is complex and highly
ambiguous given audio, calling for a probabilistic description. Specifically,
we adapt the DiffWave architecture to model 3D pose sequences, putting
Conformers in place of dilated convolutions for improved modelling power. We
also demonstrate control over motion style, using classifier-free guidance to
adjust the strength of the stylistic expression. Experiments on gesture and
dance generation confirm that the proposed method achieves top-of-the-line
motion quality, with distinctive styles whose expression can be made more or
less pronounced. We also synthesise path-driven locomotion using the same model
architecture. Finally, we generalise the guidance procedure to obtain
product-of-expert ensembles of diffusion models and demonstrate how these may
be used for, e.g., style interpolation, a contribution we believe is of
independent interest. See
https://www.speech.kth.se/research/listen-denoise-action/ for video examples,
data, and code.
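The guidance procedure described in the abstract can be illustrated with a short sketch. Below is a minimal, hypothetical Python/PyTorch illustration of classifier-free guidance generalised to a weighted product-of-experts over several style conditions; the `denoiser` signature, argument names, tensor shapes, and exact weighting are assumptions made for illustration, not the authors' released implementation (see the project page above for the official code).

```python
import torch

def guided_eps(denoiser, x_t, t, audio, styles, weights, guidance_scale=1.0):
    """Classifier-free guidance generalised to a weighted product of style experts (sketch)."""
    # Unconditional prediction: the style condition is dropped (style=None),
    # mirroring the condition dropout used when training for classifier-free guidance.
    eps_uncond = denoiser(x_t, t, audio, style=None)
    # One conditional prediction per style "expert".
    eps_styles = [denoiser(x_t, t, audio, style=s) for s in styles]
    # Each expert pulls the sample towards its own style; the weights interpolate
    # between styles, and guidance_scale makes the stylistic expression more or
    # less pronounced.
    eps = eps_uncond.clone()
    for w, eps_c in zip(weights, eps_styles):
        eps = eps + guidance_scale * w * (eps_c - eps_uncond)
    return eps

# Toy usage with a dummy denoiser; all shapes and arguments are placeholders.
if __name__ == "__main__":
    dummy = lambda x_t, t, audio, style: (
        x_t * 0.1 if style is None else x_t * 0.1 + 0.01 * style
    )
    x_t = torch.randn(1, 120, 65)     # (batch, frames, pose features)
    audio = torch.randn(1, 120, 80)   # e.g. per-frame audio features
    eps = guided_eps(dummy, x_t, t=500, audio=audio,
                     styles=[torch.tensor(0.0), torch.tensor(1.0)],
                     weights=[0.5, 0.5], guidance_scale=2.0)
    print(eps.shape)
```

With a single style and weights=[1.0], this reduces to ordinary classifier-free guidance, where a guidance_scale above 1 strengthens the stylistic expression; splitting the weights between two styles gives a simple form of style interpolation in the spirit of the product-of-experts ensembles mentioned above.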
Related papers
- From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations [107.88375243135579]
Given speech audio, we output multiple possibilities of gestural motion for an individual, including face, body, and hands.
We visualize the generated motion using highly photorealistic avatars that can express crucial nuances in gestures.
Experiments show our model generates appropriate and diverse gestures, outperforming both diffusion- and VQ-only methods.
arXiv Detail & Related papers (2024-01-03T18:55:16Z)
- FaceTalk: Audio-Driven Motion Diffusion for Neural Parametric Head Models [85.16273912625022]
We introduce FaceTalk, a novel generative approach designed for synthesizing high-fidelity 3D motion sequences of talking human heads from an audio signal.
To the best of our knowledge, this is the first work to propose a generative approach for realistic and high-quality motion synthesis of human heads.
arXiv Detail & Related papers (2023-12-13T19:01:07Z)
- SpeechAct: Towards Generating Whole-body Motion from Speech [33.10601371020488]
This paper addresses the problem of generating whole-body motion from speech.
We present a novel hybrid point representation to achieve accurate and continuous motion generation.
We also propose a contrastive motion learning method to encourage the model to produce more distinctive representations.
arXiv Detail & Related papers (2023-11-29T07:57:30Z)
- Motion-Conditioned Diffusion Model for Controllable Video Synthesis [75.367816656045]
We introduce MCDiff, a conditional diffusion model that generates a video from a starting image frame and a set of strokes.
We show that MCDiff achieves state-of-the-art visual quality in stroke-guided controllable video synthesis.
arXiv Detail & Related papers (2023-04-27T17:59:32Z)
- DiffMotion: Speech-Driven Gesture Synthesis Using Denoising Diffusion Model [3.8084817124151726]
This paper presents DiffMotion, a novel speech-driven gesture synthesis architecture based on diffusion models.
The model comprises an autoregressive temporal encoder and a denoising diffusion probabilistic module.
Compared with baselines, objective and subjective evaluations confirm that our approach can produce natural and diverse gesticulation.
arXiv Detail & Related papers (2023-01-24T14:44:03Z)
- Generating Holistic 3D Human Motion from Speech [97.11392166257791]
We build a high-quality dataset of 3D holistic body meshes with synchronous speech.
We then define a novel speech-to-motion generation framework in which the face, body, and hands are modeled separately.
arXiv Detail & Related papers (2022-12-08T17:25:19Z)
- Diffusion Motion: Generate Text-Guided 3D Human Motion by Diffusion Model [7.381316531478522]
We propose a simple and novel method for generating 3D human motion from complex natural language sentences.
We use the Denoising Diffusion Probabilistic Model to generate diverse motion results under the guidance of texts.
Our experiments demonstrate that our model achieves competitive quantitative results on the HumanML3D test set and can generate more visually natural and diverse examples.
arXiv Detail & Related papers (2022-10-22T00:41:17Z)
- Denoising Diffusion Probabilistic Models for Styled Walking Synthesis [9.789705536694665]
We propose a framework using the denoising diffusion probabilistic model (DDPM) to synthesize styled human motions.
Experimental results show that our system can generate high-quality and diverse walking motions.
arXiv Detail & Related papers (2022-09-29T14:45:33Z)
- Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion [89.01668641930206]
We present a framework for modeling interactional communication in dyadic conversations.
We autoregressively output multiple possibilities of corresponding listener motion.
Our method organically captures the multimodal and non-deterministic nature of nonverbal dyadic interactions.
arXiv Detail & Related papers (2022-04-18T17:58:04Z)
- Freeform Body Motion Generation from Speech [53.50388964591343]
Body motion generation from speech is inherently difficult due to the non-deterministic mapping from speech to body motions.
We introduce a novel freeform motion generation model (FreeMo) equipped with a two-stream architecture.
Experiments demonstrate superior performance over several baselines.
arXiv Detail & Related papers (2022-03-04T13:03:22Z)