DiffMotion: Speech-Driven Gesture Synthesis Using Denoising Diffusion Model
- URL: http://arxiv.org/abs/2301.10047v1
- Date: Tue, 24 Jan 2023 14:44:03 GMT
- Title: DiffMotion: Speech-Driven Gesture Synthesis Using Denoising Diffusion Model
- Authors: Fan Zhang, Naye Ji, Fuxing Gao, Yongping Li
- Abstract summary: This paper presents DiffMotion, a novel speech-driven gesture synthesis architecture based on diffusion models.
The model comprises an autoregressive temporal encoder and a denoising diffusion probabilistic module.
Compared with baselines, objective and subjective evaluations confirm that our approach can produce natural and diverse gesticulation.
- Score: 3.8084817124151726
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speech-driven gesture synthesis is a field of growing interest in virtual
human creation. However, a critical challenge is the inherently intricate
one-to-many mapping between speech and gestures. Previous studies have explored
generative models and achieved significant progress; nevertheless, most
synthesized gestures still appear noticeably less natural than real human
motion. This paper presents DiffMotion, a novel speech-driven gesture synthesis
architecture based on diffusion models. The model comprises an autoregressive
temporal encoder and a denoising diffusion probabilistic module. The encoder
extracts the temporal context of the speech input and historical gestures. The
diffusion module learns a parameterized Markov chain that gradually converts a
simple distribution into a complex one and generates gestures according to the
accompanying speech. Objective and subjective evaluations against baselines
confirm that our approach produces natural and diverse gesticulation and
demonstrate the benefits of diffusion-based models for speech-driven gesture
synthesis.
Related papers
- Spontaneous Style Text-to-Speech Synthesis with Controllable Spontaneous Behaviors Based on Language Models [55.898594710420326]
We propose a novel spontaneous speech synthesis system based on language models.
Fine-grained prosody modeling is introduced to enhance the model's ability to capture subtle prosody variations in spontaneous speech.
arXiv Detail & Related papers (2024-07-18T13:42:38Z)
- UnifiedGesture: A Unified Gesture Synthesis Model for Multiple Skeletons [16.52004713662265]
We present a novel diffusion model-based speech-driven gesture synthesis approach, trained on multiple gesture datasets with different skeletons.
We then capture the correlation between speech and gestures based on a diffusion model architecture using cross-local attention and self-attention.
Experiments show that UnifiedGesture outperforms recent approaches on speech-driven gesture generation in terms of CCA, FGD, and human-likeness.
arXiv Detail & Related papers (2023-09-13T16:07:25Z)
- Diff-TTSG: Denoising probabilistic integrated speech and gesture synthesis [19.35266496960533]
We present the first diffusion-based probabilistic model, called Diff-TTSG, that jointly learns to synthesise speech and gestures together.
We describe a set of careful uni- and multi-modal subjective tests for evaluating integrated speech and gesture synthesis systems.
arXiv Detail & Related papers (2023-06-15T18:02:49Z)
- Motion-Conditioned Diffusion Model for Controllable Video Synthesis [75.367816656045]
We introduce MCDiff, a conditional diffusion model that generates a video from a starting image frame and a set of strokes.
We show that MCDiff achieves state-of-the-art visual quality in stroke-guided controllable video synthesis.
arXiv Detail & Related papers (2023-04-27T17:59:32Z)
- DiffVoice: Text-to-Speech with Latent Diffusion [18.150627638754923]
We present DiffVoice, a novel text-to-speech model based on latent diffusion.
Subjective evaluations on LJSpeech and LibriTTS datasets demonstrate that our method beats the best publicly available systems in naturalness.
arXiv Detail & Related papers (2023-04-23T21:05:33Z)
- A Survey on Audio Diffusion Models: Text To Speech Synthesis and Enhancement in Generative AI [64.71397830291838]
Generative AI has demonstrated impressive performance in various fields, among which speech synthesis is an interesting direction.
With the diffusion model as the most popular generative model, numerous works have attempted two active tasks: text to speech and speech enhancement.
This work conducts a survey on audio diffusion models, complementary to existing surveys.
arXiv Detail & Related papers (2023-03-23T15:17:15Z)
- Co-Speech Gesture Synthesis using Discrete Gesture Token Learning [1.1694169299062596]
Synthesizing realistic co-speech gestures is an important and yet unsolved problem for creating believable motions.
One challenge in learning the co-speech gesture model is that there may be multiple viable gesture motions for the same speech utterance.
We propose a two-stage model that addresses this uncertainty by modeling gesture segments as discrete latent codes.
arXiv Detail & Related papers (2023-03-04T01:42:09Z)
- Listen, Denoise, Action! Audio-Driven Motion Synthesis with Diffusion Models [22.000197530493445]
We show that diffusion models are an excellent fit for synthesising human motion that co-occurs with audio.
We adapt the DiffWave architecture to model 3D pose sequences, putting Conformers in place of dilated convolutions for improved modelling power (a rough sketch of this swap follows the entry).
Experiments on gesture and dance generation confirm that the proposed method achieves top-of-the-line motion quality.
arXiv Detail & Related papers (2022-11-17T17:41:00Z)
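The entry above mentions replacing DiffWave's dilated convolutions with Conformer blocks. Below is a rough, hypothetical sketch of that idea using torchaudio's Conformer as a stand-in; the class and dimensions are assumptions rather than the authors' implementation, and the diffusion-step and audio conditioning are omitted.

```python
# Hypothetical pose denoiser in the spirit of "Listen, Denoise, Action!":
# a DiffWave-style noise predictor whose dilated-conv stack is swapped for
# a Conformer encoder. Not the authors' code; torchaudio's Conformer is a
# stand-in, and diffusion-step/audio conditioning is omitted for brevity.
import torch
from torchaudio.models import Conformer

class PoseDenoiser(torch.nn.Module):
    def __init__(self, pose_dim=45, model_dim=256):
        super().__init__()
        self.proj_in = torch.nn.Linear(pose_dim, model_dim)
        self.conformer = Conformer(input_dim=model_dim, num_heads=4,
                                   ffn_dim=512, num_layers=6,
                                   depthwise_conv_kernel_size=31)
        self.proj_out = torch.nn.Linear(model_dim, pose_dim)

    def forward(self, noisy_poses, lengths):
        # noisy_poses: (batch, frames, pose_dim); lengths: valid frames per item
        h = self.proj_in(noisy_poses)
        h, _ = self.conformer(h, lengths)   # Conformer in place of dilated convs
        return self.proj_out(h)             # predicted noise, one vector per frame

x = torch.randn(2, 80, 45)                  # two 80-frame 3D-pose sequences
eps_hat = PoseDenoiser()(x, torch.tensor([80, 80]))
```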
- Freeform Body Motion Generation from Speech [53.50388964591343]
Body motion generation from speech is inherently difficult due to the non-deterministic mapping from speech to body motions.
We introduce a novel freeform motion generation model (FreeMo) equipped with a two-stream architecture.
Experiments demonstrate superior performance against several baselines.
arXiv Detail & Related papers (2022-03-04T13:03:22Z)
- Discretization and Re-synthesis: an alternative method to solve the Cocktail Party Problem [65.25725367771075]
This study demonstrates, for the first time, that the synthesis-based approach can also perform well on this problem.
Specifically, we propose a novel speech separation/enhancement model based on the recognition of discrete symbols.
After predicting the discrete symbol sequence, each target speech signal can be re-synthesized by feeding the symbols to the synthesis model.
arXiv Detail & Related papers (2021-12-17T08:35:40Z)
- Towards Multi-Scale Style Control for Expressive Speech Synthesis [60.08928435252417]
The proposed method employs a multi-scale reference encoder to extract both the global-scale utterance-level and the local-scale quasi-phoneme-level style features of the target speech.
During training, the multi-scale style model can be jointly trained with the speech synthesis model in an end-to-end fashion.
arXiv Detail & Related papers (2021-04-08T05:50:09Z)