DiffMotion: Speech-Driven Gesture Synthesis Using Denoising Diffusion Model
- URL: http://arxiv.org/abs/2301.10047v1
- Date: Tue, 24 Jan 2023 14:44:03 GMT
- Title: DiffMotion: Speech-Driven Gesture Synthesis Using Denoising Diffusion Model
- Authors: Fan Zhang, Naye Ji, Fuxing Gao, Yongping Li
- Abstract summary: This paper presents DiffMotion, a novel speech-driven gesture synthesis architecture based on diffusion models.
The model comprises an autoregressive temporal encoder and a denoising diffusion probabilistic module.
Compared with baselines, objective and subjective evaluations confirm that our approach can produce natural and diverse gesticulation.
- Score: 3.8084817124151726
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speech-driven gesture synthesis is a field of growing interest in virtual
human creation. However, a critical challenge is the inherently intricate
one-to-many mapping between speech and gestures. Previous studies have explored
generative models and achieved significant progress; nevertheless, most
synthesized gestures still appear noticeably less natural than real human
motion. This paper presents DiffMotion, a novel speech-driven gesture synthesis
architecture based on diffusion models. The model comprises an autoregressive
temporal encoder and a denoising diffusion probabilistic module. The encoder
extracts the temporal context of the speech input and historical gestures. The
diffusion module learns a parameterized Markov chain that gradually converts a
simple distribution into a complex one and generates gestures according to the
accompanying speech. Objective and subjective evaluations against baselines
confirm that our approach produces natural and diverse gesticulation and
demonstrate the benefits of diffusion-based models for speech-driven gesture
synthesis.
Related papers
- Spontaneous Style Text-to-Speech Synthesis with Controllable Spontaneous Behaviors Based on Language Models [55.898594710420326]
We propose a novel spontaneous speech synthesis system based on language models.
Fine-grained prosody modeling is introduced to enhance the model's ability to capture subtle prosody variations in spontaneous speech.
arXiv Detail & Related papers (2024-07-18T13:42:38Z)
- UnifiedGesture: A Unified Gesture Synthesis Model for Multiple Skeletons [16.52004713662265]
We present a novel diffusion model-based speech-driven gesture synthesis approach, trained on multiple gesture datasets with different skeletons.
We then capture the correlation between speech and gestures based on a diffusion model architecture using cross-local attention and self-attention.
Experiments show that UnifiedGesture outperforms recent approaches on speech-driven gesture generation in terms of CCA, FGD, and human-likeness.
arXiv Detail & Related papers (2023-09-13T16:07:25Z)
- Diff-TTSG: Denoising probabilistic integrated speech and gesture synthesis [19.35266496960533]
We present the first diffusion-based probabilistic model, called Diff-TTSG, that jointly learns to synthesise speech and gestures together.
We describe a set of careful uni- and multi-modal subjective tests for evaluating integrated speech and gesture synthesis systems.
arXiv Detail & Related papers (2023-06-15T18:02:49Z)
- Motion-Conditioned Diffusion Model for Controllable Video Synthesis [75.367816656045]
We introduce MCDiff, a conditional diffusion model that generates a video from a starting image frame and a set of strokes.
We show that MCDiff achieves state-of-the-art visual quality in stroke-guided controllable video synthesis.
arXiv Detail & Related papers (2023-04-27T17:59:32Z)
- DiffVoice: Text-to-Speech with Latent Diffusion [18.150627638754923]
We present DiffVoice, a novel text-to-speech model based on latent diffusion.
Subjective evaluations on LJSpeech and LibriTTS datasets demonstrate that our method beats the best publicly available systems in naturalness.
arXiv Detail & Related papers (2023-04-23T21:05:33Z)
- A Survey on Audio Diffusion Models: Text To Speech Synthesis and Enhancement in Generative AI [64.71397830291838]
Generative AI has demonstrated impressive performance in various fields, among which speech synthesis is an interesting direction.
With the diffusion model as the most popular generative model, numerous works have attempted two active tasks: text to speech and speech enhancement.
This work conducts a survey on audio diffusion models, complementary to existing surveys.
arXiv Detail & Related papers (2023-03-23T15:17:15Z)
- Co-Speech Gesture Synthesis using Discrete Gesture Token Learning [1.1694169299062596]
Synthesizing realistic co-speech gestures is an important and yet unsolved problem for creating believable motions.
One challenge in learning the co-speech gesture model is that there may be multiple viable gesture motions for the same speech utterance.
We propose a two-stage model that addresses this uncertainty by modeling gesture segments as discrete latent codes.
arXiv Detail & Related papers (2023-03-04T01:42:09Z)
- Listen, Denoise, Action! Audio-Driven Motion Synthesis with Diffusion Models [22.000197530493445]
We show that diffusion models are an excellent fit for synthesising human motion that co-occurs with audio.
We adapt the DiffWave architecture to model 3D pose sequences, putting Conformers in place of dilated convolutions for improved modelling power (a rough sketch of this swap follows the entry).
Experiments on gesture and dance generation confirm that the proposed method achieves top-of-the-line motion quality.
arXiv Detail & Related papers (2022-11-17T17:41:00Z)
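The entry above mentions replacing DiffWave's dilated convolutions with Conformer blocks. Below is a rough, hypothetical sketch of that idea using torchaudio's Conformer as a stand-in; the class and dimensions are assumptions rather than the authors' implementation, and the diffusion-step and audio conditioning are omitted.

```python
# Hypothetical pose denoiser in the spirit of "Listen, Denoise, Action!":
# a DiffWave-style noise predictor whose dilated-conv stack is swapped for
# a Conformer encoder. Not the authors' code; torchaudio's Conformer is a
# stand-in, and diffusion-step/audio conditioning is omitted for brevity.
import torch
from torchaudio.models import Conformer

class PoseDenoiser(torch.nn.Module):
    def __init__(self, pose_dim=45, model_dim=256):
        super().__init__()
        self.proj_in = torch.nn.Linear(pose_dim, model_dim)
        self.conformer = Conformer(input_dim=model_dim, num_heads=4,
                                   ffn_dim=512, num_layers=6,
                                   depthwise_conv_kernel_size=31)
        self.proj_out = torch.nn.Linear(model_dim, pose_dim)

    def forward(self, noisy_poses, lengths):
        # noisy_poses: (batch, frames, pose_dim); lengths: valid frames per item
        h = self.proj_in(noisy_poses)
        h, _ = self.conformer(h, lengths)   # Conformer in place of dilated convs
        return self.proj_out(h)             # predicted noise, one vector per frame

x = torch.randn(2, 80, 45)                  # two 80-frame 3D-pose sequences
eps_hat = PoseDenoiser()(x, torch.tensor([80, 80]))
```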
- Freeform Body Motion Generation from Speech [53.50388964591343]
Body motion generation from speech is inherently difficult due to the non-deterministic mapping from speech to body motions.
We introduce a novel freeform motion generation model (FreeMo) equipped with a two-stream architecture.
Experiments demonstrate superior performance against several baselines.
arXiv Detail & Related papers (2022-03-04T13:03:22Z)
- Discretization and Re-synthesis: an alternative method to solve the Cocktail Party Problem [65.25725367771075]
This study demonstrates, for the first time, that the synthesis-based approach can also perform well on this problem.
Specifically, we propose a novel speech separation/enhancement model based on the recognition of discrete symbols.
After predicting the discrete symbol sequence, each target speech signal can be re-synthesized by feeding the symbols to the synthesis model.
arXiv Detail & Related papers (2021-12-17T08:35:40Z)
- Towards Multi-Scale Style Control for Expressive Speech Synthesis [60.08928435252417]
The proposed method employs a multi-scale reference encoder to extract both the global-scale utterance-level and the local-scale quasi-phoneme-level style features of the target speech.
During training, the multi-scale style model can be jointly trained with the speech synthesis model in an end-to-end fashion.
arXiv Detail & Related papers (2021-04-08T05:50:09Z)