Taming Diffusion Models for Audio-Driven Co-Speech Gesture Generation
- URL: http://arxiv.org/abs/2303.09119v2
- Date: Sat, 18 Mar 2023 10:11:51 GMT
- Title: Taming Diffusion Models for Audio-Driven Co-Speech Gesture Generation
- Authors: Lingting Zhu, Xian Liu, Xuanyu Liu, Rui Qian, Ziwei Liu, Lequan Yu
- Abstract summary: We propose a novel diffusion-based framework, named Diffusion Co-Speech Gesture (DiffGesture).
DiffGesture achieves state-of-the-art performance, rendering coherent gestures with better mode coverage and stronger audio correlations.
- Score: 41.292644854306594
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Animating virtual avatars to make co-speech gestures facilitates various
applications in human-machine interaction. The existing methods mainly rely on
generative adversarial networks (GANs), which typically suffer from notorious
mode collapse and unstable training, thus making it difficult to learn accurate
audio-gesture joint distributions. In this work, we propose a novel
diffusion-based framework, named Diffusion Co-Speech Gesture (DiffGesture), to
effectively capture the cross-modal audio-to-gesture associations and preserve
temporal coherence for high-fidelity audio-driven co-speech gesture generation.
Specifically, we first establish the diffusion-conditional generation process
on clips of skeleton sequences and audio to enable the whole framework. Then, a
novel Diffusion Audio-Gesture Transformer is devised to better attend to the
information from multiple modalities and model the long-term temporal
dependency. Moreover, to eliminate temporal inconsistency, we propose an
effective Diffusion Gesture Stabilizer with an annealed noise sampling
strategy. Benefiting from the architectural advantages of diffusion models, we
further incorporate implicit classifier-free guidance to trade off between
diversity and gesture quality. Extensive experiments demonstrate that
DiffGesture achieves state-of-the-art performance, which renders coherent
gestures with better mode coverage and stronger audio correlations. Code is
available at https://github.com/Advocate99/DiffGesture.
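The abstract describes a conditional denoising setup over paired skeleton and audio clips, with implicit classifier-free guidance trading off diversity against gesture quality. Below is a minimal PyTorch sketch of that idea; the module name GestureDenoiser, all dimensions, and the zeroed-audio "null" condition are illustrative assumptions, not the authors' implementation, and the Diffusion Gesture Stabilizer with annealed noise sampling is omitted.
```python
# Minimal sketch (not the authors' code): Transformer denoiser conditioned on
# audio, plus implicit classifier-free guidance at sampling time.
import torch
import torch.nn as nn

class GestureDenoiser(nn.Module):
    """Toy stand-in for the Diffusion Audio-Gesture Transformer."""
    def __init__(self, d_pose=126, d_audio=128, d_model=256, n_layers=4, n_steps=1000):
        super().__init__()
        self.pose_in = nn.Linear(d_pose, d_model)
        self.audio_in = nn.Linear(d_audio, d_model)
        self.t_embed = nn.Embedding(n_steps, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.out = nn.Linear(d_model, d_pose)

    def forward(self, x_t, t, audio):
        # x_t: (B, T, d_pose) noisy pose clip, t: (B,) step index, audio: (B, T, d_audio)
        h = self.pose_in(x_t) + self.audio_in(audio) + self.t_embed(t)[:, None, :]
        return self.out(self.backbone(h))  # predicted noise, (B, T, d_pose)

def guided_eps(model, x_t, t, audio, w=1.5):
    """Classifier-free guidance: blend conditional and unconditional predictions.
    Zeroing the audio stands in for the null condition (an assumption)."""
    eps_cond = model(x_t, t, audio)
    eps_uncond = model(x_t, t, torch.zeros_like(audio))
    return eps_uncond + w * (eps_cond - eps_uncond)
```
In a standard DDPM reverse step, guided_eps would replace the plain noise prediction; raising the guidance weight w favors gesture quality and audio correlation over diversity, mirroring the trade-off described in the abstract.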
Related papers
- DiffTED: One-shot Audio-driven TED Talk Video Generation with Diffusion-based Co-speech Gestures [27.763304632981882]
We introduce DiffTED, a new approach for one-shot audio-driven talking video generation from a single image.
We leverage a diffusion model to generate sequences of keypoints for a Thin-Plate Spline motion model.
Experiments demonstrate that DiffTED generates temporally coherent talking videos with diverse co-speech gestures.
arXiv Detail & Related papers (2024-09-11T22:31:55Z)
- A Versatile Diffusion Transformer with Mixture of Noise Levels for Audiovisual Generation [32.648815593259485]
Training diffusion models for audiovisual sequences allows for a range of generation tasks.
We propose a novel training approach to effectively learn arbitrary conditional distributions in the audiovisual space.
arXiv Detail & Related papers (2024-05-22T15:47:14Z)
- FaceChain-ImagineID: Freely Crafting High-Fidelity Diverse Talking Faces from Disentangled Audio [45.71036380866305]
We abstract the process of hearing speech, extracting meaningful cues, and dynamically producing audio-consistent talking faces from a single audio clip.
Specifically, it involves two critical challenges: one is to effectively decouple identity, content, and emotion from entangled audio, and the other is to maintain intra-video diversity and inter-video consistency.
We introduce the Controllable Coherent Frame generation, which involves the flexible integration of three trainable adapters with frozen Latent Diffusion Models.
arXiv Detail & Related papers (2024-03-04T09:59:48Z)
- High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
A non-autoregressive framework enhances controllability, and a duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z)
- DiffSED: Sound Event Detection with Denoising Diffusion [70.18051526555512]
We reformulate the SED problem by taking a generative learning perspective.
Specifically, we aim to generate sound temporal boundaries from noisy proposals in a denoising diffusion process.
During training, our model learns to reverse the noising process by converting noisy latent queries to the ground-truth versions.
arXiv Detail & Related papers (2023-08-14T17:29:41Z)
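The DiffSED entry above casts detection as denoising: ground-truth event boundaries are corrupted and the model learns to recover them from noisy queries. A rough sketch of such a training step follows; the decoder callable, the (start, end) boundary parameterization, and the L1 loss are illustrative assumptions, not the paper's exact objective.
```python
# Hedged sketch of a denoising-diffusion training step over event boundaries.
import torch
import torch.nn.functional as F

def diffsed_train_step(decoder, audio_feats, gt_boundaries, alphas_cumprod):
    """audio_feats: (B, T, D); gt_boundaries: (B, N, 2) normalized (start, end)."""
    B = gt_boundaries.size(0)
    t = torch.randint(0, alphas_cumprod.numel(), (B,), device=gt_boundaries.device)
    a_bar = alphas_cumprod[t].view(B, 1, 1)
    noise = torch.randn_like(gt_boundaries)
    noisy = a_bar.sqrt() * gt_boundaries + (1 - a_bar).sqrt() * noise  # forward process
    pred = decoder(noisy, t, audio_feats)  # placeholder network: denoised boundaries
    return F.l1_loss(pred, gt_boundaries)

# Example schedule (assumption): betas = torch.linspace(1e-4, 0.02, 1000)
#                                alphas_cumprod = torch.cumprod(1 - betas, dim=0)
```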
- A Cheaper and Better Diffusion Language Model with Soft-Masked Noise [62.719656543880596]
Masked-Diffuse LM is a novel diffusion model for language modeling, inspired by linguistic features of language.
Specifically, we design a linguistic-informed forward process which adds corruptions to the text through strategically soft-masking to better noise the textual data.
We demonstrate that our Masked-Diffuse LM can achieve better generation quality than the state-of-the-art diffusion models with better efficiency.
arXiv Detail & Related papers (2023-04-10T17:58:42Z)
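The Masked-Diffuse LM entry above describes a forward process that corrupts text by strategic soft-masking. One plausible reading is sketched below: token embeddings are blended toward a [MASK] embedding with a strength that grows over diffusion steps and whose onset is scheduled per token by an importance weight (e.g. tf-idf). The weighting direction and schedule are assumptions, not the paper's exact design.
```python
import torch

def soft_mask_forward(token_emb, mask_emb, importance, t, T):
    """Soft-mask corruption at diffusion step t of T (assumed scheduling).

    token_emb:  (B, L, D) clean token embeddings
    mask_emb:   (D,)      embedding of the [MASK] token
    importance: (B, L)    per-token importance weights in [0, 1]
    Assumption: less important tokens start being masked earlier.
    """
    progress = t / T                                    # scalar in [0, 1]
    onset = importance                                  # masking onset per token
    strength = ((progress - onset) / (1 - onset + 1e-6)).clamp(0, 1).unsqueeze(-1)
    return (1 - strength) * token_emb + strength * mask_emb
```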
- VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation [88.49030739715701]
This work presents a decomposed diffusion process that resolves the per-frame noise into a base noise shared among all frames and a residual noise that varies along the time axis.
Experiments on various datasets confirm that our approach, termed as VideoFusion, surpasses both GAN-based and diffusion-based alternatives in high-quality video generation.
arXiv Detail & Related papers (2023-03-15T02:16:39Z)
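The decomposition VideoFusion describes, a base noise shared by every frame plus a per-frame residual, can be illustrated as below; the fixed mixing weight lam is an assumption, since how the paper chooses the split may differ.
```python
import torch

def decomposed_noise(batch, frames, shape, lam=0.5):
    """Sample per-frame noise as a mix of shared base noise and per-frame residual.
    `lam` controls how much of the noise is shared; its value is an assumption."""
    base = torch.randn(batch, 1, *shape)           # shared across all frames
    resid = torch.randn(batch, frames, *shape)     # varies along the time axis
    # Scaled so each frame's noise stays unit-variance Gaussian.
    return (lam ** 0.5) * base + ((1 - lam) ** 0.5) * resid
```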
- TransFusion: Transcribing Speech with Multinomial Diffusion [20.165433724198937]
We propose a new way to perform speech recognition using a diffusion model conditioned on pretrained speech features.
We demonstrate comparable performance to existing high-performing contrastive models on the LibriSpeech speech recognition benchmark.
We also propose new techniques for effectively sampling and decoding multinomial diffusion models.
arXiv Detail & Related papers (2022-10-14T10:01:43Z)
- Looking into Your Speech: Learning Cross-modal Affinity for Audio-visual Speech Separation [73.1652905564163]
We address the problem of separating individual speech signals from videos using audio-visual neural processing.
Most conventional approaches utilize frame-wise matching criteria to extract shared information between co-occurring audio and video.
We propose a cross-modal affinity network (CaffNet) that learns global correspondence as well as locally-varying affinities between audio and visual streams.
arXiv Detail & Related papers (2021-03-25T15:39:12Z)
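As a rough illustration of the cross-modal affinity idea in the last entry, the snippet below computes a global audio-visual affinity matrix with scaled dot products and uses it to pool audio features for each video frame. The shapes, shared projection dimension, and softmax pooling are assumptions, not CaffNet's actual architecture.
```python
import torch
import torch.nn.functional as F

def cross_modal_affinity(visual, audio):
    """visual: (B, Tv, D), audio: (B, Ta, D), both projected to a shared dimension.
    Returns the (B, Tv, Ta) affinity matrix and audio features aggregated per frame."""
    d = visual.size(-1)
    affinity = torch.matmul(visual, audio.transpose(1, 2)) / d ** 0.5  # (B, Tv, Ta)
    weights = F.softmax(affinity, dim=-1)
    audio_per_frame = torch.matmul(weights, audio)                     # (B, Tv, D)
    return affinity, audio_per_frame
```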
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.