Diff-TTSG: Denoising probabilistic integrated speech and gesture
synthesis
- URL: http://arxiv.org/abs/2306.09417v3
- Date: Wed, 9 Aug 2023 12:41:48 GMT
- Title: Diff-TTSG: Denoising probabilistic integrated speech and gesture
synthesis
- Authors: Shivam Mehta, Siyang Wang, Simon Alexanderson, Jonas Beskow, Éva Székely, Gustav Eje Henter
- Abstract summary: We present the first diffusion-based probabilistic model, called Diff-TTSG, that jointly learns to synthesise speech and gestures together.
We describe a set of careful uni- and multi-modal subjective tests for evaluating integrated speech and gesture synthesis systems.
- Score: 19.35266496960533
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With read-aloud speech synthesis achieving high naturalness scores, there is
a growing research interest in synthesising spontaneous speech. However, human
spontaneous face-to-face conversation has both spoken and non-verbal aspects
(here, co-speech gestures). Only recently has research begun to explore the
benefits of jointly synthesising these two modalities in a single system. The
previous state of the art used non-probabilistic methods, which fail to capture
the variability of human speech and motion, and risk producing oversmoothing
artefacts and sub-optimal synthesis quality. We present the first
diffusion-based probabilistic model, called Diff-TTSG, that jointly learns to
synthesise speech and gestures together. Our method can be trained on small
datasets from scratch. Furthermore, we describe a set of careful uni- and
multi-modal subjective tests for evaluating integrated speech and gesture
synthesis systems, and use them to validate our proposed approach. Please see
https://shivammehta25.github.io/Diff-TTSG/ for video examples, data, and code.
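As a rough point of reference, the sketch below illustrates the general idea behind diffusion-based joint synthesis: a single denoiser operates on concatenated speech (mel-spectrogram) and gesture (pose) feature streams, and plain DDPM ancestral sampling turns noise into both modalities at once. This is a hypothetical, unconditional toy example, not the authors' Diff-TTSG implementation; the class JointDenoiser, the feature sizes, and the noise schedule are illustrative assumptions, and text conditioning is omitted.

```python
# Hypothetical sketch of joint diffusion sampling over speech + gesture features.
# NOT the Diff-TTSG implementation; all names and sizes are illustrative.
import torch
import torch.nn as nn

N_MELS, N_POSE, T_FRAMES, STEPS = 80, 45, 200, 50  # assumed feature/step sizes

class JointDenoiser(nn.Module):
    """Predicts the noise added to concatenated [mel | pose] frames."""
    def __init__(self, d_in=N_MELS + N_POSE, d_hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in + 1, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, d_in),
        )

    def forward(self, x_t, t):
        # x_t: (batch, frames, N_MELS + N_POSE); t: (batch,) diffusion step index
        t_emb = t.float().view(-1, 1, 1).expand(-1, x_t.size(1), 1) / STEPS
        return self.net(torch.cat([x_t, t_emb], dim=-1))

@torch.no_grad()
def sample(model, batch=1):
    """DDPM ancestral sampling over the joint speech-and-gesture feature stream."""
    betas = torch.linspace(1e-4, 0.02, STEPS)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(batch, T_FRAMES, N_MELS + N_POSE)   # start from pure noise
    for t in reversed(range(STEPS)):
        eps = model(x, torch.full((batch,), t))
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise

    mel, pose = x.split([N_MELS, N_POSE], dim=-1)        # split back into modalities
    return mel, pose

mel, pose = sample(JointDenoiser())
print(mel.shape, pose.shape)  # (1, 200, 80) (1, 200, 45)
```

Because both modalities are denoised by one model, the sample reflects their joint distribution rather than two independently drawn streams; in an actual system the denoiser would also be conditioned on the input text.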
Related papers
- Spontaneous Style Text-to-Speech Synthesis with Controllable Spontaneous Behaviors Based on Language Models [55.898594710420326]
We propose a novel spontaneous speech synthesis system based on language models.
Fine-grained prosody modeling is introduced to enhance the model's ability to capture subtle prosody variations in spontaneous speech.
arXiv Detail & Related papers (2024-07-18T13:42:38Z) - Unified speech and gesture synthesis using flow matching [24.2094371314481]
This paper presents a novel, unified architecture for jointly synthesising speech acoustics and skeleton-based 3D gesture motion from text.
The proposed architecture is simpler than the previous state of the art, has a smaller memory footprint, and can capture the joint distribution of speech and gestures.
arXiv Detail & Related papers (2023-10-08T14:37:28Z) - EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech
Resynthesis [49.04496602282718]
We introduce Expresso, a high-quality expressive speech dataset for textless speech synthesis.
This dataset includes both read speech and improvised dialogues rendered in 26 spontaneous expressive styles.
We evaluate resynthesis quality with automatic metrics for different self-supervised discrete encoders.
arXiv Detail & Related papers (2023-08-10T17:41:19Z) - A Vector Quantized Approach for Text to Speech Synthesis on Real-World
Spontaneous Speech [94.64927912924087]
We train TTS systems using real-world speech from YouTube and podcasts.
The proposed text-to-speech architecture is designed for multiple code generation and monotonic alignment.
We show that this architecture outperforms existing TTS systems in several objective and subjective measures.
arXiv Detail & Related papers (2023-02-08T17:34:32Z) - TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech shows a significant improvement in inference latency, enabling up to a 21.4x speedup over the autoregressive technique.
arXiv Detail & Related papers (2022-05-25T06:34:14Z) - Learning Hierarchical Cross-Modal Association for Co-Speech Gesture
Generation [107.10239561664496]
We propose a novel framework named Hierarchical Audio-to-Gesture (HA2G) for co-speech gesture generation.
The proposed method renders realistic co-speech gestures and outperforms previous methods by a clear margin.
arXiv Detail & Related papers (2022-03-24T16:33:29Z) - Discretization and Re-synthesis: an alternative method to solve the
Cocktail Party Problem [65.25725367771075]
This study demonstrates, for the first time, that the synthesis-based approach can also perform well on this problem.
Specifically, we propose a novel speech separation/enhancement model based on the recognition of discrete symbols.
After the discrete symbol sequence is predicted, each target speech signal can be re-synthesised by feeding those symbols into the synthesis model.
arXiv Detail & Related papers (2021-12-17T08:35:40Z) - Integrated Speech and Gesture Synthesis [26.267738299876314]
Text-to-speech and co-speech gesture synthesis have until now been treated as separate areas by two different research communities.
We propose to synthesize the two modalities in a single model, a new problem we call integrated speech and gesture synthesis (ISG).
The model is able to achieve this with faster synthesis time and a greatly reduced parameter count compared to the pipeline system.
arXiv Detail & Related papers (2021-08-25T19:04:00Z) - Generating coherent spontaneous speech and gesture from text [21.90157862281996]
Embodied human communication encompasses both verbal (speech) and non-verbal information (e.g., gesture and head movements).
Recent advances in machine learning have substantially improved the technologies for generating synthetic versions of both of these types of data.
We put these two state-of-the-art technologies together in a coherent fashion for the first time.
arXiv Detail & Related papers (2021-01-14T16:02:21Z) - Speech Synthesis as Augmentation for Low-Resource ASR [7.2244067948447075]
Speech synthesis might hold the key to low-resource speech recognition.
Data augmentation techniques have become an essential part of modern speech recognition training.
Speech synthesis techniques have been rapidly getting closer to the goal of achieving human-like speech.
arXiv Detail & Related papers (2020-12-23T22:19:42Z) - Laughter Synthesis: Combining Seq2seq modeling with Transfer Learning [6.514358246805895]
We propose an audio laughter synthesis system based on a sequence-to-sequence TTS synthesis system.
We leverage transfer learning by training a deep learning model to learn to generate both speech and laughs from annotations.
arXiv Detail & Related papers (2020-08-20T09:37:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.