Enhancing audio quality for expressive Neural Text-to-Speech
- URL: http://arxiv.org/abs/2108.06270v1
- Date: Fri, 13 Aug 2021 14:32:39 GMT
- Title: Enhancing audio quality for expressive Neural Text-to-Speech
- Authors: Abdelhamid Ezzerg, Adam Gabrys, Bartosz Putrycz, Daniel Korzekwa,
Daniel Saez-Trigueros, David McHardy, Kamil Pokora, Jakub Lachowicz, Jaime
Lorenzo-Trueba, Viacheslav Klimkov
- Abstract summary: We present a set of techniques that can be leveraged to enhance the signal quality of a highly-expressive voice without the use of additional data.
We show that, when combined, these techniques narrowed the gap in perceived naturalness between the baseline system and recordings by 39% in terms of MUSHRA scores for an expressive celebrity voice.
- Score: 8.199224915764672
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Artificial speech synthesis has made a great leap in terms of naturalness as
recent Text-to-Speech (TTS) systems are capable of producing speech with
similar quality to human recordings. However, not all speaking styles are easy
to model: highly expressive voices are still challenging even to recent TTS
architectures since there seems to be a trade-off between expressiveness in a
generated audio and its signal quality. In this paper, we present a set of
techniques that can be leveraged to enhance the signal quality of a
highly-expressive voice without the use of additional data. The proposed
techniques include: tuning the autoregressive loop's granularity during
training; using Generative Adversarial Networks in acoustic modelling; and the
use of Variational Auto-Encoders in both the acoustic model and the neural
vocoder. We show that, when combined, these techniques narrow the gap in
perceived naturalness between the baseline system and recordings by 39% in
terms of MUSHRA scores for an expressive celebrity voice.
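The 39% figure is a relative gap closure: the fraction of the baseline-to-recordings MUSHRA gap that the improved system recovers. A minimal sketch of that calculation, using hypothetical scores (not the paper's actual numbers):

```python
# Relative MUSHRA gap closure (hypothetical scores, not the paper's data).
def gap_closure(baseline, improved, recordings):
    """Fraction of the baseline-to-recordings gap closed by the improved system."""
    return (improved - baseline) / (recordings - baseline)

# Hypothetical mean MUSHRA scores on a 0-100 scale.
baseline, improved, recordings = 60.0, 67.8, 80.0
print(f"{gap_closure(baseline, improved, recordings):.0%}")  # prints 39%
```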
Related papers
- Coding Speech through Vocal Tract Kinematics [5.0751585360524425]
Articulatory features are traces of kinematic shapes of vocal tract articulators and source features, which are intuitively interpretable and controllable.
Speaker embedding is effectively disentangled from articulations, which enables accent-preserving zero-shot voice conversion.
arXiv Detail & Related papers (2024-06-18T18:38:17Z)
- Non-autoregressive real-time Accent Conversion model with voice cloning [0.0]
We have developed a non-autoregressive model for real-time accent conversion with voice cloning.
The model generates native-sounding L1 speech with minimal latency based on input L2 speech.
The model has the ability to save, clone and change the timbre, gender and accent of the speaker's voice in real time.
arXiv Detail & Related papers (2024-05-21T19:07:26Z)
- Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt [50.25271407721519]
We propose Prompt-Singer, the first SVS method that enables attribute controlling on singer gender, vocal range and volume with natural language.
We adopt a model architecture based on a decoder-only transformer with a multi-scale hierarchy, and design a range-melody decoupled pitch representation.
Experiments show that our model achieves favorable controlling ability and audio quality.
arXiv Detail & Related papers (2024-03-18T13:39:05Z)
- NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models [127.47252277138708]
We propose NaturalSpeech 3, a TTS system with factorized diffusion models to generate natural speech in a zero-shot way.
Specifically, we design a neural codec with factorized vector quantization (FVQ) to disentangle the speech waveform into subspaces of content, prosody, timbre, and acoustic details.
Experiments show that NaturalSpeech 3 outperforms the state-of-the-art TTS systems on quality, similarity, prosody, and intelligibility.
arXiv Detail & Related papers (2024-03-05T16:35:25Z)
- On the Semantic Latent Space of Diffusion-Based Text-to-Speech Models [15.068637971987224]
We explore the latent space of frozen TTS models, which is composed of the latent bottleneck activations of the DDM's denoiser.
We identify that this space contains rich semantic information, and outline several novel methods for finding semantic directions within it, both supervised and unsupervised.
We demonstrate how these enable off-the-shelf audio editing, without any further training, architectural changes or data requirements.
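One common supervised way to find a semantic direction in a latent space is the difference of mean latents between two attribute-labeled groups. The sketch below illustrates that general idea with hypothetical 4-dimensional latents; the paper's actual methods and latent dimensions may differ.

```python
# Supervised semantic-direction sketch: direction = normalized difference of
# group means. All vectors here are hypothetical illustrations.

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def direction_between(group_a, group_b):
    """Unit vector pointing from the mean of group_b toward the mean of group_a."""
    ma, mb = mean_vector(group_a), mean_vector(group_b)
    d = [a - b for a, b in zip(ma, mb)]
    norm = sum(x * x for x in d) ** 0.5
    return [x / norm for x in d]

# Hypothetical bottleneck activations for "loud" vs. "quiet" utterances.
loud = [[1.0, 0.2, 0.0, 0.1], [0.8, 0.1, 0.1, 0.0]]
quiet = [[-0.9, 0.2, 0.0, 0.1], [-1.1, 0.1, 0.1, 0.0]]
d = direction_between(loud, quiet)

# Editing: push a latent along the direction with a chosen strength.
z = [0.0, 0.0, 0.0, 0.0]
z_edited = [zi + 2.0 * di for zi, di in zip(z, d)]
```

Because the direction is found in latent space rather than learned, no retraining or architectural change is needed, which matches the "off-the-shelf editing" claim above.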
arXiv Detail & Related papers (2024-02-19T16:22:21Z)
- EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis [49.04496602282718]
We introduce Expresso, a high-quality expressive speech dataset for textless speech synthesis.
This dataset includes both read speech and improvised dialogues rendered in 26 spontaneous expressive styles.
We evaluate resynthesis quality with automatic metrics for different self-supervised discrete encoders.
arXiv Detail & Related papers (2023-08-10T17:41:19Z)
- Make-A-Voice: Unified Voice Synthesis With Discrete Representation [77.3998611565557]
Make-A-Voice is a unified framework for synthesizing and manipulating voice signals from discrete representations.
We show that Make-A-Voice exhibits superior audio quality and style similarity compared with competitive baseline models.
arXiv Detail & Related papers (2023-05-30T17:59:26Z)
- NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers [90.83782600932567]
We develop NaturalSpeech 2, a TTS system that leverages a neural audio codec with residual vector quantizers to obtain the quantized latent vectors.
We scale NaturalSpeech 2 to large-scale datasets with 44K hours of speech and singing data and evaluate its voice quality on unseen speakers.
NaturalSpeech 2 outperforms previous TTS systems by a large margin in terms of prosody/timbre similarity, robustness, and voice quality in a zero-shot setting.
arXiv Detail & Related papers (2023-04-18T16:31:59Z)
- Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z)
- Advances in Speech Vocoding for Text-to-Speech with Continuous Parameters [2.6572330982240935]
This paper presents new techniques for a continuous vocoder, in which all features are continuous, yielding a flexible speech synthesis system.
New continuous noise masking based on the phase distortion is proposed to eliminate the perceptual impact of the residual noise.
Bidirectional long short-term memory (LSTM) and gated recurrent unit (GRU) networks are studied and applied to model the continuous parameters for more natural, human-like speech.
arXiv Detail & Related papers (2021-06-19T12:05:01Z)
- Low-resource expressive text-to-speech using data augmentation [12.396086122947679]
We present a novel 3-step methodology to circumvent the costly operation of recording large amounts of target data.
We augment data via voice conversion by leveraging recordings in the desired speaking style from other speakers.
Next, we use that synthetic data on top of the available recordings to train a TTS model.
arXiv Detail & Related papers (2020-11-11T11:22:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.