Enhancing Speech Intelligibility in Text-To-Speech Synthesis using
Speaking Style Conversion
- URL: http://arxiv.org/abs/2008.05809v1
- Date: Thu, 13 Aug 2020 10:51:56 GMT
- Title: Enhancing Speech Intelligibility in Text-To-Speech Synthesis using
Speaking Style Conversion
- Authors: Dipjyoti Paul, Muhammed PV Shifas, Yannis Pantazis, Yannis Stylianou
- Abstract summary: We propose a novel transfer learning approach using Tacotron and WaveRNN based TTS synthesis.
The proposed speech system exploits two modification strategies: (a) Lombard speaking style data and (b) Spectral Shaping and Dynamic Range Compression (SSDRC).
Intelligibility enhancement as quantified by the Intelligibility in Bits measure shows that the proposed Lombard-SSDRC TTS system achieves significant relative improvements of 110% to 130% in speech-shaped noise (SSN) and 47% to 140% in competing-speaker noise (CSN).
- Score: 17.520533341887642
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The increased adoption of digital assistants makes text-to-speech (TTS)
synthesis systems an indispensable feature of modern mobile devices. It is
hence desirable to build a system capable of generating highly intelligible
speech in the presence of noise. Past studies have investigated style
conversion in TTS synthesis, yet degraded synthesized quality often leads to
worse intelligibility. To overcome such limitations, we propose a novel
transfer learning approach using Tacotron- and WaveRNN-based TTS synthesis. The
proposed speech system exploits two modification strategies: (a) Lombard
speaking style data and (b) Spectral Shaping and Dynamic Range Compression
(SSDRC), which has been shown to provide high intelligibility gains by
redistributing the signal energy in the time-frequency domain. We refer to this
extension as the Lombard-SSDRC TTS system. Intelligibility enhancement as
quantified by the Intelligibility in Bits (SIIB-Gauss) measure shows that the
proposed Lombard-SSDRC TTS system achieves significant relative improvements of
110% to 130% in speech-shaped noise (SSN), and 47% to 140% in
competing-speaker noise (CSN), against the state-of-the-art TTS approach.
Additional subjective evaluation shows that Lombard-SSDRC TTS successfully
increases speech intelligibility, with relative improvements in median keyword
correction rate of 455% for SSN and 104% for CSN compared to the baseline TTS
method.
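The two SSDRC stages named in the abstract can be illustrated with a minimal toy sketch. This is not the published SSDRC algorithm: a simple pre-emphasis filter stands in for the spectral-shaping stage, and a frame-wise envelope gain stands in for the dynamic range compression stage; all function names and parameter values below are my own illustrative choices.

```python
import numpy as np

def spectral_shaping(x, coef=0.97):
    # Crude spectral shaping: a pre-emphasis filter that boosts high
    # frequencies, a stand-in for the formant-sharpening stage of SSDRC.
    y = np.empty_like(x)
    y[0] = x[0]
    y[1:] = x[1:] - coef * x[:-1]
    return y

def dynamic_range_compression(x, frame=256, power=0.3, eps=1e-8):
    # Frame-wise envelope compression: quiet frames are amplified and loud
    # frames attenuated, flattening the temporal envelope of the signal.
    y = x.copy()
    for start in range(0, len(x), frame):
        seg = y[start:start + frame]
        rms = np.sqrt(np.mean(seg ** 2)) + eps
        gain = rms ** (power - 1.0)  # pushes each frame's RMS toward unity
        y[start:start + frame] = seg * gain
    return y

def ssdrc(x):
    # Chain both stages, then normalize the peak to avoid clipping.
    y = dynamic_range_compression(spectral_shaping(x))
    return y / (np.max(np.abs(y)) + 1e-8)
```

In the paper's setting, a transformation of this kind is applied to the Lombard-style training data so that the TTS model learns to produce the modified, more noise-robust speaking style directly.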
Related papers
- Hard-Synth: Synthesizing Diverse Hard Samples for ASR using Zero-Shot TTS and LLM [48.71951982716363]
Text-to-speech (TTS) models have been widely adopted to enhance automatic speech recognition (ASR) systems.
We propose Hard-Synth, a novel ASR data augmentation method that leverages large language models (LLMs) and advanced zero-shot TTS.
Our approach employs LLMs to generate diverse in-domain text through rewriting, without relying on additional text data.
arXiv Detail & Related papers (2024-11-20T09:49:37Z) - Homogeneous Speaker Features for On-the-Fly Dysarthric and Elderly Speaker Adaptation [71.31331402404662]
This paper proposes two novel data-efficient methods to learn dysarthric and elderly speaker-level features.
Speaker-regularized spectral basis embedding (SBE) features exploit a special regularization term to enforce homogeneity of speaker features during adaptation.
Feature-based learning hidden unit contributions (f-LHUC), conditioned on VR-LH features, are shown to be insensitive to speaker-level data quantity in test-time adaptation.
arXiv Detail & Related papers (2024-07-08T18:20:24Z) - EM-TTS: Efficiently Trained Low-Resource Mongolian Lightweight Text-to-Speech [4.91849983180793]
We propose a lightweight Text-to-Speech (TTS) system based on deep convolutional neural networks.
Our model consists of two stages: Text2Spectrum and SSRN.
Experiments show that our model can reduce the training time and parameters while ensuring the quality and naturalness of the synthesized speech.
arXiv Detail & Related papers (2024-03-13T01:27:57Z) - Noise-robust zero-shot text-to-speech synthesis conditioned on
self-supervised speech-representation model with adapters [47.75276947690528]
The zero-shot text-to-speech (TTS) method can reproduce speaker characteristics very accurately.
However, this approach suffers from degradation in speech synthesis quality when the reference speech contains noise.
In this paper, we propose a noise-robust zero-shot TTS method.
arXiv Detail & Related papers (2024-01-10T12:21:21Z) - Any-speaker Adaptive Text-To-Speech Synthesis with Diffusion Models [65.28001444321465]
Grad-StyleSpeech is an any-speaker adaptive TTS framework based on a diffusion model.
It can generate highly natural speech with extremely high similarity to target speakers' voice, given a few seconds of reference speech.
It significantly outperforms speaker-adaptive TTS baselines on English benchmarks.
arXiv Detail & Related papers (2022-11-17T07:17:24Z) - Synthesizing Dysarthric Speech Using Multi-talker TTS for Dysarthric
Speech Recognition [4.637732011720613]
Dysarthria is a motor speech disorder often characterized by reduced speech intelligibility.
To have robust dysarthria-specific ASR, sufficient training speech is required.
Recent advances in Text-To-Speech synthesis suggest the possibility of using synthesis for data augmentation.
arXiv Detail & Related papers (2022-01-27T15:22:09Z) - Recent Progress in the CUHK Dysarthric Speech Recognition System [66.69024814159447]
Disordered speech presents a wide spectrum of challenges to current data intensive deep neural networks (DNNs) based automatic speech recognition technologies.
This paper presents recent research efforts at the Chinese University of Hong Kong to improve the performance of disordered speech recognition systems.
arXiv Detail & Related papers (2022-01-15T13:02:40Z) - Incremental Speech Synthesis For Speech-To-Speech Translation [23.951060578077445]
We focus on improving the incremental synthesis performance of TTS models.
With a simple data augmentation strategy based on prefixes, we are able to improve the incremental TTS quality to approach offline performance.
We propose latency metrics tailored to S2ST applications, and investigate methods for latency reduction in this context.
arXiv Detail & Related papers (2021-10-15T17:20:28Z) - Advances in Speech Vocoding for Text-to-Speech with Continuous
Parameters [2.6572330982240935]
This paper presents new techniques in a continuous vocoder, in which all features are continuous, yielding a flexible speech synthesis system.
New continuous noise masking based on the phase distortion is proposed to eliminate the perceptual impact of the residual noise.
Bidirectional long short-term memory (LSTM) and gated recurrent unit (GRU) networks are studied and applied to model the continuous parameters for more natural, human-like speech.
arXiv Detail & Related papers (2021-06-19T12:05:01Z) - Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource
End-to-End Speech Recognition [62.94773371761236]
We consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate.
We propose a method of dynamic acoustic unit augmentation based on the BPE-dropout technique.
Our monolingual Turkish Conformer established a competitive result with 22.2% character error rate (CER) and 38.9% word error rate (WER).
arXiv Detail & Related papers (2021-03-12T10:10:13Z)
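The BPE-dropout technique mentioned in the last entry stochastically skips learned merge operations during encoding, so the same word yields varied subword segmentations and rarer sub-units appear in training. A toy sketch under simplified assumptions (greedy pairwise merging over a hypothetical merge table; not the exact published algorithm):

```python
import random

def bpe_dropout_encode(word, merges, p_drop=0.1, rng=None):
    # Toy BPE-dropout: start from characters and repeatedly apply the
    # highest-priority learned merge, but skip each candidate merge with
    # probability p_drop. With p_drop=0 this reduces to standard BPE.
    rng = rng or random.Random(0)
    ranks = {pair: i for i, pair in enumerate(merges)}
    tokens = list(word)
    while True:
        # Collect applicable adjacent pairs that survive dropout this round.
        candidates = [
            (ranks[(a, b)], i)
            for i, (a, b) in enumerate(zip(tokens, tokens[1:]))
            if (a, b) in ranks and rng.random() >= p_drop
        ]
        if not candidates:
            break
        _, i = min(candidates)  # apply the earliest-learned surviving merge
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
    return tokens
```

For example, with merges learned in the order `[("h","e"), ("l","l"), ("he","ll"), ("hell","o")]`, a zero dropout rate encodes "hello" as the single token `hello`, while a dropout rate of 1.0 falls back to pure characters; intermediate rates produce a distribution over segmentations.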
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.