Controllable Emphasis with zero data for text-to-speech
- URL: http://arxiv.org/abs/2307.07062v1
- Date: Thu, 13 Jul 2023 21:06:23 GMT
- Title: Controllable Emphasis with zero data for text-to-speech
- Authors: Arnaud Joly, Marco Nicolis, Ekaterina Peterova, Alessandro Lombardi,
Ammar Abbas, Arent van Korlaar, Aman Hussain, Parul Sharma, Alexis Moinet,
Mateusz Lajszczak, Penny Karanasou, Antonio Bonafonte, Thomas Drugman, Elena
Sokolova
- Abstract summary: A simple but effective method to achieve emphasized speech consists in increasing the predicted duration of the emphasised word.
We show that this is significantly better than spectrogram modification techniques improving naturalness by $7.3%$ and correct testers' identification of the emphasised word in a sentence by $40%$ on a reference female en-US voice.
- Score: 57.12383531339368
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a scalable method to produce high quality emphasis for
text-to-speech (TTS) that does not require recordings or annotations. Many TTS
models include a phoneme duration model. A simple but effective method to
achieve emphasized speech consists in increasing the predicted duration of the
emphasised word. We show that this is significantly better than spectrogram
modification techniques improving naturalness by $7.3\%$ and correct testers'
identification of the emphasized word in a sentence by $40\%$ on a reference
female en-US voice. We show that this technique significantly closes the gap to
methods that require explicit recordings. The method proved to be scalable and
preferred in all four languages tested (English, Spanish, Italian, German), for
different voices and multiple speaking styles.
Related papers
- Natural language guidance of high-fidelity text-to-speech with synthetic
annotations [13.642358232817342]
We propose a scalable method for labeling various aspects of speaker identity, style, and recording conditions.
We then apply this method to a 45k hour dataset, which we use to train a speech language model.
Our results demonstrate high-fidelity speech generation in a diverse range of accents, prosodic styles, channel conditions, and acoustic conditions.
arXiv Detail & Related papers (2024-02-02T21:29:34Z) - TextrolSpeech: A Text Style Control Speech Corpus With Codec Language
Text-to-Speech Models [51.529485094900934]
We propose TextrolSpeech, which is the first large-scale speech emotion dataset annotated with rich text attributes.
We introduce a multi-stage prompt programming approach that effectively utilizes the GPT model for generating natural style descriptions in large volumes.
To address the need for generating audio with greater style diversity, we propose an efficient architecture called Salle.
arXiv Detail & Related papers (2023-08-28T09:06:32Z) - Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive
Bias [71.94109664001952]
Mega-TTS is a novel zero-shot TTS system that is trained with large-scale wild data.
We show that Mega-TTS surpasses state-of-the-art TTS systems on zero-shot TTS speech editing, and cross-lingual TTS tasks.
arXiv Detail & Related papers (2023-06-06T08:54:49Z) - Few-Shot Cross-Lingual TTS Using Transferable Phoneme Embedding [55.989376102986654]
This paper studies a transferable phoneme embedding framework that aims to deal with the cross-lingual text-to-speech problem under the few-shot setting.
We propose a framework that consists of a phoneme-based TTS model and a codebook module to project phonemes from different languages into a learned latent space.
arXiv Detail & Related papers (2022-06-27T11:24:40Z) - Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for
Text-to-Speech [88.22544315633687]
Polyphone disambiguation aims to capture accurate pronunciation knowledge from natural text sequences for reliable Text-to-speech systems.
We propose Dict-TTS, a semantic-aware generative text-to-speech model with an online website dictionary.
Experimental results in three languages show that our model outperforms several strong baseline models in terms of pronunciation accuracy.
arXiv Detail & Related papers (2022-06-05T10:50:34Z) - Transfer Learning Framework for Low-Resource Text-to-Speech using a
Large-Scale Unlabeled Speech Corpus [10.158584616360669]
Training a text-to-speech (TTS) model requires a large scale text labeled speech corpus.
We propose a transfer learning framework for TTS that utilizes a large amount of unlabeled speech dataset for pre-training.
arXiv Detail & Related papers (2022-03-29T11:26:56Z) - Distribution augmentation for low-resource expressive text-to-speech [18.553812159109253]
This paper presents a novel data augmentation technique for text-to-speech (TTS)
It allows to generate new (text, audio) training examples without requiring any additional data.
arXiv Detail & Related papers (2022-02-13T21:19:31Z) - Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.