Referee: Towards reference-free cross-speaker style transfer with low-quality data for expressive speech synthesis
- URL: http://arxiv.org/abs/2109.03439v1
- Date: Wed, 8 Sep 2021 05:39:34 GMT
- Title: Referee: Towards reference-free cross-speaker style transfer with low-quality data for expressive speech synthesis
- Authors: Songxiang Liu, Shan Yang, Dan Su, Dong Yu
- Abstract summary: Cross-speaker style transfer (CSST) in text-to-speech (TTS) aims at transferring a speaking style to the synthesised speech in a target speaker's voice.
This work presents Referee, a robust reference-free CSST approach for expressive TTS, which fully leverages low-quality data to learn speaking styles from text.
- Score: 39.730034713382736
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Cross-speaker style transfer (CSST) in text-to-speech (TTS) synthesis aims at
transferring a speaking style to the synthesised speech in a target speaker's
voice. Most previous CSST approaches rely on expensive high-quality data carrying the desired speaking style during training and require a reference utterance to obtain speaking-style descriptors that condition the generation of each new sentence. This work presents Referee, a robust
reference-free CSST approach for expressive TTS, which fully leverages
low-quality data to learn speaking styles from text. Referee is built by
cascading a text-to-style (T2S) model with a style-to-wave (S2W) model.
Phonetic PosteriorGram (PPG), phoneme-level pitch and energy contours are
adopted as fine-grained speaking style descriptors, which are predicted from
text using the T2S model. A novel pretrain-refinement method is adopted to learn a robust T2S model using only readily accessible low-quality data. The S2W model, trained on high-quality data from the target speaker, aggregates the style descriptors and generates high-fidelity speech in the target speaker's voice. Experimental results show that
Referee outperforms a global-style-token (GST)-based baseline approach in CSST.
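The abstract describes a two-stage, cascaded architecture: a text-to-style (T2S) model predicts fine-grained style descriptors (PPG plus phoneme-level pitch and energy contours) from text, and a style-to-wave (S2W) model aggregates those descriptors to generate speech in the target speaker's voice. The sketch below only illustrates that data flow; every module choice, dimension, and name (the GRU encoder, the assumed PPG dimension of 144, the toy upsampling stage standing in for a vocoder) is a hypothetical placeholder rather than the authors' implementation, and the pretrain-refinement procedure is not shown.

```python
# Minimal illustrative sketch of a cascaded T2S -> S2W pipeline (assumed shapes,
# not the paper's actual models).
import torch
import torch.nn as nn


class TextToStyle(nn.Module):
    """T2S sketch: maps phoneme IDs to a PPG sequence plus phoneme-level
    pitch and energy contours (the fine-grained style descriptors)."""

    def __init__(self, n_phones=70, d_model=256, ppg_dim=144):
        super().__init__()
        self.embed = nn.Embedding(n_phones, d_model)
        self.encoder = nn.GRU(d_model, d_model, batch_first=True, bidirectional=True)
        self.ppg_head = nn.Linear(2 * d_model, ppg_dim)   # assumed PPG dimension
        self.pitch_head = nn.Linear(2 * d_model, 1)       # phoneme-level pitch contour
        self.energy_head = nn.Linear(2 * d_model, 1)      # phoneme-level energy contour

    def forward(self, phone_ids):
        h, _ = self.encoder(self.embed(phone_ids))
        return (self.ppg_head(h),
                self.pitch_head(h).squeeze(-1),
                self.energy_head(h).squeeze(-1))


class StyleToWave(nn.Module):
    """S2W sketch: fuses the style descriptors and emits a waveform.
    A real system would use a trained acoustic model and neural vocoder;
    a toy upsampling stack stands in for them here."""

    def __init__(self, ppg_dim=144, d_model=256, hop=256):
        super().__init__()
        self.proj = nn.Linear(ppg_dim + 2, d_model)        # fuse PPG + pitch + energy
        self.upsample = nn.Sequential(
            nn.ConvTranspose1d(d_model, d_model // 2, hop, stride=hop),
            nn.ReLU(),
            nn.Conv1d(d_model // 2, 1, kernel_size=7, padding=3),
            nn.Tanh(),
        )

    def forward(self, ppg, pitch, energy):
        x = torch.cat([ppg, pitch.unsqueeze(-1), energy.unsqueeze(-1)], dim=-1)
        x = self.proj(x).transpose(1, 2)                   # (B, d_model, T)
        return self.upsample(x).squeeze(1)                 # (B, T * hop) waveform


if __name__ == "__main__":
    t2s, s2w = TextToStyle(), StyleToWave()
    phones = torch.randint(0, 70, (1, 12))                 # dummy phoneme sequence
    ppg, f0, energy = t2s(phones)
    wav = s2w(ppg, f0, energy)
    print(wav.shape)                                       # e.g. torch.Size([1, 3072])
```

The toy stack only makes the tensor shapes of the cascade concrete; in the described system the T2S stage is trained on low-quality data via pretrain-refinement, while the S2W stage is trained on high-quality recordings of the target speaker.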
Related papers
- LibriTTS-P: A Corpus with Speaking Style and Speaker Identity Prompts for Text-to-Speech and Style Captioning [12.069474749489897] (2024-06-12)
We introduce LibriTTS-P, a new corpus based on LibriTTS-R that includes utterance-level descriptions (i.e., prompts) of speaking style and speaker-level prompts of speaker characteristics.
Results for style captioning tasks show that the model utilizing LibriTTS-P generates 2.5 times more accurate words than the model using a conventional dataset.
- Speech-to-Speech Translation with Discrete-Unit-Based Style Transfer [53.72998363956454] (2023-09-14)
Direct speech-to-speech translation (S2ST) with discrete self-supervised representations has achieved remarkable accuracy.
The scarcity of high-quality speaker-parallel data poses a challenge for learning style transfer during translation.
We design an S2ST pipeline with style-transfer capability on the basis of discrete self-supervised speech representations and timbre units.
- TextrolSpeech: A Text Style Control Speech Corpus With Codec Language Text-to-Speech Models [51.529485094900934] (2023-08-28)
We propose TextrolSpeech, the first large-scale speech emotion dataset annotated with rich text attributes.
We introduce a multi-stage prompt programming approach that effectively utilizes the GPT model for generating natural style descriptions in large volumes.
To address the need for generating audio with greater style diversity, we propose an efficient architecture called Salle.
- StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models [19.029030168939354] (2023-06-13)
StyleTTS 2 is a text-to-speech (TTS) model that leverages style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level TTS synthesis.
StyleTTS 2 surpasses human recordings on the single-speaker LJSpeech dataset and matches them on the multi-speaker VCTK dataset, as judged by native English speakers.
This work achieves the first human-level TTS on both single- and multi-speaker datasets, showcasing the potential of style diffusion and adversarial training with large SLMs.
- Any-speaker Adaptive Text-To-Speech Synthesis with Diffusion Models [65.28001444321465] (2022-11-17)
Grad-StyleSpeech is an any-speaker adaptive TTS framework based on a diffusion model.
It can generate highly natural speech with extremely high similarity to a target speaker's voice, given a few seconds of reference speech.
It significantly outperforms speaker-adaptive TTS baselines on English benchmarks.
- Guided-TTS 2: A Diffusion Model for High-quality Adaptive Text-to-Speech with Untranscribed Data [25.709370310448328] (2022-05-30)
We propose Guided-TTS 2, a diffusion-based generative model for high-quality adaptive TTS using untranscribed data.
We train the speaker-conditional diffusion model on large-scale untranscribed datasets for a classifier-free guidance method.
We demonstrate that Guided-TTS 2 shows performance comparable to high-quality single-speaker TTS baselines in terms of speech quality and speaker similarity with only ten seconds of untranscribed data.
- GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech Synthesis [68.42632589736881] (2022-05-15)
This paper proposes GenerSpeech, a text-to-speech model towards high-fidelity zero-shot style transfer of OOD custom voice.
GenerSpeech decomposes the speech variation into style-agnostic and style-specific parts by introducing two components.
Our evaluations on zero-shot style transfer demonstrate that GenerSpeech surpasses the state-of-the-art models in terms of audio quality and style similarity.
- Meta-StyleSpeech: Multi-Speaker Adaptive Text-to-Speech Generation [63.561944239071615] (2021-06-06)
StyleSpeech is a new TTS model which synthesizes high-quality speech and adapts to new speakers.
With SALN, our model effectively synthesizes speech in the style of the target speaker even from a single speech sample.
We extend it to Meta-StyleSpeech by introducing two discriminators trained with style prototypes and performing episodic training.
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the accuracy or quality of the information presented and is not responsible for any consequences arising from its use.