Phoneme-Level BERT for Enhanced Prosody of Text-to-Speech with Grapheme Predictions
- URL: http://arxiv.org/abs/2301.08810v1
- Date: Fri, 20 Jan 2023 21:36:16 GMT
- Title: Phoneme-Level BERT for Enhanced Prosody of Text-to-Speech with Grapheme Predictions
- Authors: Yinghao Aaron Li, Cong Han, Xilin Jiang, Nima Mesgarani
- Abstract summary: We propose a phoneme-level BERT (PL-BERT) with a pretext task of predicting the corresponding graphemes along with the regular masked phoneme predictions.
Subjective evaluations show that our phoneme-level BERT encoder has significantly improved the mean opinion scores (MOS) of rated naturalness of synthesized speech.
- Score: 20.03948836281806
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale pre-trained language models have been shown to be helpful in
improving the naturalness of text-to-speech (TTS) models by enabling them to
produce more naturalistic prosodic patterns. However, these models are usually
word-level or sup-phoneme-level and jointly trained with phonemes, making them
inefficient for the downstream TTS task where only phonemes are needed. In this
work, we propose a phoneme-level BERT (PL-BERT) with a pretext task of
predicting the corresponding graphemes along with the regular masked phoneme
predictions. Subjective evaluations show that our phoneme-level BERT encoder
has significantly improved the mean opinion scores (MOS) of rated naturalness
of synthesized speech compared with the state-of-the-art (SOTA) StyleTTS
baseline on out-of-distribution (OOD) texts.
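The pretext setup can be summarized in a short sketch: a transformer encoder over phoneme ids with two output heads, one for masked phoneme prediction and one for predicting the grapheme corresponding to each phoneme. This is a minimal sketch with illustrative vocabulary and model sizes; the class and head names are assumptions, not the authors' implementation.

```python
# Minimal PL-BERT-style pretext sketch: masked phoneme prediction plus
# per-position grapheme prediction. Sizes and names are illustrative.
import torch
import torch.nn as nn

class PLBERTSketch(nn.Module):
    def __init__(self, n_phonemes=178, n_graphemes=30000, d_model=768):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=12)
        self.phoneme_head = nn.Linear(d_model, n_phonemes)    # masked phoneme prediction
        self.grapheme_head = nn.Linear(d_model, n_graphemes)  # grapheme prediction

    def forward(self, phoneme_ids):
        h = self.encoder(self.embed(phoneme_ids))
        return self.phoneme_head(h), self.grapheme_head(h)

model = PLBERTSketch()
phonemes = torch.randint(0, 178, (2, 64))  # batch of (masked) phoneme ids
phoneme_logits, grapheme_logits = model(phonemes)
# The pretext loss would sum cross-entropy over masked positions for both heads.
```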
Related papers
- From Babble to Words: Pre-Training Language Models on Continuous Streams of Phonemes [6.726629754291751]
We develop a pipeline to convert text datasets into a continuous stream of phonemes.
We apply this pipeline to the 100-million-word pre-training dataset from the BabyLM challenge.
Our results show that while phoneme-based training slightly reduces performance on traditional language understanding tasks, it offers valuable analytical and practical benefits.
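As a rough illustration of such a pipeline, the open-source `phonemizer` package can map raw text to a phoneme stream. This is a sketch assuming an installed espeak backend, not the paper's exact pipeline.

```python
# Sketch of a text-to-phoneme-stream step with `phonemizer`
# (pip install phonemizer; requires espeak-ng on the system).
from phonemizer import phonemize

texts = ["The quick brown fox jumps over the lazy dog."]
phoneme_stream = phonemize(
    texts,
    language="en-us",
    backend="espeak",
    strip=True,
)
print(phoneme_stream[0])  # IPA phoneme sequence for the sentence
```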
arXiv Detail & Related papers (2024-10-30T11:05:01Z)
- Cross-Dialect Text-To-Speech in Pitch-Accent Language Incorporating Multi-Dialect Phoneme-Level BERT [29.167336994990542]
Cross-dialect text-to-speech (CD-TTS) is a task to synthesize learned speakers' voices in non-native dialects.
We present a novel TTS model comprising three sub-modules to perform competitively at this task.
arXiv Detail & Related papers (2024-09-11T13:40:27Z)
- Controllable Emphasis with zero data for text-to-speech [57.12383531339368]
A simple but effective method to achieve emphasized speech is to increase the predicted duration of the emphasized word.
We show that this is significantly better than spectrogram modification techniques, improving naturalness by 7.3% and correct testers' identification of the emphasized word in a sentence by 40% on a reference female en-US voice.
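A minimal sketch of the duration-scaling idea follows; the durations tensor, word-span mapping, and the 1.4x factor are illustrative assumptions, not the paper's recipe.

```python
# Emphasize one word by scaling its predicted phoneme durations
# before the duration regulator / vocoder stage (toy example).
import torch

def emphasize(durations: torch.Tensor, word_spans: list, word_idx: int,
              scale: float = 1.4) -> torch.Tensor:
    """Scale the durations of the phonemes belonging to one word."""
    start, end = word_spans[word_idx]
    out = durations.clone()
    out[start:end] = out[start:end] * scale
    return out

durations = torch.tensor([5., 7., 4., 6., 8., 3.])  # frames per phoneme
word_spans = [(0, 2), (2, 4), (4, 6)]               # phoneme range per word
print(emphasize(durations, word_spans, word_idx=1))  # second word lengthened
```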
arXiv Detail & Related papers (2023-07-13T21:06:23Z)
- Any-speaker Adaptive Text-To-Speech Synthesis with Diffusion Models [65.28001444321465]
Grad-StyleSpeech is an any-speaker adaptive TTS framework based on a diffusion model.
It can generate highly natural speech with extremely high similarity to target speakers' voices, given a few seconds of reference speech.
It significantly outperforms speaker-adaptive TTS baselines on English benchmarks.
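As a hedged sketch of the general mechanism only: a diffusion score network can be conditioned on a style vector pooled from a reference mel-spectrogram. All module names and sizes below are illustrative, not Grad-StyleSpeech's architecture.

```python
# Toy score network conditioned on a style vector extracted from a
# few seconds of reference mel-spectrogram (illustrative sketch).
import torch
import torch.nn as nn

class StyleConditionedScoreNet(nn.Module):
    def __init__(self, n_mels=80, d_style=128):
        super().__init__()
        self.ref_encoder = nn.GRU(n_mels, d_style, batch_first=True)
        self.net = nn.Sequential(nn.Linear(n_mels + d_style + 1, 256),
                                 nn.SiLU(), nn.Linear(256, n_mels))

    def forward(self, noisy_mel, t, reference_mel):
        _, h = self.ref_encoder(reference_mel)            # pool reference into a style vector
        style = h[-1].unsqueeze(1).expand(-1, noisy_mel.size(1), -1)
        t_emb = t.view(-1, 1, 1).expand(-1, noisy_mel.size(1), 1)
        return self.net(torch.cat([noisy_mel, style, t_emb], dim=-1))

net = StyleConditionedScoreNet()
score = net(torch.randn(1, 120, 80), torch.rand(1), torch.randn(1, 300, 80))
```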
arXiv Detail & Related papers (2022-11-17T07:17:24Z)
- Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech [88.22544315633687]
Polyphone disambiguation aims to capture accurate pronunciation knowledge from natural text sequences for reliable text-to-speech systems.
We propose Dict-TTS, a semantic-aware generative text-to-speech model with an online website dictionary.
Experimental results in three languages show that our model outperforms several strong baseline models in terms of pronunciation accuracy.
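A toy sketch of dictionary-based polyphone disambiguation: choose the pronunciation whose dictionary gloss best matches the sentence context. Dict-TTS learns this matching with semantic attention; the crude word-overlap score below is purely illustrative.

```python
# Pick a polyphone's pronunciation by matching context words against
# the glosses of its dictionary entries (toy stand-in for Dict-TTS).
DICT = {
    "bass": [
        {"pron": "B AE1 S", "gloss": "a freshwater or marine fish"},
        {"pron": "B EY1 S", "gloss": "a low musical pitch or instrument"},
    ]
}

def disambiguate(word: str, context: str) -> str:
    ctx = set(context.lower().split())
    entries = DICT[word]
    scores = [len(ctx & set(e["gloss"].split())) for e in entries]
    return entries[scores.index(max(scores))]["pron"]

print(disambiguate("bass", "he played a bass solo on his instrument"))  # B EY1 S
```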
arXiv Detail & Related papers (2022-06-05T10:50:34Z)
- Mixed-Phoneme BERT: Improving BERT with Mixed Phoneme and Sup-Phoneme Representations for Text to Speech [104.65639892109381]
We propose Mixed-Phoneme BERT, a novel variant of the BERT model that uses mixed phoneme and sup-phoneme representations to enhance the learning capability.
Experiment results demonstrate that our proposed Mixed-Phoneme BERT significantly improves the TTS performance with 0.30 CMOS gain compared with the FastSpeech 2 baseline.
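A minimal sketch of the mixed representation, assuming each position sums its phoneme embedding with the embedding of the sup-phoneme (BPE-like phoneme chunk) covering it; vocabulary sizes and the alignment mapping are illustrative.

```python
# Mixed phoneme + sup-phoneme input representation (illustrative sketch):
# the sup-phoneme id is repeated across the phonemes it spans.
import torch
import torch.nn as nn

n_phonemes, n_sup, d = 178, 5000, 768
phoneme_emb = nn.Embedding(n_phonemes, d)
sup_emb = nn.Embedding(n_sup, d)

phoneme_ids = torch.randint(0, n_phonemes, (1, 12))
sup_ids = torch.randint(0, n_sup, (1, 12))            # one sup-phoneme id per position
mixed = phoneme_emb(phoneme_ids) + sup_emb(sup_ids)   # input to the BERT encoder
```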
arXiv Detail & Related papers (2022-03-31T17:12:26Z)
- ProsoSpeech: Enhancing Prosody With Quantized Vector Pre-training in Text-to-Speech [96.0009517132463]
We introduce a word-level prosody encoder, which quantizes the low-frequency band of the speech and compresses prosody attributes into a latent prosody vector (LPV).
We then introduce an LPV predictor, which predicts the LPV given the word sequence, and fine-tune it on a high-quality TTS dataset.
Experimental results show that ProsoSpeech can generate speech with richer prosody compared with baseline methods.
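A rough sketch of the quantization step, using plain nearest-neighbour vector quantization against a learned codebook; sizes and names are assumptions, not ProsoSpeech's implementation.

```python
# Quantize word-level prosody features against a codebook to get LPVs.
import torch
import torch.nn as nn

class LPVQuantizer(nn.Module):
    def __init__(self, codebook_size=256, d=64):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, d)

    def forward(self, prosody):  # prosody: (batch, words, d)
        codes = self.codebook.weight.unsqueeze(0).expand(prosody.size(0), -1, -1)
        dists = torch.cdist(prosody, codes)   # distance to every codebook entry
        idx = dists.argmin(dim=-1)            # nearest code per word
        return self.codebook(idx), idx        # quantized LPVs and code indices

vq = LPVQuantizer()
lpv, codes = vq(torch.randn(2, 10, 64))
# An "LPV predictor" would then model `codes` autoregressively given the words.
```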
arXiv Detail & Related papers (2022-02-16T01:42:32Z)
- Guided-TTS: Text-to-Speech with Untranscribed Speech [22.548875263927396]
We present Guided-TTS, a high-quality TTS model that learns to generate speech from untranscribed speech data.
For text-to-speech synthesis, we guide the generative process of the unconditional DDPM via phoneme classification to produce mel-spectrograms.
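A hedged sketch of classifier-guided sampling: the unconditional score is shifted by the gradient of a phoneme classifier's log-probability with respect to the noisy input. The stand-in models below are placeholders, not Guided-TTS itself.

```python
# Classifier guidance for one reverse-diffusion step (illustrative sketch).
import torch

def guided_step(x, t, uncond_score_fn, phoneme_classifier, target, scale=1.0):
    x = x.detach().requires_grad_(True)
    log_probs = phoneme_classifier(x, t).log_softmax(dim=-1)
    # log p(target phonemes | noisy mel), summed over frames
    sel = log_probs.gather(-1, target.unsqueeze(-1)).sum()
    grad = torch.autograd.grad(sel, x)[0]
    return uncond_score_fn(x, t) + scale * grad  # guided score

score = guided_step(
    torch.randn(1, 50, 80), torch.rand(1),
    uncond_score_fn=lambda x, t: -x,          # dummy unconditional score
    phoneme_classifier=lambda x, t: x[..., :40],  # stand-in frame classifier
    target=torch.randint(0, 40, (1, 50)),
)
```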
arXiv Detail & Related papers (2021-11-23T10:05:05Z)
- PnG BERT: Augmented BERT on Phonemes and Graphemes for Neural TTS [27.20479869682578]
PnG BERT is a new encoder model for neural TTS.
It can be pre-trained on a large text corpus in a self-supervised manner.
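A sketch of the input layout the paper describes: a single sequence carrying a sentence's phonemes followed by its graphemes, with segment ids and shared word-level positions. Token and id values here are toy examples; PnG BERT's exact scheme differs in detail.

```python
# PnG BERT-style input: phonemes then graphemes in one sequence,
# with segment ids and word-level alignment positions (toy values).
phonemes = ["DH", "AH0", "K", "AE1", "T"]  # "the cat"
graphemes = ["the", "cat"]
tokens = ["[CLS]"] + phonemes + ["[SEP]"] + graphemes + ["[SEP]"]
segments = [0] * (len(phonemes) + 2) + [1] * (len(graphemes) + 1)
# word positions shared across segments: "the" -> 1, "cat" -> 2 (specials -> 0)
word_pos = [0, 1, 1, 2, 2, 2, 0, 1, 2, 0]
assert len(tokens) == len(segments) == len(word_pos)
```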
arXiv Detail & Related papers (2021-03-28T06:24:00Z)
- Improving Text Generation with Student-Forcing Optimal Transport [122.11881937642401]
We propose using optimal transport (OT) to match the sequences generated in training and testing modes.
An extension is also proposed to improve the OT learning, based on the structural and contextual information of the text sequences.
The effectiveness of the proposed method is validated on machine translation, text summarization, and text generation tasks.
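A minimal sketch of the matching idea using entropy-regularized (Sinkhorn) OT between the embeddings of a teacher-forced sequence and a free-running one; the cosine cost and hyperparameters are illustrative, not the paper's exact formulation.

```python
# Sinkhorn OT loss between two embedded sequences (illustrative sketch).
import torch

def sinkhorn_ot(x, y, eps=0.1, iters=50):
    """x: (n, d) teacher-forced embeddings, y: (m, d) generated embeddings."""
    cost = 1 - torch.nn.functional.cosine_similarity(
        x.unsqueeze(1), y.unsqueeze(0), dim=-1)       # (n, m) cosine cost
    k = torch.exp(-cost / eps)
    u = torch.ones(x.size(0)) / x.size(0)             # uniform marginals
    v = torch.ones(y.size(0)) / y.size(0)
    a, b = u.clone(), v.clone()
    for _ in range(iters):                            # Sinkhorn iterations
        a = u / (k @ b)
        b = v / (k.t() @ a)
    plan = a.unsqueeze(1) * k * b.unsqueeze(0)        # transport plan
    return (plan * cost).sum()                        # OT loss

loss = sinkhorn_ot(torch.randn(12, 64), torch.randn(10, 64))
```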
arXiv Detail & Related papers (2020-10-12T19:42:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences arising from its use.