Mixed-Phoneme BERT: Improving BERT with Mixed Phoneme and Sup-Phoneme
Representations for Text to Speech
- URL: http://arxiv.org/abs/2203.17190v1
- Date: Thu, 31 Mar 2022 17:12:26 GMT
- Title: Mixed-Phoneme BERT: Improving BERT with Mixed Phoneme and Sup-Phoneme
Representations for Text to Speech
- Authors: Guangyan Zhang, Kaitao Song, Xu Tan, Daxin Tan, Yuzi Yan, Yanqing Liu,
Gang Wang, Wei Zhou, Tao Qin, Tan Lee, Sheng Zhao
- Abstract summary: We propose Mixed-Phoneme BERT, a novel variant of the BERT model that uses mixed phoneme and sup-phoneme representations to enhance the learning capability.
Experimental results demonstrate that our proposed Mixed-Phoneme BERT significantly improves TTS performance, with a 0.30 CMOS gain over the FastSpeech 2 baseline.
- Score: 104.65639892109381
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, leveraging BERT pre-training to improve the phoneme encoder in text to speech (TTS) has drawn increasing attention. However, these works apply pre-training with character-based units to enhance the TTS phoneme encoder, which is inconsistent with the TTS fine-tuning that takes phonemes as input. Pre-training only with phonemes as input can alleviate the input mismatch, but it lacks the ability to model rich representations and semantic information due to the limited phoneme vocabulary. In this paper, we propose Mixed-Phoneme BERT, a novel variant of the BERT model that uses mixed phoneme and sup-phoneme representations to enhance the learning capability. Specifically, we merge adjacent phonemes into sup-phonemes and combine the phoneme sequence and the merged sup-phoneme sequence as the model input, which can enhance the model capacity to learn rich contextual representations. Experimental results demonstrate that our proposed Mixed-Phoneme BERT significantly improves TTS performance, with a 0.30 CMOS gain over the FastSpeech 2 baseline. Mixed-Phoneme BERT achieves a 3x inference speedup and voice quality similar to the previous TTS pre-trained model PnG BERT.
Related papers
- Cross-Dialect Text-To-Speech in Pitch-Accent Language Incorporating Multi-Dialect Phoneme-Level BERT [29.167336994990542]
Cross-dialect text-to-speech (CD-TTS) is the task of synthesizing a learned speaker's voice in non-native dialects.
We present a novel TTS model comprising three sub-modules that performs competitively at this task.
arXiv Detail & Related papers (2024-09-11T13:40:27Z)
- Controllable Emphasis with zero data for text-to-speech [57.12383531339368]
A simple but effective method to achieve emphasis is to increase the predicted duration of the emphasised word.
We show that this significantly outperforms spectrogram-modification techniques, improving naturalness by 7.3% and testers' correct identification of the emphasised word in a sentence by 40% on a reference female en-US voice.
arXiv Detail & Related papers (2023-07-13T21:06:23Z)
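The duration-scaling idea above is straightforward to sketch for a FastSpeech-style pipeline: stretch the predicted durations of the emphasised word's phonemes before the length regulator. The scale factor, mask construction, and function name below are illustrative assumptions, not the paper's exact recipe.

```python
# Hedged sketch: emphasis by duration scaling in a FastSpeech-style model.
import torch

def emphasise(durations: torch.Tensor, word_mask: torch.Tensor,
              scale: float = 1.25) -> torch.Tensor:
    """durations: (T,) predicted frames per phoneme (before rounding).
    word_mask: (T,) 1.0 for phonemes of the emphasised word, else 0.0.
    Returns durations with the emphasised span stretched by `scale`."""
    return durations * (1.0 + (scale - 1.0) * word_mask)

durations = torch.tensor([4.0, 6.0, 5.0, 7.0, 3.0])
word_mask = torch.tensor([0.0, 1.0, 1.0, 0.0, 0.0])  # emphasise phonemes 1-2
print(emphasise(durations, word_mask))  # -> [4.0, 7.5, 6.25, 7.0, 3.0]
```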
- XPhoneBERT: A Pre-trained Multilingual Model for Phoneme Representations for Text-to-Speech [15.254598796939922]
We present XPhoneBERT, the first multilingual model pre-trained to learn phoneme representations for the downstream text-to-speech (TTS) task.
Our XPhoneBERT has the same model architecture as BERT-base and is trained with the RoBERTa pre-training approach on 330M phoneme-level sentences from nearly 100 languages and locales.
arXiv Detail & Related papers (2023-05-31T10:05:33Z)
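Since XPhoneBERT follows the BERT-base architecture, it can in principle serve as a drop-in phoneme encoder. A minimal sketch with Hugging Face transformers follows; the checkpoint name reflects the authors' public release, and the input is assumed to be an already-phonemized, space-separated string (the authors use a separate text-to-phoneme front end for that step).

```python
# Hedged sketch: using XPhoneBERT as a phoneme encoder via transformers.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/xphonebert-base")
model = AutoModel.from_pretrained("vinai/xphonebert-base")

phonemes = "h ə l oʊ w ɜː l d"  # placeholder phoneme sequence
inputs = tokenizer(phonemes, return_tensors="pt")
hidden = model(**inputs).last_hidden_state  # (1, seq_len, 768)
```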
- Phoneme-Level BERT for Enhanced Prosody of Text-to-Speech with Grapheme Predictions [20.03948836281806]
We propose a phoneme-level BERT (PL-BERT) with a pretext task of predicting the corresponding graphemes along with the regular masked phoneme predictions.
Subjective evaluations show that our phoneme-level BERT encoder significantly improves the mean opinion score (MOS) for naturalness of synthesized speech.
arXiv Detail & Related papers (2023-01-20T21:36:16Z)
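A minimal sketch of the pretext objective described above: one shared encoder with two classification heads, one recovering masked phonemes and one predicting the grapheme aligned to each phoneme position. The tiny encoder and all sizes are illustrative, not the paper's configuration.

```python
# Hedged sketch of a PL-BERT-style dual-objective pre-training head.
import torch
import torch.nn as nn

class PLBertSketch(nn.Module):
    def __init__(self, n_phonemes=128, n_graphemes=8000, dim=256):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True),
            num_layers=2)
        self.phoneme_head = nn.Linear(dim, n_phonemes)    # masked phoneme LM
        self.grapheme_head = nn.Linear(dim, n_graphemes)  # grapheme prediction

    def forward(self, phoneme_ids):           # (B, T) long tensor
        h = self.encoder(self.embed(phoneme_ids))
        return self.phoneme_head(h), self.grapheme_head(h)

# Training would sum a masked-phoneme cross-entropy with a per-position
# grapheme cross-entropy, following the pretext task the entry describes.
```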
- MMSpeech: Multi-modal Multi-task Encoder-Decoder Pre-training for Speech Recognition [75.12948999653338]
We propose a novel multi-task encoder-decoder pre-training framework (MMSpeech) for Mandarin automatic speech recognition (ASR).
We employ a multi-task learning framework including five self-supervised and supervised tasks with speech and text data.
Experiments on AISHELL-1 show that our proposed method achieves state-of-the-art performance, with a more than 40% relative improvement compared with other pre-training methods.
arXiv Detail & Related papers (2022-11-29T13:16:09Z)
- Few-Shot Cross-Lingual TTS Using Transferable Phoneme Embedding [55.989376102986654]
This paper studies a transferable phoneme embedding framework that aims to deal with the cross-lingual text-to-speech problem under the few-shot setting.
We propose a framework that consists of a phoneme-based TTS model and a codebook module to project phonemes from different languages into a learned latent space.
arXiv Detail & Related papers (2022-06-27T11:24:40Z)
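A hedged sketch of the codebook idea above: attend from phoneme embeddings to a small shared codebook so that phonemes from different languages are projected into one latent space. The soft-attention formulation and all sizes are assumptions for illustration.

```python
# Hedged sketch: projecting phonemes into a shared latent space via a codebook.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PhonemeCodebook(nn.Module):
    def __init__(self, n_codes=64, dim=256):
        super().__init__()
        self.codes = nn.Parameter(torch.randn(n_codes, dim))

    def forward(self, phoneme_emb):            # (B, T, dim)
        attn = phoneme_emb @ self.codes.T      # similarity to each code
        weights = F.softmax(attn, dim=-1)      # soft assignment over codes
        return weights @ self.codes            # language-agnostic embedding
```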
- Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech [88.22544315633687]
Polyphone disambiguation aims to capture accurate pronunciation knowledge from natural text sequences for reliable text-to-speech systems.
We propose Dict-TTS, a semantic-aware generative text-to-speech model with an online website dictionary.
Experimental results in three languages show that our model outperforms several strong baseline models in terms of pronunciation accuracy.
arXiv Detail & Related papers (2022-06-05T10:50:34Z)
- PnG BERT: Augmented BERT on Phonemes and Graphemes for Neural TTS [27.20479869682578]
PnG BERT is a new encoder model for neural TTS.
It can be pre-trained on a large text corpus in a self-supervised manner.
arXiv Detail & Related papers (2021-03-28T06:24:00Z)
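For contrast with Mixed-Phoneme BERT's aligned-sum input, PnG BERT concatenates the phoneme and grapheme sequences of the same sentence as two segments of a single BERT input, per its paper. A minimal sketch with placeholder token ids:

```python
# Hedged sketch of a PnG-BERT-style two-segment input; ids are placeholders.
CLS, SEP = 0, 1

def png_input(phoneme_ids, grapheme_ids):
    """Concatenate phonemes and graphemes as two BERT segments."""
    tokens = [CLS] + phoneme_ids + [SEP] + grapheme_ids + [SEP]
    segments = [0] * (len(phoneme_ids) + 2) + [1] * (len(grapheme_ids) + 1)
    return tokens, segments

tokens, segments = png_input([17, 42, 9], [301, 302])
# tokens:   [CLS, p1, p2, p3, SEP, g1, g2, SEP]
# segments: [0,   0,  0,  0,  0,   1,  1,  1 ]
```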
- Learning Speaker Embedding from Text-to-Speech [59.80309164404974]
We jointly trained end-to-end Tacotron 2 TTS and speaker embedding networks in a self-supervised fashion.
We investigated training TTS from either manual or ASR-generated transcripts.
Unsupervised TTS embeddings improved EER by 2.06% absolute over i-vectors on the LibriTTS dataset.
arXiv Detail & Related papers (2020-10-21T18:03:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.