PnG BERT: Augmented BERT on Phonemes and Graphemes for Neural TTS
- URL: http://arxiv.org/abs/2103.15060v1
- Date: Sun, 28 Mar 2021 06:24:00 GMT
- Title: PnG BERT: Augmented BERT on Phonemes and Graphemes for Neural TTS
- Authors: Ye Jia, Heiga Zen, Jonathan Shen, Yu Zhang, Yonghui Wu
- Abstract summary: PnG BERT is a new encoder model for neural TTS.
It can be pre-trained on a large text corpus in a self-supervised manner.
- Score: 27.20479869682578
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces PnG BERT, a new encoder model for neural TTS. This
model is augmented from the original BERT model, by taking both phoneme and
grapheme representations of text as input, as well as the word-level alignment
between them. It can be pre-trained on a large text corpus in a self-supervised
manner, and fine-tuned in a TTS task. Experimental results show that a neural
TTS model using a pre-trained PnG BERT as its encoder yields more natural
prosody and more accurate pronunciation than a baseline model using only
phoneme input with no pre-training. Subjective side-by-side preference
evaluations show that raters have no statistically significant preference
between the speech synthesized using a PnG BERT and ground truth recordings
from professional speakers.
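As a rough sketch of the input format described in the abstract, the Python snippet below assembles one token sequence from phoneme and grapheme sub-tokens, using segment IDs to mark the two sub-sequences and shared word indices to carry the word-level alignment. The special tokens and ID layout are assumptions made for illustration, not details taken from the paper.

```python
# Minimal sketch of assembling a PnG-BERT-style input sequence.
# The exact token layout, special tokens, and ID names are assumptions
# based only on the abstract: phonemes + graphemes + word-level alignment.

def build_png_input(phonemes_per_word, graphemes_per_word):
    """Each argument is a list of token lists, one entry per word."""
    tokens, segment_ids, word_ids = ["[CLS]"], [0], [0]

    # Phoneme sub-sequence (segment 0), aligned to words via word_ids.
    for w_idx, phs in enumerate(phonemes_per_word, start=1):
        for p in phs:
            tokens.append(p)
            segment_ids.append(0)
            word_ids.append(w_idx)
    tokens.append("[SEP]"); segment_ids.append(0); word_ids.append(0)

    # Grapheme sub-sequence (segment 1); the same word index links the
    # grapheme tokens of a word to its phoneme tokens above.
    for w_idx, gs in enumerate(graphemes_per_word, start=1):
        for g in gs:
            tokens.append(g)
            segment_ids.append(1)
            word_ids.append(w_idx)
    tokens.append("[SEP]"); segment_ids.append(1); word_ids.append(0)

    return tokens, segment_ids, word_ids


if __name__ == "__main__":
    toks, segs, wids = build_png_input(
        phonemes_per_word=[["HH", "AH", "L", "OW"], ["W", "ER", "L", "D"]],
        graphemes_per_word=[["hel", "lo"], ["world"]],
    )
    for t, s, w in zip(toks, segs, wids):
        print(f"{t:>6}  segment={s}  word={w}")
```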
Related papers
- Unsupervised Pre-Training For Data-Efficient Text-to-Speech On Low Resource Languages [15.32264927462068]
We propose an unsupervised pre-training method for a sequence-to-sequence TTS model by leveraging large untranscribed speech data.
The main idea is to pre-train the model to reconstruct de-warped mel-spectrograms from warped ones.
We empirically demonstrate the effectiveness of our proposed method in low-resource language scenarios.
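The summary does not say how the warping is performed; the sketch below shows one plausible variant, a random monotonic time-warp implemented with NumPy, where the original (un-warped) mel-spectrogram would serve as the reconstruction target.

```python
import numpy as np

def random_time_warp(mel, max_stretch=0.2, rng=None):
    """Warp a mel-spectrogram (time x n_mels) along the time axis.

    Illustrative warping only; the paper's actual warping procedure is
    not described in the summary above.
    """
    rng = rng or np.random.default_rng()
    n_frames, n_mels = mel.shape
    # Random monotonic mapping from warped frame index to source frame index.
    steps = 1.0 + rng.uniform(-max_stretch, max_stretch, size=n_frames - 1)
    src_pos = np.concatenate([[0.0], np.cumsum(steps)])
    src_pos = src_pos / src_pos[-1] * (n_frames - 1)   # keep endpoints fixed
    # Linear interpolation per mel channel.
    warped = np.stack(
        [np.interp(src_pos, np.arange(n_frames), mel[:, m]) for m in range(n_mels)],
        axis=1,
    )
    return warped  # model input; the un-warped `mel` is the reconstruction target


mel = np.random.rand(200, 80).astype(np.float32)   # dummy spectrogram
warped = random_time_warp(mel)
assert warped.shape == mel.shape
```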
arXiv Detail & Related papers (2023-03-28T01:26:00Z)
- Phoneme-Level BERT for Enhanced Prosody of Text-to-Speech with Grapheme Predictions [20.03948836281806]
We propose a phoneme-level BERT (PL-BERT) with a pretext task of predicting the corresponding graphemes along with the regular masked phoneme predictions.
Subjective evaluations show that our phoneme-level BERT encoder has significantly improved the mean opinion scores (MOS) of rated naturalness of synthesized speech.
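A minimal PyTorch sketch of the two-head setup implied above, masked phoneme prediction plus per-position grapheme prediction from a shared encoder, is given below; the tiny encoder, vocabulary sizes, and equal loss weighting are placeholder choices, not the paper's configuration.

```python
import torch
import torch.nn as nn

class PhonemeGraphemeHeads(nn.Module):
    """Two prediction heads over a shared phoneme encoder, sketching the
    PL-BERT-style pretext tasks described above. The tiny Transformer
    encoder here is a stand-in, not the paper's actual architecture."""

    def __init__(self, n_phonemes=100, n_graphemes=5000, d_model=256):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.phoneme_head = nn.Linear(d_model, n_phonemes)   # masked phoneme prediction
        self.grapheme_head = nn.Linear(d_model, n_graphemes) # grapheme prediction

    def forward(self, phoneme_ids):
        h = self.encoder(self.embed(phoneme_ids))
        return self.phoneme_head(h), self.grapheme_head(h)


model = PhonemeGraphemeHeads()
phonemes = torch.randint(0, 100, (2, 16))        # batch of (masked) phoneme IDs
phoneme_targets = torch.randint(0, 100, (2, 16))
grapheme_targets = torch.randint(0, 5000, (2, 16))
p_logits, g_logits = model(phonemes)
loss = (nn.functional.cross_entropy(p_logits.transpose(1, 2), phoneme_targets)
        + nn.functional.cross_entropy(g_logits.transpose(1, 2), grapheme_targets))
loss.backward()
```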
arXiv Detail & Related papers (2023-01-20T21:36:16Z)
- Thutmose Tagger: Single-pass neural model for Inverse Text Normalization [76.87664008338317]
Inverse text normalization (ITN) is an essential post-processing step in automatic speech recognition.
We present a dataset preparation method based on the granular alignment of ITN examples.
One-to-one correspondence between tags and input words improves the interpretability of the model's predictions.
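A toy illustration of the tag-per-word idea is shown below; the tag inventory and the example sentence are invented for this sketch and are not the Thutmose Tagger's actual tag set.

```python
# Toy illustration of single-pass, tag-per-input-word ITN.
# The tag inventory ("<SELF>", "<DELETE>", literal replacements) is
# invented for this sketch; it is not the paper's actual tag set.

def apply_itn_tags(spoken_words, tags):
    """Each input word gets exactly one tag, which keeps the mapping
    from predictions to output easy to inspect."""
    assert len(spoken_words) == len(tags)
    out = []
    for word, tag in zip(spoken_words, tags):
        if tag == "<SELF>":        # copy the spoken word unchanged
            out.append(word)
        elif tag == "<DELETE>":    # drop the word (merged into a neighbour)
            continue
        else:                      # literal written-form fragment
            out.append(tag)
    return " ".join(out)


spoken = ["on", "may", "third", "twenty", "twenty", "three"]
tags   = ["<SELF>", "<SELF>", "3", "2023", "<DELETE>", "<DELETE>"]
print(apply_itn_tags(spoken, tags))  # -> "on may 3 2023"
```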
arXiv Detail & Related papers (2022-07-29T20:39:02Z)
- The Topological BERT: Transforming Attention into Topology for Natural Language Processing [0.0]
This paper introduces a text classifier using topological data analysis.
We use BERT's attention maps transformed into attention graphs as the only input to that classifier.
The model can solve tasks such as distinguishing spam from ham messages, recognizing whether a sentence is grammatically correct, or evaluating a movie review as negative or positive.
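As a small illustration of turning an attention map into a graph, the sketch below keeps edges whose attention weight exceeds a threshold; the threshold and the use of networkx are choices made for this example, since the summary does not spell out the exact construction.

```python
import numpy as np
import networkx as nx

def attention_to_graph(attn, threshold=0.1):
    """Build a directed attention graph from one head's attention map.

    attn: (seq_len, seq_len) matrix where attn[i, j] is the weight token i
    places on token j. Edges are kept only above `threshold`; both the
    threshold and this construction are illustrative choices.
    """
    g = nx.DiGraph()
    g.add_nodes_from(range(attn.shape[0]))
    for i, j in zip(*np.where(attn > threshold)):
        g.add_edge(int(i), int(j), weight=float(attn[i, j]))
    return g


# Dummy attention map (rows sum to 1, as softmax outputs would).
rng = np.random.default_rng(0)
attn = rng.random((8, 8))
attn = attn / attn.sum(axis=1, keepdims=True)
graph = attention_to_graph(attn, threshold=0.15)
print(graph.number_of_nodes(), graph.number_of_edges())
```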
arXiv Detail & Related papers (2022-06-30T11:25:31Z)
- Automatic Prosody Annotation with Pre-Trained Text-Speech Model [48.47706377700962]
We propose to automatically extract prosodic boundary labels from text-audio data via a neural text-speech model with pre-trained audio encoders.
This model is pre-trained on text and speech data separately, then jointly fine-tuned on TTS data in a triplet format: (speech, text, prosody).
arXiv Detail & Related papers (2022-06-16T06:54:16Z)
- Neural Grapheme-to-Phoneme Conversion with Pre-trained Grapheme Models [35.60380484684335]
This paper proposes a pre-trained grapheme model called grapheme BERT (GBERT).
GBERT is built by self-supervised training on a large, language-specific word list with only grapheme information.
Two approaches are developed to incorporate GBERT into the state-of-the-art Transformer-based G2P model.
arXiv Detail & Related papers (2022-01-26T02:49:56Z)
- A study on the efficacy of model pre-training in developing neural text-to-speech system [55.947807261757056]
This study aims to understand better why and how model pre-training can positively contribute to TTS system performance.
It is found that the TTS system could achieve comparable performance when the pre-training data is reduced to 1/8 of its original size.
arXiv Detail & Related papers (2021-10-08T02:09:28Z)
- Unified Mandarin TTS Front-end Based on Distilled BERT Model [5.103126953298633]
A model based on a pre-trained language model (PLM) is proposed to tackle the two most important tasks in the TTS front-end.
We use a pre-trained Chinese BERT as the text encoder and employ a multi-task learning technique to adapt it to the two TTS front-end tasks.
This allows the whole TTS front-end module to run in a light and unified manner, which is friendlier for deployment on mobile devices.
arXiv Detail & Related papers (2020-12-31T02:34:57Z)
- GraphSpeech: Syntax-Aware Graph Attention Network For Neural Speech Synthesis [79.1885389845874]
Transformer-based end-to-end text-to-speech synthesis (TTS) is one such successful implementation.
We propose a novel neural TTS model, denoted as GraphSpeech, that is formulated under graph neural network framework.
Experiments show that GraphSpeech consistently outperforms the Transformer TTS baseline in terms of spectrum and prosody rendering of utterances.
arXiv Detail & Related papers (2020-10-23T14:14:06Z)
- Improving Text Generation with Student-Forcing Optimal Transport [122.11881937642401]
We propose using optimal transport (OT) to match the sequences generated in training and testing modes.
An extension is also proposed to improve the OT learning, based on the structural and contextual information of the text sequences.
The effectiveness of the proposed method is validated on machine translation, text summarization, and text generation tasks.
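A compact NumPy sketch of the underlying idea is shown below: an entropically regularized OT (Sinkhorn) cost between embeddings of a teacher-forced sequence and a free-running one. This is a generic Sinkhorn routine for illustration, not the paper's exact objective or its integration into training.

```python
import numpy as np

def sinkhorn_ot_cost(x, y, epsilon=0.1, n_iters=100):
    """Entropically regularized OT cost between two sets of embeddings.

    x: (n, d) embeddings of the teacher-forced (training-mode) sequence,
    y: (m, d) embeddings of the free-running (testing-mode) sequence.
    Generic Sinkhorn iteration for illustration only.
    """
    # Pairwise cosine distance as the transport cost.
    xn = x / np.linalg.norm(x, axis=1, keepdims=True)
    yn = y / np.linalg.norm(y, axis=1, keepdims=True)
    cost = 1.0 - xn @ yn.T                            # (n, m)

    n, m = cost.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)   # uniform marginals
    K = np.exp(-cost / epsilon)
    u = np.ones(n)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    transport = np.diag(u) @ K @ np.diag(v)           # approximate OT plan
    return float(np.sum(transport * cost))


teacher_forced = np.random.randn(12, 64)
free_running = np.random.randn(15, 64)
print(sinkhorn_ot_cost(teacher_forced, free_running))
```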
arXiv Detail & Related papers (2020-10-12T19:42:25Z)
- Incorporating BERT into Neural Machine Translation [251.54280200353674]
We propose a new algorithm named BERT-fused model, in which we first use BERT to extract representations for an input sequence.
We conduct experiments on supervised (including sentence-level and document-level translations), semi-supervised and unsupervised machine translation, and achieve state-of-the-art results on seven benchmark datasets.
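The sketch below illustrates the fusion idea with a single encoder layer that attends both to its own states and, via a separate cross-attention, to precomputed BERT representations; the layer sizes and the way the two branches are combined are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class BertFusedLayer(nn.Module):
    """Sketch of the fusion idea: an encoder layer that attends to its own
    states (self-attention) and to externally computed BERT representations
    (cross-attention). Sizes and the combination rule are illustrative."""

    def __init__(self, d_model=512, d_bert=768, nhead=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.bert_attn = nn.MultiheadAttention(d_model, nhead,
                                               kdim=d_bert, vdim=d_bert,
                                               batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, bert_repr):
        h_self, _ = self.self_attn(x, x, x)
        h_bert, _ = self.bert_attn(x, bert_repr, bert_repr)
        # Average the two attention branches, then residual + layer norm.
        return self.norm(x + 0.5 * (h_self + h_bert))


layer = BertFusedLayer()
src_states = torch.randn(2, 20, 512)    # NMT encoder states (batch, len, d_model)
bert_states = torch.randn(2, 24, 768)   # precomputed BERT representations
out = layer(src_states, bert_states)
print(out.shape)  # torch.Size([2, 20, 512])
```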
arXiv Detail & Related papers (2020-02-17T08:13:36Z)