Unified Mandarin TTS Front-end Based on Distilled BERT Model
- URL: http://arxiv.org/abs/2012.15404v1
- Date: Thu, 31 Dec 2020 02:34:57 GMT
- Title: Unified Mandarin TTS Front-end Based on Distilled BERT Model
- Authors: Yang Zhang, Liqun Deng, Yasheng Wang
- Abstract summary: A pre-trained language model (PLM) based model is proposed to tackle the two most important tasks in TTS front-end.
We use a pre-trained Chinese BERT as the text encoder and employ a multi-task learning technique to adapt it to the two TTS front-end tasks.
We are able to run the whole TTS front-end module in a light and unified manner, which is more friendly to deployment on mobile devices.
- Score: 5.103126953298633
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The front-end module in a typical Mandarin text-to-speech system (TTS) is
composed of a long pipeline of text processing components, which requires
extensive efforts to build and is prone to large accumulative model size and
cascade errors. In this paper, a pre-trained language model (PLM) based model
is proposed to simultaneously tackle the two most important tasks in TTS
front-end, i.e., prosodic structure prediction (PSP) and grapheme-to-phoneme
(G2P) conversion. We use a pre-trained Chinese BERT[1] as the text encoder and
employ a multi-task learning technique to adapt it to the two TTS front-end
tasks. Then, the BERT encoder is distilled into a smaller model by employing a
knowledge distillation technique called TinyBERT[2], making the whole model
size 25% of that of benchmark pipeline models while maintaining competitive
performance on both tasks. With the proposed methods, we are able to run
the whole TTS front-end module in a light and unified manner, which is more
friendly to deployment on mobile devices.
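A minimal sketch of the shared-encoder, multi-task setup described in the abstract, assuming PyTorch and the HuggingFace bert-base-chinese checkpoint. The label inventories, head design, and loss weighting below are illustrative assumptions rather than the paper's exact formulation, and the TinyBERT distillation stage (fitting a smaller student encoder to the fine-tuned teacher) is omitted.

```python
# Hedged sketch: one pre-trained Chinese BERT encoder shared by two
# token-level heads, one for prosodic structure prediction (PSP) and one
# for polyphone disambiguation in G2P. Label set sizes are assumptions.
import torch
import torch.nn as nn
from transformers import BertModel

class UnifiedTTSFrontend(nn.Module):
    def __init__(self, num_psp_labels: int = 4, num_pinyin_labels: int = 1500):
        super().__init__()
        # Shared pre-trained encoder (the teacher before distillation).
        self.encoder = BertModel.from_pretrained("bert-base-chinese")
        hidden = self.encoder.config.hidden_size
        # Task-specific classification heads on the shared representation.
        self.psp_head = nn.Linear(hidden, num_psp_labels)     # prosodic boundary tags
        self.g2p_head = nn.Linear(hidden, num_pinyin_labels)  # pinyin classes for polyphones

    def forward(self, input_ids, attention_mask):
        hidden_states = self.encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state                                    # (batch, seq_len, hidden)
        return self.psp_head(hidden_states), self.g2p_head(hidden_states)

def multitask_loss(psp_logits, g2p_logits, psp_labels, g2p_labels, alpha=0.5):
    # Weighted sum of the two task losses; positions without a label
    # (e.g. non-polyphonic characters for G2P) are masked with -100.
    ce = nn.CrossEntropyLoss(ignore_index=-100)
    psp_loss = ce(psp_logits.transpose(1, 2), psp_labels)
    g2p_loss = ce(g2p_logits.transpose(1, 2), g2p_labels)
    return alpha * psp_loss + (1.0 - alpha) * g2p_loss
```

In this sketch both tasks are framed as token-level classification over the same BERT hidden states, so a single forward pass serves PSP and G2P; a TinyBERT-style distillation stage would then compress only the shared encoder, leaving the lightweight heads unchanged.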
Related papers
- SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models [64.40250409933752]
We build upon our previous publication by implementing a simple and efficient non-autoregressive (NAR) TTS framework, termed SimpleSpeech 2.
SimpleSpeech 2 effectively combines the strengths of both autoregressive (AR) and non-autoregressive (NAR) methods.
We show a significant improvement in generation performance and generation speed compared to our previous work and other state-of-the-art (SOTA) large-scale TTS models.
arXiv Detail & Related papers (2024-08-25T17:07:39Z) - Leveraging Parameter-Efficient Transfer Learning for Multi-Lingual Text-to-Speech Adaptation [27.78435674869292]
Different languages have distinct phonetic systems and vary in their prosodic features, making it challenging to develop a Text-to-Speech model.
We propose to integrate parameter-efficient transfer learning (PETL) methods, such as adapters and hypernetworks, into the TTS architecture for multilingual speech synthesis.
arXiv Detail & Related papers (2024-06-25T03:50:54Z) - Prior-agnostic Multi-scale Contrastive Text-Audio Pre-training for Parallelized TTS Frontend Modeling [13.757256085713571]
We present a novel two-stage prediction pipeline named TAP-FM.
Specifically, we present a Multi-scale Contrastive Text-audio Pre-training protocol (MC-TAP), which aims to acquire richer representations via multi-granularity contrastive pre-training in an unsupervised manner.
Our framework demonstrates the ability to delve deep into both global and local text-audio semantic and acoustic representations.
arXiv Detail & Related papers (2024-04-14T08:56:19Z) - Generative Pretrained Structured Transformers: Unsupervised Syntactic Language Models at Scale [36.584680344291556]
We present Generative Pretrained Structured Transformers (GPST), an unsupervised SLM at scale capable of being pre-trained from scratch on raw texts.
GPST circumvents the limitations of previous SLMs such as relying on gold trees and sequential training.
GPST significantly outperforms existing unsupervised SLMs on left-to-right grammar induction.
arXiv Detail & Related papers (2024-03-13T06:54:47Z) - Tuning Large language model for End-to-end Speech Translation [7.297914077124909]
This paper introduces LST, a large multimodal model designed to excel at the E2E-ST task.
Experimental results on the MuST-C speech translation benchmark demonstrate that LST-13B achieves BLEU scores of 30.39/41.55/35.33 on the En-De/En-Fr/En-Es language pairs, surpassing previous models and establishing a new state of the art.
arXiv Detail & Related papers (2023-10-03T13:43:50Z) - Cross-Modal Multi-Tasking for Speech-to-Text Translation via Hard
Parameter Sharing [72.56219471145232]
We propose an ST/MT multi-tasking framework with hard parameter sharing.
Our method reduces the speech-text modality gap via a pre-processing stage.
We show that our framework improves attentional encoder-decoder, Connectionist Temporal Classification (CTC), transducer, and joint CTC/attention models by an average of +0.5 BLEU.
arXiv Detail & Related papers (2023-09-27T17:48:14Z) - Efficient GPT Model Pre-training using Tensor Train Matrix
Representation [65.96485282393361]
Large-scale transformer models feature billions of parameters, leading to difficulties in their deployment and prohibitive training costs from scratch.
To reduce the number of parameters in the GPT-2 architecture, we replace the matrices of fully-connected layers with the corresponding Tensor Train Matrix (TTM) structure (a toy sketch of this factorization appears after this list).
The resulting GPT-based model stores up to 40% fewer parameters, showing perplexity comparable to that of the original model.
arXiv Detail & Related papers (2023-06-05T08:38:25Z) - Code-Switching Text Generation and Injection in Mandarin-English ASR [57.57570417273262]
We investigate text generation and injection for improving the performance of a streaming model commonly used in industry, the Transformer-Transducer (T-T).
We first propose a strategy to generate code-switching text data and then investigate injecting the generated text into the T-T model explicitly via Text-To-Speech (TTS) conversion or implicitly by tying the speech and text latent spaces.
Experimental results on the T-T model trained with a dataset containing 1,800 hours of real Mandarin-English code-switched speech show that our approaches to injecting generated code-switching text significantly boost the performance of T-T models.
arXiv Detail & Related papers (2023-03-20T09:13:27Z) - ESPnet2-TTS: Extending the Edge of TTS Research [62.92178873052468]
ESPnet2-TTS is an end-to-end text-to-speech (E2E-TTS) toolkit.
New features include: on-the-fly flexible pre-processing, joint training with neural vocoders, and state-of-the-art TTS models with extensions like full-band E2E text-to-waveform modeling.
arXiv Detail & Related papers (2021-10-15T03:27:45Z) - lamBERT: Language and Action Learning Using Multimodal BERT [0.1942428068361014]
This study proposes lamBERT, a model for language and action learning using multimodal BERT.
Experiments are conducted in a grid environment that requires language understanding for the agent to act properly.
The lamBERT model obtained higher rewards in multitask settings and transfer settings when compared to other models.
arXiv Detail & Related papers (2020-04-15T13:54:55Z) - Multilingual Denoising Pre-training for Neural Machine Translation [132.66750663226287]
mBART is a sequence-to-sequence denoising auto-encoder pre-trained on large-scale monolingual corpora.
mBART is one of the first methods for pre-training a complete sequence-to-sequence model.
arXiv Detail & Related papers (2020-01-22T18:59:17Z)
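The tensor-train matrix (TTM) idea mentioned in the "Efficient GPT Model Pre-training" entry above can be illustrated by factorizing a fully-connected weight into two small cores. The shapes, ranks, and two-core layout below are illustrative assumptions, not that paper's exact parameterization.

```python
# Toy sketch of a two-core tensor-train-matrix (TTM) linear layer,
# replacing an (m1*m2) x (n1*n2) weight matrix with two small cores.
import torch
import torch.nn as nn

class TTMLinear(nn.Module):
    def __init__(self, m1, m2, n1, n2, rank):
        super().__init__()
        # Full matrix would hold m1*m2*n1*n2 parameters;
        # the two cores hold m1*n1*rank + rank*m2*n2 instead.
        self.core1 = nn.Parameter(torch.randn(m1, n1, rank) * 0.02)
        self.core2 = nn.Parameter(torch.randn(rank, m2, n2) * 0.02)
        self.bias = nn.Parameter(torch.zeros(m1 * m2))
        self.n1, self.n2 = n1, n2

    def forward(self, x):                        # x: (batch, n1*n2)
        b = x.shape[0]
        x = x.view(b, self.n1, self.n2)          # split the input index
        # Contract with the first core over j1, then with the second core
        # over the rank and j2; this realizes y = x @ W.T for the implicit
        # W[(i1,i2),(j1,j2)] = sum_r core1[i1,j1,r] * core2[r,i2,j2].
        t = torch.einsum("bjk,ijr->brik", x, self.core1)
        y = torch.einsum("brik,rmk->bim", t, self.core2)
        return y.reshape(b, -1) + self.bias

# Example: a 1024x1024 feed-forward weight (~1.05M parameters) is replaced
# by two cores of 32*32*8 parameters each (~16K) plus the bias.
layer = TTMLinear(m1=32, m2=32, n1=32, n2=32, rank=8)
out = layer(torch.randn(4, 1024))                # -> shape (4, 1024)
```

The per-layer saving shown here is the mechanism behind the overall parameter reduction that entry reports, traded off against a small change in perplexity.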