Proteno: Text Normalization with Limited Data for Fast Deployment in
Text to Speech Systems
- URL: http://arxiv.org/abs/2104.07777v1
- Date: Thu, 15 Apr 2021 21:14:28 GMT
- Title: Proteno: Text Normalization with Limited Data for Fast Deployment in
Text to Speech Systems
- Authors: Shubhi Tyagi, Antonio Bonafonte, Jaime Lorenzo-Trueba, Javier Latorre
- Abstract summary: Developing Text Normalization (TN) systems for Text-to-Speech (TTS) on new languages is hard.
We propose a novel architecture that facilitates this for multiple languages while using less than 3% of the data used by state-of-the-art systems on English.
We publish the first results on TN for TTS in Spanish and Tamil, and demonstrate that the approach performs comparably to previous work on English.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Developing Text Normalization (TN) systems for Text-to-Speech (TTS) on new
languages is hard. We propose a novel architecture that facilitates this for
multiple languages while using less than 3% of the data used by
state-of-the-art systems on English. We treat TN as a sequence
classification problem and propose a granular tokenization mechanism that
enables the system to learn the majority of classes and their normalizations
from the training data itself. This is combined with minimal precoded
linguistic knowledge for the remaining classes. We publish the first results on TN for
TTS in Spanish and Tamil, and demonstrate that the approach performs
comparably to previous work on English. All annotated
datasets used for experimentation will be released at
https://github.com/amazon-research/proteno.
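The core idea of treating TN as per-token classification can be illustrated with a minimal sketch. Everything below is a hypothetical toy, not Proteno's actual implementation: the tokenizer, the `LEARNED` table (standing in for classes learned from training data), and the digit-by-digit cardinal expansion are all illustrative assumptions.

```python
import re

def granular_tokenize(text):
    """Granular tokenization: split on whitespace and on boundaries between
    letters, digits, and other symbols, so '3km' yields '3' and 'km'."""
    return re.findall(r"[A-Za-z]+|\d+|[^\w\s]", text)

# "Self-learnable" classes: normalizations observed in training data
# (hypothetical entries for illustration).
LEARNED = {"km": "kilometers", "&": "and"}

# Minimal precoded linguistic knowledge for a class that is hard to cover
# from limited data, e.g. cardinals (tiny illustrative table).
ONES = ["zero", "one", "two", "three", "four",
        "five", "six", "seven", "eight", "nine"]

def expand_cardinal(tok):
    # Digit-by-digit expansion keeps the sketch short; a real system
    # would verbalize the full number ("thirty-two", not "three two").
    return " ".join(ONES[int(d)] for d in tok)

def normalize(text):
    out = []
    for tok in granular_tokenize(text):
        if tok in LEARNED:        # class learned from training data
            out.append(LEARNED[tok])
        elif tok.isdigit():       # precoded CARDINAL class
            out.append(expand_cardinal(tok))
        else:                     # pass-through class: token unchanged
            out.append(tok)
    return " ".join(out)

print(normalize("3km & back"))  # -> "three kilometers and back"
```

The granular tokenizer is what makes the classification framing tractable with limited data: splitting "3km" into "3" and "km" means each piece falls into a small, reusable class rather than requiring the whole compound to be seen in training.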
Related papers
- Text-To-Speech Synthesis In The Wild [76.71096751337888]
Text-to-speech (TTS) systems are traditionally trained using modest databases of studio-quality, prompted or read speech collected in benign acoustic environments such as anechoic rooms.
We introduce the TTS In the Wild (TITW) dataset, the result of a fully automated pipeline, applied to the VoxCeleb1 dataset commonly used for speaker recognition.
We show that a number of recent TTS models can be trained successfully using TITW-Easy, but that it remains extremely challenging to produce similar results using TITW-Hard.
arXiv Detail & Related papers (2024-09-13T10:58:55Z)
- Learning to Speak from Text: Zero-Shot Multilingual Text-to-Speech with Unsupervised Text Pretraining [65.30528567491984]
This paper proposes a method for zero-shot multilingual TTS using text-only data for the target language.
The use of text-only data allows the development of TTS systems for low-resource languages.
Evaluation results demonstrate highly intelligible zero-shot TTS with a character error rate of less than 12% for an unseen language.
arXiv Detail & Related papers (2023-01-30T00:53:50Z)
- Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
- A Unified Transformer-based Framework for Duplex Text Normalization [33.90810154067128]
Text normalization (TN) and inverse text normalization (ITN) are essential preprocessing and postprocessing steps for text-to-speech synthesis and automatic speech recognition.
We propose a unified framework for building a single neural duplex system that can simultaneously handle TN and ITN.
Our systems achieve state-of-the-art results on the Google TN dataset for English and Russian.
arXiv Detail & Related papers (2021-08-23T01:55:03Z)
- Facebook AI's WMT20 News Translation Task Submission [69.92594751788403]
This paper describes Facebook AI's submission to the WMT20 shared news translation task.
We focus on the low resource setting and participate in two language pairs, Tamil -> English and Inuktitut -> English.
We approach the low resource problem using two main strategies, leveraging all available data and adapting the system to the target news domain.
arXiv Detail & Related papers (2020-11-16T21:49:00Z)
- Consecutive Decoding for Speech-to-text Translation [51.155661276936044]
COnSecutive Transcription and Translation (COSTT) is an integral approach for speech-to-text translation.
The key idea is to generate source transcript and target translation text with a single decoder.
Our method is verified on three mainstream datasets.
arXiv Detail & Related papers (2020-09-21T10:10:45Z)
- "Listen, Understand and Translate": Triple Supervision Decouples End-to-end Speech-to-text Translation [49.610188741500274]
An end-to-end speech-to-text translation (ST) system takes audio in a source language and outputs text in a target language.
Existing methods are limited by the amount of parallel corpus.
We build a system to fully utilize signals in a parallel ST corpus.
arXiv Detail & Related papers (2020-09-21T09:19:07Z)
- Deep Learning for Hindi Text Classification: A Comparison [6.8629257716723]
Research on classifying the morphologically rich, low-resource Hindi language written in Devanagari script has been limited by the absence of a large labeled corpus.
In this work, we use translated versions of English datasets to evaluate models based on CNN, LSTM and Attention.
The paper also serves as a tutorial for popular text classification techniques.
arXiv Detail & Related papers (2020-01-19T09:29:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.