A Unified Transformer-based Framework for Duplex Text Normalization
- URL: http://arxiv.org/abs/2108.09889v1
- Date: Mon, 23 Aug 2021 01:55:03 GMT
- Title: A Unified Transformer-based Framework for Duplex Text Normalization
- Authors: Tuan Manh Lai, Yang Zhang, Evelina Bakhturina, Boris Ginsburg, Heng Ji
- Abstract summary: Text normalization (TN) and inverse text normalization (ITN) are essential preprocessing and postprocessing steps for text-to-speech synthesis and automatic speech recognition, respectively.
We propose a unified framework for building a single neural duplex system that can simultaneously handle TN and ITN.
Our systems achieve state-of-the-art results on the Google TN dataset for English and Russian.
- Score: 33.90810154067128
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text normalization (TN) and inverse text normalization (ITN) are essential
preprocessing and postprocessing steps for text-to-speech synthesis and
automatic speech recognition, respectively. Many methods have been proposed for
either TN or ITN, ranging from weighted finite-state transducers to neural
networks. Despite their impressive performance, these methods aim to tackle
only one of the two tasks but not both. As a result, in a complete spoken
dialog system, two separate models for TN and ITN need to be built. This
heterogeneity increases the technical complexity of the system, which in turn
increases the cost of maintenance in a production setting. Motivated by this
observation, we propose a unified framework for building a single neural duplex
system that can simultaneously handle TN and ITN. Combined with a simple but
effective data augmentation method, our systems achieve state-of-the-art
results on the Google TN dataset for English and Russian. They can also reach
over 95% sentence-level accuracy on an internal English TN dataset without any
additional fine-tuning. In addition, we create a cleaned dataset from the
Spoken Wikipedia Corpora for German and report the performance of our systems
on the dataset. Overall, experimental results demonstrate the proposed duplex
text normalization framework is highly effective and applicable to a range of
domains and languages.
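The abstract does not spell out the system's interface here; as a rough illustration, one common way to make a single seq2seq model "duplex" is to prepend a task tag so the same model can run in either direction. The `t5-small` backbone and the `tn:`/`itn:` tag format below are placeholder assumptions, not the authors' released model, and the model would need fine-tuning on TN/ITN data before its outputs are meaningful.

```python
# Minimal duplex TN/ITN sketch: one seq2seq model, task selected by a prefix tag.
# Assumptions: t5-small as a stand-in backbone; the "tn:"/"itn:" tag format is made up.
from transformers import T5ForConditionalGeneration, T5Tokenizer

MODEL_NAME = "t5-small"  # placeholder backbone, not the paper's checkpoint
tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)

def duplex_normalize(text: str, direction: str) -> str:
    """direction: 'tn' (written -> spoken) or 'itn' (spoken -> written)."""
    inputs = tokenizer(f"{direction}: {text}", return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=64)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# After fine-tuning, one model serves both directions:
print(duplex_normalize("$5", "tn"))             # e.g. "five dollars"
print(duplex_normalize("five dollars", "itn"))  # e.g. "$5"
```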
Related papers
- Text-To-Speech Synthesis In The Wild [76.71096751337888]
Text-to-speech (TTS) systems are traditionally trained using modest databases of studio-quality, prompted or read speech collected in benign acoustic environments such as anechoic rooms.
We introduce the TTS In the Wild (TITW) dataset, the result of a fully automated pipeline, applied to the VoxCeleb1 dataset commonly used for speaker recognition.
We show that a number of recent TTS models can be trained successfully using TITW-Easy, but that it remains extremely challenging to produce similar results using TITW-Hard.
arXiv Detail & Related papers (2024-09-13T10:58:55Z)
- Unified Model Learning for Various Neural Machine Translation [63.320005222549646]
Existing neural machine translation (NMT) studies mainly focus on developing dataset-specific models.
We propose a versatile model, the Unified Model Learning for NMT (UMLNMT), that works with data from different tasks.
UMLNMT achieves substantial improvements over dataset-specific models with significantly reduced model deployment costs.
arXiv Detail & Related papers (2023-05-04T12:21:52Z)
- Language Agnostic Data-Driven Inverse Text Normalization [6.43601166279978]
The inverse text normalization (ITN) problem attracts the attention of researchers from various fields.
Due to the scarcity of labeled spoken-written datasets, the studies on non-English data-driven ITN are quite limited.
We propose a language-agnostic data-driven ITN framework to fill this gap.
arXiv Detail & Related papers (2023-01-20T10:33:03Z)
- Improving Data Driven Inverse Text Normalization using Data Augmentation [14.820077884045645]
Inverse text normalization (ITN) is used to convert the spoken form output of an automatic speech recognition (ASR) system to a written form.
We present a data augmentation technique that effectively generates rich spoken-written numeric pairs from out-of-domain textual data.
We empirically demonstrate that an ITN model trained with our data augmentation technique consistently outperforms an ITN model trained using only in-domain data; a sketch of such pair generation follows below.
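This summary does not give the paper's exact augmentation recipe; the following is a minimal sketch, assuming the third-party num2words package, of how spoken-written numeric pairs could be mined from out-of-domain text.

```python
# Hypothetical pair mining: verbalize every number found in raw text to get
# (spoken, written) training pairs. num2words is an assumption, not the
# paper's tooling.
import re
from num2words import num2words

def make_pairs(sentence: str):
    """Yield a (spoken_form, written_form) pair for each integer in the sentence."""
    for match in re.finditer(r"\d+", sentence):
        written = match.group()
        yield num2words(int(written)), written  # e.g. "250" -> "two hundred and fifty"

for spoken, written in make_pairs("The meeting has 250 attendees and runs for 3 hours."):
    print(f"{spoken!r} -> {written!r}")
```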
arXiv Detail & Related papers (2022-07-20T06:07:26Z)
- Non-Parametric Domain Adaptation for End-to-End Speech Translation [72.37869362559212]
End-to-End Speech Translation (E2E-ST) has received increasing attention due to its potential for less error propagation, lower latency, and fewer parameters.
We propose a novel non-parametric method that leverages a domain-specific text translation corpus to achieve domain adaptation for the E2E-ST system.
arXiv Detail & Related papers (2022-05-23T11:41:02Z)
- Proteno: Text Normalization with Limited Data for Fast Deployment in Text to Speech Systems [15.401574286479546]
Building Text Normalization (TN) systems for Text-to-Speech (TTS) in new languages is hard.
We propose a novel architecture to facilitate this for multiple languages while using less than 3% of the data used by state-of-the-art systems on English.
We publish the first results on TN for TTS in Spanish and Tamil and demonstrate that the approach performs comparably to previous work on English.
arXiv Detail & Related papers (2021-04-15T21:14:28Z)
- Neural Inverse Text Normalization [11.240669509034298]
We propose an efficient and robust neural solution for inverse text normalization.
We show that this can be easily extended to other languages without the need for a linguistic expert to manually curate them.
A transformer-based model infused with pretraining consistently achieves a lower WER across several datasets.
arXiv Detail & Related papers (2021-02-12T07:53:53Z)
- BURT: BERT-inspired Universal Representation from Twin Structure [89.82415322763475]
BURT (BERT inspired Universal Representation from Twin Structure) is capable of generating universal, fixed-size representations for input sequences of any granularity.
Our proposed BURT adopts a Siamese network, learning sentence-level representations from a natural language inference dataset and word/phrase-level representations from a paraphrasing dataset; a toy sketch of the Siamese setup appears below.
We evaluate BURT across different granularities of text similarity tasks, including STS tasks, SemEval2013 Task 5(a) and some commonly used word similarity tasks.
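BURT's actual code is not reproduced in this summary; the toy PyTorch sketch below only illustrates the Siamese pattern of one shared encoder scoring two inputs, with a bag-of-embeddings stand-in where BURT uses a BERT encoder.

```python
# Toy Siamese setup: the same encoder (shared weights) embeds both inputs,
# and similarity is computed on the two fixed-size vectors.
import torch
import torch.nn as nn

class SiameseEncoder(nn.Module):
    def __init__(self, vocab_size: int = 30000, dim: int = 256):
        super().__init__()
        self.encoder = nn.EmbeddingBag(vocab_size, dim)  # stand-in for BERT

    def forward(self, ids_a: torch.Tensor, ids_b: torch.Tensor) -> torch.Tensor:
        vec_a = self.encoder(ids_a)  # shared weights encode both branches
        vec_b = self.encoder(ids_b)
        return nn.functional.cosine_similarity(vec_a, vec_b)

model = SiameseEncoder()
a = torch.randint(0, 30000, (1, 8))  # stand-in token ids for two sentences
b = torch.randint(0, 30000, (1, 8))
print(model(a, b))  # similarity score in [-1, 1]
```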
arXiv Detail & Related papers (2020-04-29T04:01:52Z)
- Have Your Text and Use It Too! End-to-End Neural Data-to-Text Generation with Semantic Fidelity [3.8673630752805432]
We present DataTuner, a neural, end-to-end data-to-text generation system.
We take a two-stage generation-reranking approach, combining a fine-tuned language model with a semantic fidelity classifier (the reranking step is sketched below).
We show that DataTuner achieves state-of-the-art results on the automated metrics across four major D2T datasets.
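DataTuner's implementation is not part of this summary; the sketch below shows only the generic generate-then-rerank pattern, with a hypothetical toy_fidelity scorer standing in for the paper's semantic fidelity classifier.

```python
# Generic generate-then-rerank: keep the candidate the fidelity scorer likes best.
def rerank(candidates, data, fidelity_score):
    """Return the candidate rated most faithful to the structured input `data`."""
    return max(candidates, key=lambda text: fidelity_score(data, text))

# Hypothetical stand-in scorer: count how many data values the text mentions.
def toy_fidelity(data, text):
    return sum(value.lower() in text.lower() for value in data.values())

data = {"name": "Aromi", "food": "Italian"}
candidates = ["Aromi serves food.", "Aromi serves Italian food."]
print(rerank(candidates, data, toy_fidelity))  # -> "Aromi serves Italian food."
```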
arXiv Detail & Related papers (2020-04-08T11:16:53Z)
- Few-shot Natural Language Generation for Task-Oriented Dialog [113.07438787659859]
We present FewShotWoz, the first NLG benchmark to simulate the few-shot learning setting in task-oriented dialog systems.
We develop the SC-GPT model, which is pre-trained on a large annotated NLG corpus to acquire controllable generation ability.
Experiments on FewShotWoz and the large Multi-Domain-WOZ datasets show that the proposed SC-GPT significantly outperforms existing methods.
arXiv Detail & Related papers (2020-02-27T18:48:33Z)
- Bi-Decoder Augmented Network for Neural Machine Translation [108.3931242633331]
We propose a novel Bi-Decoder Augmented Network (BiDAN) for the neural machine translation task.
Since each decoder transforms the representation of the input text into its corresponding language, joint training with two target ends gives the shared encoder the potential to produce a language-independent semantic space; a toy sketch of this wiring follows.
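BiDAN's released code is not shown here; the PyTorch sketch below only illustrates the shared-encoder, two-decoder wiring, with made-up dimensions and randomly initialized layers.

```python
# Toy shared-encoder / two-decoder wiring: both decoders read the same memory,
# pushing the encoder toward a language-independent representation.
import torch
import torch.nn as nn

def make_decoder(dim: int, heads: int) -> nn.TransformerDecoder:
    layer = nn.TransformerDecoderLayer(dim, heads, batch_first=True)
    return nn.TransformerDecoder(layer, num_layers=2)

class BiDecoderNet(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.decoder_fwd = make_decoder(dim, heads)  # source -> target language
        self.decoder_bwd = make_decoder(dim, heads)  # auxiliary target -> source

    def forward(self, src, tgt_fwd, tgt_bwd):
        memory = self.encoder(src)  # shared semantic space for both decoders
        return self.decoder_fwd(tgt_fwd, memory), self.decoder_bwd(tgt_bwd, memory)

model = BiDecoderNet()
src = torch.randn(2, 7, 256)  # already-embedded stand-in inputs
out_a, out_b = model(src, torch.randn(2, 5, 256), torch.randn(2, 5, 256))
print(out_a.shape, out_b.shape)  # torch.Size([2, 5, 256]) each
```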
arXiv Detail & Related papers (2020-01-14T02:05:14Z)