Improving Data Driven Inverse Text Normalization using Data Augmentation
- URL: http://arxiv.org/abs/2207.09674v1
- Date: Wed, 20 Jul 2022 06:07:26 GMT
- Title: Improving Data Driven Inverse Text Normalization using Data Augmentation
- Authors: Laxmi Pandey, Debjyoti Paul, Pooja Chitkara, Yutong Pang, Xuedong
Zhang, Kjell Schubert, Mark Chou, Shu Liu, Yatharth Saraf
- Abstract summary: Inverse text normalization (ITN) is used to convert the spoken form output of an automatic speech recognition (ASR) system to a written form.
We present a data augmentation technique that effectively generates rich spoken-written numeric pairs from out-of-domain textual data.
We empirically demonstrate that ITN model trained using our data augmentation technique consistently outperform ITN model trained using only in-domain data.
- Score: 14.820077884045645
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Inverse text normalization (ITN) is used to convert the spoken form output of
an automatic speech recognition (ASR) system to a written form. Traditional
handcrafted ITN rules can be complex to transcribe and maintain. Meanwhile
neural modeling approaches require quality large-scale spoken-written pair
examples in the same or similar domain as the ASR system (in-domain data), to
train. Both these approaches require costly and complex annotations. In this
paper, we present a data augmentation technique that effectively generates rich
spoken-written numeric pairs from out-of-domain textual data with minimal human
annotation. We empirically demonstrate that ITN model trained using our data
augmentation technique consistently outperform ITN model trained using only
in-domain data across all numeric surfaces like cardinal, currency, and
fraction, by an overall accuracy of 14.44%.
Related papers
- Alignment-Free Training for Transducer-based Multi-Talker ASR [55.1234384771616]
Multi-talker RNNT (MT-RNNT) aims to achieve recognition without relying on costly front-end source separation.
We propose a novel alignment-free training scheme for the MT-RNNT (MT-RNNT-AFT) that adopts the standard RNNT architecture.
arXiv Detail & Related papers (2024-09-30T13:58:11Z) - Text-To-Speech Synthesis In The Wild [76.71096751337888]
Text-to-speech (TTS) systems are traditionally trained using modest databases of studio-quality, prompted or read speech collected in benign acoustic environments such as anechoic rooms.
We introduce the TTS In the Wild (TITW) dataset, the result of a fully automated pipeline, applied to the VoxCeleb1 dataset commonly used for speaker recognition.
We show that a number of recent TTS models can be trained successfully using TITW-Easy, but that it remains extremely challenging to produce similar results using TITW-Hard.
arXiv Detail & Related papers (2024-09-13T10:58:55Z) - Handling Numeric Expressions in Automatic Speech Recognition [56.972851337263755]
We compare cascaded and end-to-end approaches to recognize and format numeric expression.
Results show that adapted end-to-end models offer competitive performance with the advantage of lower latency and inference cost.
arXiv Detail & Related papers (2024-07-18T09:46:19Z) - Improving Robustness of Neural Inverse Text Normalization via
Data-Augmentation, Semi-Supervised Learning, and Post-Aligning Method [4.343606621506086]
Inverse text normalization (ITN) is crucial for converting spoken-form into written-form, especially in the context of automatic speech recognition (ASR)
We propose a direct training approach that utilizes ASR-generated written or spoken text, with pairs augmented through ASR linguistic context emulation and a semi-supervised learning method enhanced by a large language model.
Our proposed methods remarkably improved ITN performance in various ASR scenarios.
arXiv Detail & Related papers (2023-09-12T06:05:57Z) - Text-only domain adaptation for end-to-end ASR using integrated
text-to-mel-spectrogram generator [17.44686265224974]
We propose an end-to-end Automatic Speech Recognition (ASR) system that can be trained on transcribed speech data, text-only data, or a mixture of both.
We demonstrate that the proposed training method significantly improves ASR accuracy compared to the system trained on transcribed speech only.
arXiv Detail & Related papers (2023-02-27T18:47:55Z) - A Simple Baseline for Domain Adaptation in End to End ASR Systems Using
Synthetic Data [1.14219428942199]
We propose a simple baseline technique for domain adaptation in end-to-end speech recognition models.
We convert the text-only corpus to audio data using single speaker Text to Speech (TTS) engine.
We show that single speaker synthetic TTS data coupled with final dense layer only fine-tuning provides reasonable improvements in word error rates.
arXiv Detail & Related papers (2022-06-22T12:07:38Z) - A Likelihood Ratio based Domain Adaptation Method for E2E Models [10.510472957585646]
End-to-end (E2E) automatic speech recognition models like Recurrent Neural Networks Transducer (RNN-T) are becoming a popular choice for streaming ASR applications like voice assistants.
While E2E models are very effective at learning representation of the training data they are trained on, their accuracy on unseen domains remains a challenging problem.
In this work, we explore a contextual biasing approach using likelihood-ratio that leverages text data sources to adapt RNN-T model to new domains and entities.
arXiv Detail & Related papers (2022-01-10T21:22:39Z) - Lexically Aware Semi-Supervised Learning for OCR Post-Correction [90.54336622024299]
Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents.
Previous work has demonstrated the utility of neural post-correction methods on recognition of less-well-resourced languages.
We present a semi-supervised learning method that makes it possible to utilize raw images to improve performance.
arXiv Detail & Related papers (2021-11-04T04:39:02Z) - A Unified Transformer-based Framework for Duplex Text Normalization [33.90810154067128]
Text normalization (TN) and inverse text normalization (ITN) are essential preprocessing and postprocessing steps for text-to-speech synthesis and automatic speech recognition.
We propose a unified framework for building a single neural duplex system that can simultaneously handle TN and ITN.
Our systems achieve state-of-the-art results on the Google TN dataset for English and Russian.
arXiv Detail & Related papers (2021-08-23T01:55:03Z) - Pretraining Techniques for Sequence-to-Sequence Voice Conversion [57.65753150356411]
Sequence-to-sequence (seq2seq) voice conversion (VC) models are attractive owing to their ability to convert prosody.
We propose to transfer knowledge from other speech processing tasks where large-scale corpora are easily available, typically text-to-speech (TTS) and automatic speech recognition (ASR)
We argue that VC models with such pretrained ASR or TTS model parameters can generate effective hidden representations for high-fidelity, highly intelligible converted speech.
arXiv Detail & Related papers (2020-08-07T11:02:07Z) - Few-shot Natural Language Generation for Task-Oriented Dialog [113.07438787659859]
We present FewShotWoz, the first NLG benchmark to simulate the few-shot learning setting in task-oriented dialog systems.
We develop the SC-GPT model, which is pre-trained on a large set of annotated NLG corpus to acquire the controllable generation ability.
Experiments on FewShotWoz and the large Multi-Domain-WOZ datasets show that the proposed SC-GPT significantly outperforms existing methods.
arXiv Detail & Related papers (2020-02-27T18:48:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.