Text Generation with Speech Synthesis for ASR Data Augmentation
- URL: http://arxiv.org/abs/2305.16333v1
- Date: Mon, 22 May 2023 18:45:20 GMT
- Title: Text Generation with Speech Synthesis for ASR Data Augmentation
- Authors: Zhuangqun Huang, Gil Keren, Ziran Jiang, Shashank Jain, David
Goss-Grubbs, Nelson Cheng, Farnaz Abtahi, Duc Le, David Zhang, Antony
D'Avirro, Ethan Campbell-Taylor, Jessie Salas, Irina-Elena Veliche, Xi Chen
- Abstract summary: We explore text augmentation for Automatic Speech Recognition (ASR) using large-scale pre-trained neural networks.
We find that neural models achieve 9%-15% relative WER improvement and outperform traditional methods.
- Score: 17.348764629839636
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Aiming at reducing the reliance on expensive human annotations, data
synthesis for Automatic Speech Recognition (ASR) has remained an active area of
research. While prior work mainly focuses on synthetic speech generation for
ASR data augmentation, its combination with text generation methods is
considerably less explored. In this work, we explore text augmentation for ASR
using large-scale pre-trained neural networks, and systematically compare those
to traditional text augmentation methods. The generated synthetic texts are
then converted to synthetic speech using a text-to-speech (TTS) system and
added to the ASR training data. In experiments conducted on three datasets, we
find that neural models achieve 9%-15% relative WER improvement and outperform
traditional methods. We conclude that text augmentation, particularly through
modern neural approaches, is a viable tool for improving the accuracy of ASR
systems.
Related papers
- Hard-Synth: Synthesizing Diverse Hard Samples for ASR using Zero-Shot TTS and LLM [48.71951982716363]
Text-to-speech (TTS) models have been widely adopted to enhance automatic speech recognition (ASR) systems.
We propose Hard- Synth, a novel ASR data augmentation method that leverages large language models (LLMs) and advanced zero-shot TTS.
Our approach employs LLMs to generate diverse in-domain text through rewriting, without relying on additional text data.
arXiv Detail & Related papers (2024-11-20T09:49:37Z) - On the Problem of Text-To-Speech Model Selection for Synthetic Data Generation in Automatic Speech Recognition [31.58289343561422]
We compare five different TTS decoder architectures in the scope of synthetic data generation to show the impact on CTC-based speech recognition training.
For data generation auto-regressive decoding performs better than non-autoregressive decoding, and propose an approach to quantify TTS generalization capabilities.
arXiv Detail & Related papers (2024-07-31T09:37:27Z) - On the Effect of Purely Synthetic Training Data for Different Automatic Speech Recognition Architectures [19.823015917720284]
We evaluate the utility of synthetic data for training automatic speech recognition.
We reproduce the original training data, training ASR systems solely on synthetic data.
We show that the TTS models generalize well, even when training scores indicate overfitting.
arXiv Detail & Related papers (2024-07-25T12:44:45Z) - Text Injection for Neural Contextual Biasing [57.589903308622745]
This work proposes contextual text injection (CTI) to enhance contextual ASR.
CTI with 100 billion text sentences can achieve up to 43.3% relative WER reduction from a strong neural biasing model.
arXiv Detail & Related papers (2024-06-05T04:20:17Z) - On the Relevance of Phoneme Duration Variability of Synthesized Training
Data for Automatic Speech Recognition [0.552480439325792]
We focus on the temporal structure of synthetic data and its relation to ASR training.
We show how much the degradation of synthetic data quality is influenced by duration modeling in non-autoregressive TTS.
Using a simple algorithm we shift phoneme duration distributions of the TTS system closer to real durations.
arXiv Detail & Related papers (2023-10-12T08:45:21Z) - Boosting Punctuation Restoration with Data Generation and Reinforcement
Learning [70.26450819702728]
Punctuation restoration is an important task in automatic speech recognition (ASR)
The discrepancy between written punctuated texts and ASR texts limits the usability of written texts in training punctuation restoration systems for ASR texts.
This paper proposes a reinforcement learning method to exploit in-topic written texts and recent advances in large pre-trained generative language models to bridge this gap.
arXiv Detail & Related papers (2023-07-24T17:22:04Z) - Towards Selection of Text-to-speech Data to Augment ASR Training [20.115236045164355]
We train a neural network to measure the similarity of a synthetic data to real speech.
We find that incorporating synthetic samples with considerable dissimilarity to real speech is crucial for boosting recognition performance.
arXiv Detail & Related papers (2023-05-30T17:24:28Z) - Text-To-Speech Data Augmentation for Low Resource Speech Recognition [0.0]
This research proposes a new data augmentation method to improve ASR models for agglutinative and low-resource languages.
Experiments were conducted using the corpus of the Quechua language, which is an agglutinative and low-resource language.
An 8.73% improvement in the word-error-rate (WER) of the ASR model is obtained using a combination of synthetic text and synthetic speech.
arXiv Detail & Related papers (2022-04-01T08:53:44Z) - Advanced Long-context End-to-end Speech Recognition Using
Context-expanded Transformers [56.56220390953412]
We extend our prior work by introducing the Conformer architecture to further improve the accuracy.
We demonstrate that the extended Transformer provides state-of-the-art end-to-end ASR performance.
arXiv Detail & Related papers (2021-04-19T16:18:00Z) - SDA: Improving Text Generation with Self Data Augmentation [88.24594090105899]
We propose to improve the standard maximum likelihood estimation (MLE) paradigm by incorporating a self-imitation-learning phase for automatic data augmentation.
Unlike most existing sentence-level augmentation strategies, our method is more general and could be easily adapted to any MLE-based training procedure.
arXiv Detail & Related papers (2021-01-02T01:15:57Z) - You Do Not Need More Data: Improving End-To-End Speech Recognition by
Text-To-Speech Data Augmentation [59.31769998728787]
We build our TTS system on an ASR training database and then extend the data with synthesized speech to train a recognition model.
Our system establishes a competitive result for end-to-end ASR trained on LibriSpeech train-clean-100 set with WER 4.3% for test-clean and 13.5% for test-other.
arXiv Detail & Related papers (2020-05-14T17:24:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.