Text-To-Speech Data Augmentation for Low Resource Speech Recognition
- URL: http://arxiv.org/abs/2204.00291v1
- Date: Fri, 1 Apr 2022 08:53:44 GMT
- Title: Text-To-Speech Data Augmentation for Low Resource Speech Recognition
- Authors: Rodolfo Zevallos
- Abstract summary: This research proposes a new data augmentation method to improve ASR models for agglutinative and low-resource languages.
Experiments were conducted using the corpus of the Quechua language, which is an agglutinative and low-resource language.
An 8.73% improvement in the word-error-rate (WER) of the ASR model is obtained using a combination of synthetic text and synthetic speech.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Nowadays, the main problem of deep learning techniques used in the
development of automatic speech recognition (ASR) models is the lack of
transcribed data. The goal of this research is to propose a new data
augmentation method to improve ASR models for agglutinative and low-resource
languages. This novel data augmentation method generates both synthetic text
and synthetic audio. Some experiments were conducted using the corpus of the
Quechua language, which is an agglutinative and low-resource language. In this
study, a sequence-to-sequence (seq2seq) model was applied to generate synthetic
text, in addition to generating synthetic speech using a text-to-speech (TTS)
model for Quechua. The results show that the new data augmentation method works
well to improve the ASR model for Quechua. In this research, an 8.73%
improvement in the word-error-rate (WER) of the ASR model is obtained using a
combination of synthetic text and synthetic speech.
Related papers
- Hard-Synth: Synthesizing Diverse Hard Samples for ASR using Zero-Shot TTS and LLM [48.71951982716363]
Text-to-speech (TTS) models have been widely adopted to enhance automatic speech recognition (ASR) systems.
We propose Hard- Synth, a novel ASR data augmentation method that leverages large language models (LLMs) and advanced zero-shot TTS.
Our approach employs LLMs to generate diverse in-domain text through rewriting, without relying on additional text data.
arXiv Detail & Related papers (2024-11-20T09:49:37Z) - Synthio: Augmenting Small-Scale Audio Classification Datasets with Synthetic Data [69.7174072745851]
We present Synthio, a novel approach for augmenting small-scale audio classification datasets with synthetic data.
To overcome the first challenge, we align the generations of the T2A model with the small-scale dataset using preference optimization.
To address the second challenge, we propose a novel caption generation technique that leverages the reasoning capabilities of Large Language Models.
arXiv Detail & Related papers (2024-10-02T22:05:36Z) - Improved Child Text-to-Speech Synthesis through Fastpitch-based Transfer
Learning [3.5032870024762386]
This paper presents a novel approach that leverages the Fastpitch text-to-speech (TTS) model for generating high-quality synthetic child speech.
The approach involved finetuning a multi-speaker TTS model to work with child speech.
We conducted an objective assessment that showed a significant correlation between real and synthetic child voices.
arXiv Detail & Related papers (2023-11-07T19:31:44Z) - Text Generation with Speech Synthesis for ASR Data Augmentation [17.348764629839636]
We explore text augmentation for Automatic Speech Recognition (ASR) using large-scale pre-trained neural networks.
We find that neural models achieve 9%-15% relative WER improvement and outperform traditional methods.
arXiv Detail & Related papers (2023-05-22T18:45:20Z) - Code-Switching Text Generation and Injection in Mandarin-English ASR [57.57570417273262]
We investigate text generation and injection for improving the performance of an industry commonly-used streaming model, Transformer-Transducer (T-T)
We first propose a strategy to generate code-switching text data and then investigate injecting generated text into T-T model explicitly by Text-To-Speech (TTS) conversion or implicitly by tying speech and text latent spaces.
Experimental results on the T-T model trained with a dataset containing 1,800 hours of real Mandarin-English code-switched speech show that our approaches to inject generated code-switching text significantly boost the performance of T-T models.
arXiv Detail & Related papers (2023-03-20T09:13:27Z) - Miipher: A Robust Speech Restoration Model Integrating Self-Supervised
Speech and Text Representations [51.89856133895233]
Speech restoration (SR) is a task of converting degraded speech signals into high-quality ones.
In this study, we propose a robust SR model called Miipher, and apply Miipher to a new SR application.
To make our SR model robust against various degradation, we use (i) a speech representation extracted from w2v-BERT for the input feature, and (ii) a text representation extracted from transcripts via PnG-BERT as a linguistic conditioning feature.
arXiv Detail & Related papers (2023-03-03T01:57:16Z) - Data Augmentation for Low-Resource Quechua ASR Improvement [2.260916274164351]
Deep learning methods have made it possible to deploy systems with word error rates below 5% for ASR of English.
For so-called low-resource languages, methods of creating new resources on the basis of existing ones are being investigated.
We describe our data augmentation approach to improve the results of ASR models for low-resource and agglutinative languages.
arXiv Detail & Related papers (2022-07-14T12:49:15Z) - Distribution augmentation for low-resource expressive text-to-speech [18.553812159109253]
This paper presents a novel data augmentation technique for text-to-speech (TTS)
It allows to generate new (text, audio) training examples without requiring any additional data.
arXiv Detail & Related papers (2022-02-13T21:19:31Z) - Neural Model Reprogramming with Similarity Based Mapping for
Low-Resource Spoken Command Recognition [71.96870151495536]
We propose a novel adversarial reprogramming (AR) approach for low-resource spoken command recognition (SCR)
The AR procedure aims to modify the acoustic signals (from the target domain) to repurpose a pretrained SCR model.
We evaluate the proposed AR-SCR system on three low-resource SCR datasets, including Arabic, Lithuanian, and dysarthric Mandarin speech.
arXiv Detail & Related papers (2021-10-08T05:07:35Z) - Factorized Neural Transducer for Efficient Language Model Adaptation [51.81097243306204]
We propose a novel model, factorized neural Transducer, by factorizing the blank and vocabulary prediction.
It is expected that this factorization can transfer the improvement of the standalone language model to the Transducer for speech recognition.
We demonstrate that the proposed factorized neural Transducer yields 15% to 20% WER improvements when out-of-domain text data is used for language model adaptation.
arXiv Detail & Related papers (2021-09-27T15:04:00Z) - Bootstrap an end-to-end ASR system by multilingual training, transfer
learning, text-to-text mapping and synthetic audio [8.510792628268824]
bootstrapping speech recognition on limited data resources has been an area of active research for long.
We investigate here the effectiveness of different strategies to bootstrap an RNN-Transducer based automatic speech recognition (ASR) system in the low resource regime.
Our experiments demonstrate that transfer learning from a multilingual model, using a post-ASR text-to-text mapping and synthetic audio deliver additive improvements.
arXiv Detail & Related papers (2020-11-25T13:11:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.