Enhancing Low-Resource ASR through Versatile TTS: Bridging the Data Gap
- URL: http://arxiv.org/abs/2410.16726v1
- Date: Tue, 22 Oct 2024 06:25:16 GMT
- Title: Enhancing Low-Resource ASR through Versatile TTS: Bridging the Data Gap
- Authors: Guanrou Yang, Fan Yu, Ziyang Ma, Zhihao Du, Zhifu Gao, Shiliang Zhang, Xie Chen
- Abstract summary: We propose a cost-effective and practical approach to enhancing automatic speech recognition (ASR) performance using text-to-speech (TTS) models.
Experiments on an unprecedentedly rich variety of low-resource datasets demonstrate consistent and substantial performance improvements.
We study factors such as text diversity, speaker diversity, and the volume of synthesized data, with text diversity being studied for the first time in this work.
- Abstract: While automatic speech recognition (ASR) systems have achieved remarkable performance with large-scale datasets, their efficacy remains inadequate in low-resource settings, encompassing dialects, accents, minority languages, and long-tail hotwords, domains with significant practical relevance. With the advent of versatile and powerful text-to-speech (TTS) models, capable of generating speech with human-level naturalness, expressiveness, and diverse speaker profiles, leveraging TTS for ASR data augmentation provides a cost-effective and practical approach to enhancing ASR performance. Comprehensive experiments on an unprecedentedly rich variety of low-resource datasets demonstrate consistent and substantial performance improvements, proving that the proposed method of enhancing low-resource ASR through a versatile TTS model is highly effective and has broad application prospects. Furthermore, we delve deeper into key characteristics of synthesized speech data that contribute to ASR improvement, examining factors such as text diversity, speaker diversity, and the volume of synthesized data, with text diversity being studied for the first time in this work. We hope our findings provide helpful guidance and reference for the practical application of TTS-based data augmentation and push the advancement of low-resource ASR one step further.
Related papers
- Selective Attention Merging for low resource tasks: A case study of Child ASR [14.178224954581069]
Speech Foundation Models (SFMs) excel in various speech tasks, but their performance for low-resource tasks is hampered by limited pretraining data.
This paper introduces Selective Attention (SA) Merge, a novel method that selectively merges task vectors to enhance SFM performance on low-resource tasks.
Experiments on the MyST database show relative word error rate reductions of up to 14%, outperforming existing model merging and data augmentation techniques.
(2025-01-14T22:27:48Z)
- MSA-ASR: Efficient Multilingual Speaker Attribution with frozen ASR Models [59.80042864360884]
Speaker-attributed automatic speech recognition (SA-ASR) aims to transcribe speech while assigning transcripts to the corresponding speakers accurately.
This paper introduces a novel approach, leveraging a frozen multilingual ASR model to incorporate speaker attribution into the transcriptions.
(2024-11-27T09:01:08Z)
- Automatic Speech Recognition using Advanced Deep Learning Approaches: A survey [2.716339075963185]
Recent advancements in deep learning (DL) have posed a significant challenge for automatic speech recognition (ASR).
ASR relies on extensive training datasets, including confidential ones, and demands substantial computational and storage resources.
Advanced DL techniques like deep transfer learning (DTL), federated learning (FL), and reinforcement learning (RL) address these issues.
(2024-03-02T16:25:42Z)
- LLM-DA: Data Augmentation via Large Language Models for Few-Shot Named Entity Recognition [67.96794382040547]
LLM-DA is a novel data augmentation technique based on large language models (LLMs) for the few-shot NER task.
Our approach involves employing 14 contextual rewriting strategies, designing entity replacements of the same type, and incorporating noise injection to enhance robustness.
(2024-02-22T14:19:56Z)
- Text Generation with Speech Synthesis for ASR Data Augmentation [17.348764629839636]
We explore text augmentation for Automatic Speech Recognition (ASR) using large-scale pre-trained neural networks.
We find that neural models achieve 9%-15% relative WER improvement and outperform traditional methods.
(2023-05-22T18:45:20Z)
- ASR data augmentation in low-resource settings using cross-lingual multi-speaker TTS and cross-lingual voice conversion [49.617722668505834]
We show that our approach permits the application of speech synthesis and voice conversion to improve ASR systems using only one target-language speaker during model training.
It is possible to obtain promising ASR training results with our data augmentation method using only a single real speaker in a target language.
(2022-03-29T11:55:30Z)
- ASR-Aware End-to-end Neural Diarization [15.172086811068962]
We present a Conformer-based end-to-end neural diarization (EEND) model that uses both acoustic input and features derived from an automatic speech recognition (ASR) model.
Three modifications to the Conformer-based EEND architecture are proposed to incorporate the features.
Experiments on the two-speaker English conversations of Switchboard+SRE data sets show that multi-task learning with position-in-word information is the most effective way of utilizing ASR features.
(2022-02-02T21:17:14Z)
- SDA: Improving Text Generation with Self Data Augmentation [88.24594090105899]
We propose to improve the standard maximum likelihood estimation (MLE) paradigm by incorporating a self-imitation-learning phase for automatic data augmentation.
Unlike most existing sentence-level augmentation strategies, our method is more general and could be easily adapted to any MLE-based training procedure.
(2021-01-02T01:15:57Z)
- LRSpeech: Extremely Low-Resource Speech Synthesis and Recognition [148.43282526983637]
We develop LRSpeech, a TTS and ASR system that supports rare languages at low data cost.
We conduct experiments on an experimental language (English) and a truly low-resource language (Lithuanian) to verify the effectiveness of LRSpeech.
We are currently deploying LRSpeech into a commercialized cloud speech service to support TTS on more rare languages.
(2020-08-09T08:16:33Z)