Related papers: An Exhaustive Evaluation of TTS- and VC-based Data Augmentation for ASR

An Exhaustive Evaluation of TTS- and VC-based Data Augmentation for ASR

URL: http://arxiv.org/abs/2503.08954v1
Date: Tue, 11 Mar 2025 23:09:06 GMT
Title: An Exhaustive Evaluation of TTS- and VC-based Data Augmentation for ASR
Authors: Sewade Ogun, Vincent Colotte, Emmanuel Vincent,
Abstract summary: Augmenting the training data of automatic speech recognition systems with synthetic data generated by text-to-speech (TTS) or voice conversion (VC) has gained popularity in recent years.<n>We leverage recently proposed flow-based TTS/VC models allowing greater speech diversity, and assess the respective impact of augmenting various speech attributes on the word error rate (WER) achieved by several ASR models.
Score: 12.197936305117407
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Augmenting the training data of automatic speech recognition (ASR) systems with synthetic data generated by text-to-speech (TTS) or voice conversion (VC) has gained popularity in recent years. Several works have demonstrated improvements in ASR performance using this augmentation approach. However, because of the lower diversity of synthetic speech, naively combining synthetic and real data often does not yield the best results. In this work, we leverage recently proposed flow-based TTS/VC models allowing greater speech diversity, and assess the respective impact of augmenting various speech attributes on the word error rate (WER) achieved by several ASR models. Pitch augmentation and VC-based speaker augmentation are found to be ineffective in our setup. Jointly augmenting all other attributes reduces the WER of a Conformer-Transducer model by 11\% relative on Common Voice and by up to 35\% relative on LibriSpeech compared to training on real data only.

Related papers

Enhancing Low-Resource ASR through Versatile TTS: Bridging the Data Gap [46.607944227556]
We propose a cost-effective and practical approach to enhancing automatic speech recognition (ASR) performance using text-to-speech (TTS) models. Experiments on an unprecedentedly rich variety of low-resource datasets demonstrate consistent and substantial performance improvements. We study factors such as text diversity, speaker diversity, and the volume of synthesized data, with text diversity being studied for the first time in this work.
arXiv Detail & Related papers (2024-10-22T06:25:16Z)
Homogeneous Speaker Features for On-the-Fly Dysarthric and Elderly Speaker Adaptation [71.31331402404662]
This paper proposes two novel data-efficient methods to learn dysarthric and elderly speaker-level features. Speaker-regularized spectral basis embedding-SBE features that exploit a special regularization term to enforce homogeneity of speaker features in adaptation. Feature-based learning hidden unit contributions (f-LHUC) that are conditioned on VR-LH features that are shown to be insensitive to speaker-level data quantity in testtime adaptation.
arXiv Detail & Related papers (2024-07-08T18:20:24Z)
Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer [59.57249127943914]
We present a multilingual Audio-Visual Speech Recognition model incorporating several enhancements to improve performance and audio noise robustness. We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets. Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching WER of 0.8%.
arXiv Detail & Related papers (2024-03-14T01:16:32Z)
Reduce, Reuse, Recycle: Is Perturbed Data better than Other Language augmentation for Low Resource Self-Supervised Speech Models [48.44820587495038]
Self-supervised representation learning (SSRL) has demonstrated superior performance than supervised models for tasks including phoneme recognition. Training SSRL models poses a challenge for low-resource languages where sufficient pre-training data may not be available. We propose to use audio augmentation techniques, namely: pitch variation, noise addition, accented target language and other language speech to pre-train SSRL models in a low resource condition and evaluate phoneme recognition.
arXiv Detail & Related papers (2023-09-22T10:09:09Z)
Improving Code-Switching and Named Entity Recognition in ASR with Speech Editing based Data Augmentation [22.38340990398735]
We propose a novel data augmentation method by applying the text-based speech editing model. The experimental results on code-switching and NER tasks show that our proposed method can significantly outperform the audio splicing and neural TTS based data augmentation systems.
arXiv Detail & Related papers (2023-06-14T15:50:13Z)
Towards Selection of Text-to-speech Data to Augment ASR Training [20.115236045164355]
We train a neural network to measure the similarity of a synthetic data to real speech. We find that incorporating synthetic samples with considerable dissimilarity to real speech is crucial for boosting recognition performance.
arXiv Detail & Related papers (2023-05-30T17:24:28Z)
Cross-Speaker Emotion Transfer for Low-Resource Text-to-Speech Using Non-Parallel Voice Conversion with Pitch-Shift Data Augmentation [19.807274303199755]
We propose a novel data augmentation method that combines pitch-shifting and VC techniques. Because pitch-shift data augmentation enables the coverage of a variety of pitch dynamics, it greatly stabilizes training for both VC and TTS models. Subjective test results showed that a FastSpeech 2-based emotional TTS system with the proposed method improved naturalness and emotional similarity compared with conventional methods.
arXiv Detail & Related papers (2022-04-21T11:03:37Z)
ASR data augmentation in low-resource settings using cross-lingual multi-speaker TTS and cross-lingual voice conversion [49.617722668505834]
We show that our approach permits the application of speech synthesis and voice conversion to improve ASR systems using only one target-language speaker during model training. It is possible to obtain promising ASR training results with our data augmentation method using only a single real speaker in a target language.
arXiv Detail & Related papers (2022-03-29T11:55:30Z)
MixSpeech: Data Augmentation for Low-resource Automatic Speech Recognition [54.84624870942339]
MixSpeech is a simple yet effective data augmentation method based on mixup for automatic speech recognition (ASR) We apply MixSpeech on two popular end-to-end speech recognition models including LAS (Listen, Attend and Spell) and Transformer. Experimental results show that MixSpeech achieves better accuracy than the baseline models without data augmentation.
arXiv Detail & Related papers (2021-02-25T03:40:43Z)
You Do Not Need More Data: Improving End-To-End Speech Recognition by Text-To-Speech Data Augmentation [59.31769998728787]
We build our TTS system on an ASR training database and then extend the data with synthesized speech to train a recognition model. Our system establishes a competitive result for end-to-end ASR trained on LibriSpeech train-clean-100 set with WER 4.3% for test-clean and 13.5% for test-other.
arXiv Detail & Related papers (2020-05-14T17:24:57Z)

This list is automatically generated from the titles and abstracts of the papers in this site.