A Simple Baseline for Domain Adaptation in End to End ASR Systems Using
Synthetic Data
- URL: http://arxiv.org/abs/2206.13240v1
- Date: Wed, 22 Jun 2022 12:07:38 GMT
- Authors: Raviraj Joshi, Anupam Singh
- Abstract summary: We propose a simple baseline technique for domain adaptation in end-to-end speech recognition models.
We convert the text-only corpus to audio data using a single-speaker Text-to-Speech (TTS) engine.
We show that single-speaker synthetic TTS data, coupled with fine-tuning only the final dense layer, provides reasonable improvements in word error rate.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic Speech Recognition (ASR) has been dominated by deep learning-based
end-to-end speech recognition models. These approaches require large amounts of
labeled data in the form of audio-text pairs. Moreover, these models are more
susceptible to domain shift than traditional models. It is common
practice to train generic ASR models and then adapt them to target domains
using comparatively smaller data sets. We consider a more extreme case of
domain adaptation where only a text corpus is available. In this work, we propose
a simple baseline technique for domain adaptation in end-to-end speech
recognition models. We convert the text-only corpus to audio data using a
single-speaker Text-to-Speech (TTS) engine. The resulting parallel data in the
target domain is then used to fine-tune the final dense layer of generic ASR
models. We show that single-speaker synthetic TTS data, coupled with
fine-tuning only the final dense layer, provides reasonable improvements in
word error rates. We use text data from the address and e-commerce search
domains to show the effectiveness of our low-cost baseline approach on CTC-
and attention-based models.
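The recipe above (synthesize target-domain audio with a single-speaker TTS engine, then update only the last dense layer of a generic ASR model) can be sketched as follows. This is a minimal illustration with hypothetical shapes and a stand-in frozen encoder, not the authors' implementation: with the encoder frozen, last-layer fine-tuning reduces to training a linear softmax classifier on fixed encoder features.

```python
import numpy as np

rng = np.random.default_rng(0)

n_feat, n_hidden, n_tokens = 16, 32, 10
W_enc = rng.normal(size=(n_feat, n_hidden))           # frozen generic encoder
W_out = rng.normal(size=(n_hidden, n_tokens)) * 0.01  # trainable final dense layer

def frozen_encoder(x):
    # Stand-in for the pretrained ASR encoder up to its last layer;
    # its weights are never updated during adaptation.
    return np.tanh(x @ W_enc)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ce_loss(feats, labels):
    p = softmax(feats @ W_out)
    return -np.log(p[np.arange(len(labels)), labels]).mean()

# Hypothetical target-domain batch: per-frame features that in practice
# would come from single-speaker TTS audio, with per-frame token labels.
frames = rng.normal(size=(256, n_feat))
labels = rng.integers(0, n_tokens, size=256)

feats = frozen_encoder(frames)   # fixed: no gradient reaches W_enc
loss0 = ce_loss(feats, labels)

lr = 0.5
for _ in range(200):
    probs = softmax(feats @ W_out)
    grad = probs.copy()
    grad[np.arange(len(labels)), labels] -= 1.0
    W_out -= lr * (feats.T @ grad) / len(labels)   # only W_out is updated

loss1 = ce_loss(feats, labels)
print(f"cross-entropy before/after last-layer fine-tuning: {loss0:.3f} -> {loss1:.3f}")
```

In the real pipeline the frames would be produced by rendering the target-domain text through a TTS engine, and the objective would be a CTC or attention-based sequence loss rather than framewise cross-entropy; the key point is that gradients touch only the final dense layer.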
Related papers
- Generating Data with Text-to-Speech and Large-Language Models for Conversational Speech Recognition
We propose a synthetic data generation pipeline for multi-speaker conversational ASR.
We conduct evaluation by fine-tuning the Whisper ASR model for telephone and distant conversational speech settings.
arXiv Detail & Related papers (2024-08-17T14:47:05Z)
- Improving Neural Biasing for Contextual Speech Recognition by Early Context Injection and Text Perturbation
We propose two techniques to improve context-aware ASR models.
On LibriSpeech, our techniques together reduce the rare word error rate by 60% and 25% relatively compared to no biasing and shallow fusion.
On SPGISpeech and a real-world dataset ConEC, our techniques also yield good improvements over the baselines.
arXiv Detail & Related papers (2024-07-14T19:32:33Z)
- Task Arithmetic can Mitigate Synthetic-to-Real Gap in Automatic Speech Recognition
We show that task vector arithmetic is effective at mitigating the synthetic-to-real gap in speech recognition.
Our proposed method, SYN2REAL, shows an average improvement of 10.03% in word error rate over baselines.
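Task-vector arithmetic of this general kind can be sketched as below. This is an illustrative toy, not the exact SYN2REAL recipe: a task vector is the element-wise difference between fine-tuned and pretrained weights, and adding or subtracting scaled task vectors edits the model without further training. All names and values here are hypothetical.

```python
def task_vector(finetuned, pretrained):
    """Per-parameter difference: what fine-tuning added to the model."""
    return {k: finetuned[k] - pretrained[k] for k in pretrained}

def apply_task_vectors(pretrained, vectors, scale=1.0):
    """Add a scaled sum of task vectors to the pretrained weights."""
    out = dict(pretrained)
    for vec in vectors:
        for k, v in vec.items():
            out[k] = out[k] + scale * v
    return out

# Toy model with one scalar weight per layer (real models use tensors).
pre = {"enc.w": 1.0, "head.w": 0.5}
ft_real = {"enc.w": 1.4, "head.w": 0.9}   # fine-tuned on real speech
ft_syn = {"enc.w": 1.1, "head.w": 0.6}    # fine-tuned on synthetic speech

v_real = task_vector(ft_real, pre)
v_syn = task_vector(ft_syn, pre)

# Emphasize the real-speech direction and remove the synthetic one.
edited = apply_task_vectors(pre, [v_real, {k: -v for k, v in v_syn.items()}])
print(edited)
```

For real checkpoints the dictionaries would hold weight tensors rather than floats, and the scaling factor would be tuned on a held-out set.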
arXiv Detail & Related papers (2024-06-05T04:25:56Z)
- Deepfake audio as a data augmentation technique for training automatic speech to text transcription models
We propose a framework that approaches data augmentation based on deepfake audio.
A dataset of English speech recorded by Indian speakers was selected, ensuring the presence of a single accent.
arXiv Detail & Related papers (2023-09-22T11:33:03Z)
- Corpus Synthesis for Zero-shot ASR domain Adaptation using Large Language Models
We propose a new strategy for adapting ASR models to new target domains without any text or speech from those domains.
Experiments on the SLURP dataset show that the proposed method achieves an average relative word error rate improvement of 28% on unseen target domains.
arXiv Detail & Related papers (2023-09-18T15:43:08Z)
- Text-Only Domain Adaptation for End-to-End Speech Recognition through Down-Sampling Acoustic Representation
Mapping the two modalities, speech and text, into a shared representation space is one line of research on using text-only data to improve end-to-end automatic speech recognition (ASR) performance in new domains.
In this paper, we propose a novel representation-matching strategy that down-samples the acoustic representation to align it with the text modality.
Our ASR model can learn unified representations from both modalities better, allowing for domain adaptation using text-only data of the target domain.
arXiv Detail & Related papers (2023-09-04T08:52:59Z)
- Continual Learning for On-Device Speech Recognition using Disentangled Conformers
We introduce a continual learning benchmark for speaker-specific domain adaptation derived from LibriVox audiobooks.
We propose a novel compute-efficient continual learning algorithm called DisentangledCL.
Our experiments show that the DisConformer models significantly outperform baselines on general ASR.
arXiv Detail & Related papers (2022-12-02T18:58:51Z)
- Improving Data Driven Inverse Text Normalization using Data Augmentation
Inverse text normalization (ITN) is used to convert the spoken form output of an automatic speech recognition (ASR) system to a written form.
We present a data augmentation technique that effectively generates rich spoken-written numeric pairs from out-of-domain textual data.
We empirically demonstrate that an ITN model trained using our data augmentation technique consistently outperforms an ITN model trained using only in-domain data.
arXiv Detail & Related papers (2022-07-20T06:07:26Z)
- LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech
We present LDNet, a unified framework for mean opinion score (MOS) prediction.
We propose two inference methods that provide more stable results and efficient computation.
arXiv Detail & Related papers (2021-10-18T08:52:31Z)
- A study on the efficacy of model pre-training in developing neural text-to-speech system
This study aims to understand better why and how model pre-training can positively contribute to TTS system performance.
It is found that the TTS system could achieve comparable performance when the pre-training data is reduced to 1/8 of its original size.
arXiv Detail & Related papers (2021-10-08T02:09:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.