Related papers: Task Arithmetic can Mitigate Synthetic-to-Real Gap in Automatic Speech Recognition

URL: http://arxiv.org/abs/2406.02925v3
Date: Sat, 05 Oct 2024 09:06:11 GMT
Title: Task Arithmetic can Mitigate Synthetic-to-Real Gap in Automatic Speech Recognition
Authors: Hsuan Su, Hua Farn, Fan-Yun Sun, Shang-Tse Chen, Hung-yi Lee
Abstract summary: We show that task vector arithmetic is effective at mitigating the synthetic-to-real gap in speech recognition. Our proposed method, SYN2REAL, shows an average improvement of 10.03% in word error rate over baselines.
Score: 44.914084799875866
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Synthetic data is widely used in speech recognition due to the availability of text-to-speech models, which facilitate adapting models to previously unseen text domains. However, existing methods that fine-tune an automatic speech recognition (ASR) model on synthetic data lose performance because of the distributional shift commonly referred to as the synthetic-to-real gap. In this paper, we find that task vector arithmetic is effective at mitigating this gap. Our proposed method, SYN2REAL task vector, shows an average improvement of 10.03% in word error rate over baselines on the SLURP dataset. Additionally, we show that an average of SYN2REAL task vectors, when we have real speech from multiple different domains, can further adapt the original ASR model to perform better on the target text domain.
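The task vector arithmetic described in the abstract can be illustrated with a minimal sketch. The abstract does not spell out the exact construction, so the following is an assumption: a task vector is taken as the parameter difference between a model fine-tuned on real speech and one fine-tuned on synthetic speech in the same source domain, vectors from several source domains are averaged, and the result is added to a model fine-tuned on synthetic target-domain speech. The function names, the `scale` parameter, and the toy dictionaries standing in for model weights are all hypothetical.

```python
from typing import Dict, List

# Toy stand-in for a model's parameters: name -> flat list of weights.
Params = Dict[str, List[float]]


def task_vector(real_model: Params, synthetic_model: Params) -> Params:
    """Element-wise difference (real - synthetic) for each parameter."""
    return {
        name: [r - s for r, s in zip(real_model[name], synthetic_model[name])]
        for name in synthetic_model
    }


def average_vectors(vectors: List[Params]) -> Params:
    """Element-wise mean of several task vectors with identical shapes."""
    count = len(vectors)
    return {
        name: [sum(vals) / count for vals in zip(*(v[name] for v in vectors))]
        for name in vectors[0]
    }


def apply_vector(model: Params, vector: Params, scale: float = 1.0) -> Params:
    """Add a (scaled) task vector to a model's parameters."""
    return {
        name: [m + scale * v for m, v in zip(model[name], vector[name])]
        for name in model
    }


# Two hypothetical source domains, each with a real and a synthetic model.
vec_a = task_vector({"w": [1.0, 2.0]}, {"w": [0.5, 1.0]})
vec_b = task_vector({"w": [2.0, 2.0]}, {"w": [1.5, 2.0]})

# Model fine-tuned only on synthetic speech from the target domain.
target_synthetic = {"w": [3.0, 4.0]}
adapted = apply_vector(target_synthetic, average_vectors([vec_a, vec_b]))
```

In practice the dictionaries would be model state dicts holding tensors, and the scaling coefficient would typically be tuned on held-out data; both details are outside what the abstract states.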

Related papers

On the Relevance of Phoneme Duration Variability of Synthesized Training Data for Automatic Speech Recognition [0.552480439325792]
We focus on the temporal structure of synthetic data and its relation to ASR training. We show how much the degradation of synthetic data quality is influenced by duration modeling in non-autoregressive TTS. Using a simple algorithm we shift phoneme duration distributions of the TTS system closer to real durations.
arXiv Detail & Related papers (2023-10-12T08:45:21Z)
HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models [81.56455625624041]
We introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction. The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses. With a reasonable prompt, LLMs can use their generative capability to correct even those tokens that are missing from the N-best list.
arXiv Detail & Related papers (2023-09-27T14:44:10Z)
Text-Only Domain Adaptation for End-to-End Speech Recognition through Down-Sampling Acoustic Representation [67.98338382984556]
Mapping the two modalities, speech and text, into a shared representation space is one approach to using text-only data to improve end-to-end automatic speech recognition (ASR) performance in new domains. In this paper, we propose a novel representation-matching strategy that down-samples the acoustic representation to align it with the text modality. Our ASR model can better learn unified representations from both modalities, allowing for domain adaptation using text-only data of the target domain.
arXiv Detail & Related papers (2023-09-04T08:52:59Z)
Using External Off-Policy Speech-To-Text Mappings in Contextual End-To-End Automated Speech Recognition [19.489794740679024]
We investigate the potential of leveraging external knowledge, particularly through off-policy key-value stores generated with text-to-speech methods. In our approach, audio embeddings captured from text-to-speech, along with semantic text embeddings, are used to bias ASR. Experiments on LibriSpeech and in-house voice assistant/search datasets show that the proposed approach can reduce domain adaptation time by up to 1K GPU-hours.
arXiv Detail & Related papers (2023-01-06T22:32:50Z)
STOP: A dataset for Spoken Task Oriented Semantic Parsing [66.14615249745448]
End-to-end spoken language understanding (SLU) predicts intent directly from audio using a single model. We release the Spoken Task-Oriented semantic Parsing (STOP) dataset, the largest and most complex SLU dataset to be publicly available. In addition to the human-recorded audio, we are releasing a TTS-generated version to benchmark the performance for low-resource domain adaptation of end-to-end SLU systems.
arXiv Detail & Related papers (2022-06-29T00:36:34Z)
A Simple Baseline for Domain Adaptation in End to End ASR Systems Using Synthetic Data [1.14219428942199]
We propose a simple baseline technique for domain adaptation in end-to-end speech recognition models. We convert the text-only corpus to audio data using a single-speaker text-to-speech (TTS) engine. We show that single-speaker synthetic TTS data, coupled with fine-tuning only the final dense layer, provides reasonable improvements in word error rates.
arXiv Detail & Related papers (2022-06-22T12:07:38Z)
Synt++: Utilizing Imperfect Synthetic Data to Improve Speech Recognition [18.924716098922683]
Machine learning with synthetic data is not trivial due to the gap between the synthetic and the real data distributions. We propose two novel techniques during training to mitigate the problems due to the distribution gap. We show that these methods significantly improve the training of speech recognition models using synthetic data.
arXiv Detail & Related papers (2021-10-21T21:11:42Z)
Improving Readability for Automatic Speech Recognition Transcription [50.86019112545596]
We propose a novel NLP task called ASR post-processing for readability (APR). APR aims to transform the noisy ASR output into readable text for humans and downstream tasks while maintaining the semantic meaning of the speaker. We compare fine-tuned models based on several open-sourced and adapted pre-trained models with the traditional pipeline method.
arXiv Detail & Related papers (2020-04-09T09:26:42Z)
Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU). We show that the error rates of off-the-shelf ASR and the following LU systems can be reduced significantly, by 14% relative, with joint models trained using small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)

This list is automatically generated from the titles and abstracts of the papers on this site.