Multiple-hypothesis RNN-T Loss for Unsupervised Fine-tuning and
Self-training of Neural Transducer
- URL: http://arxiv.org/abs/2207.14736v1
- Date: Fri, 29 Jul 2022 15:14:03 GMT
- Title: Multiple-hypothesis RNN-T Loss for Unsupervised Fine-tuning and
Self-training of Neural Transducer
- Authors: Cong-Thanh Do, Mohan Li, and Rama Doddipatla
- Abstract summary: This paper proposes a new approach to perform unsupervised fine-tuning and self-training using unlabeled speech data.
For the fine-tuning task, ASR models are trained using supervised data from Wall Street Journal (WSJ), Aurora-4 along with CHiME-4 real noisy data as unlabeled data.
For the self-training task, ASR models are trained using supervised data from Wall Street Journal (WSJ), Aurora-4 along with CHiME-4 real noisy data as unlabeled data.
- Score: 20.8850874806462
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper proposes a new approach to perform unsupervised fine-tuning and
self-training using unlabeled speech data for recurrent neural network
(RNN)-Transducer (RNN-T) end-to-end (E2E) automatic speech recognition (ASR)
systems. Conventional systems perform fine-tuning/self-training using ASR
hypothesis as the targets when using unlabeled audio data and are susceptible
to the ASR performance of the base model. Here in order to alleviate the
influence of ASR errors while using unlabeled data, we propose a
multiple-hypothesis RNN-T loss that incorporates multiple ASR 1-best hypotheses
into the loss function. For the fine-tuning task, ASR experiments on
Librispeech show that the multiple-hypothesis approach achieves a relative
reduction of 14.2% word error rate (WER) when compared to the single-hypothesis
approach, on the test_other set. For the self-training task, ASR models are
trained using supervised data from Wall Street Journal (WSJ), Aurora-4 along
with CHiME-4 real noisy data as unlabeled data. The multiple-hypothesis
approach yields a relative reduction of 3.3% WER on the CHiME-4's
single-channel real noisy evaluation set when compared with the
single-hypothesis approach.
Related papers
- Crossmodal ASR Error Correction with Discrete Speech Units [16.58209270191005]
We propose a post-ASR processing approach for ASR Error Correction (AEC)
We explore pre-training and fine-tuning strategies and uncover an ASR domain discrepancy phenomenon.
We propose the incorporation of discrete speech units to align with and enhance the word embeddings for improving AEC quality.
arXiv Detail & Related papers (2024-05-26T19:58:38Z) - UCorrect: An Unsupervised Framework for Automatic Speech Recognition
Error Correction [18.97378605403447]
We propose UCorrect, an unsupervised Detector-Generator-Selector framework for ASR Error Correction.
Experiments on the public AISHELL-1 dataset and WenetSpeech dataset show the effectiveness of UCorrect.
arXiv Detail & Related papers (2024-01-11T06:30:07Z) - HyPoradise: An Open Baseline for Generative Speech Recognition with
Large Language Models [81.56455625624041]
We introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction.
The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses.
LLMs with reasonable prompt and its generative capability can even correct those tokens that are missing in N-best list.
arXiv Detail & Related papers (2023-09-27T14:44:10Z) - An Experimental Study on Private Aggregation of Teacher Ensemble
Learning for End-to-End Speech Recognition [51.232523987916636]
Differential privacy (DP) is one data protection avenue to safeguard user information used for training deep models by imposing noisy distortion on privacy data.
In this work, we extend PATE learning to work with dynamic patterns, namely speech, and perform one very first experimental study on ASR to avoid acoustic data leakage.
arXiv Detail & Related papers (2022-10-11T16:55:54Z) - Error Correction in ASR using Sequence-to-Sequence Models [32.41875780785648]
Post-editing in Automatic Speech Recognition entails automatically correcting common and systematic errors produced by the ASR system.
We propose to use a powerful pre-trained sequence-to-sequence model, BART, to serve as a denoising model.
Experimental results on accented speech data demonstrate that our strategy effectively rectifies a significant number of ASR errors.
arXiv Detail & Related papers (2022-02-02T17:32:59Z) - Sequence-level self-learning with multiple hypotheses [53.04725240411895]
We develop new self-learning techniques with an attention-based sequence-to-sequence (seq2seq) model for automatic speech recognition (ASR)
In contrast to conventional unsupervised learning approaches, we adopt the emphmulti-task learning (MTL) framework.
Our experiment results show that our method can reduce the WER on the British speech data from 14.55% to 10.36% compared to the baseline model trained with the US English data only.
arXiv Detail & Related papers (2021-12-10T20:47:58Z) - Attention-based Multi-hypothesis Fusion for Speech Summarization [83.04957603852571]
Speech summarization can be achieved by combining automatic speech recognition (ASR) and text summarization (TS)
ASR errors directly affect the quality of the output summary in the cascade approach.
We propose a cascade speech summarization model that is robust to ASR errors and that exploits multiple hypotheses generated by ASR to attenuate the effect of ASR errors on the summary.
arXiv Detail & Related papers (2021-11-16T03:00:29Z) - Fine-tuning of Pre-trained End-to-end Speech Recognition with Generative
Adversarial Networks [10.723935272906461]
Adversarial training of end-to-end (E2E) ASR systems using generative adversarial networks (GAN) has recently been explored.
We introduce a novel framework for fine-tuning a pre-trained ASR model using the GAN objective.
Our proposed approach outperforms baselines and conventional GAN-based adversarial models.
arXiv Detail & Related papers (2021-03-10T17:40:48Z) - Unsupervised Domain Adaptation for Speech Recognition via Uncertainty
Driven Self-Training [55.824641135682725]
Domain adaptation experiments using WSJ as a source domain and TED-LIUM 3 as well as SWITCHBOARD show that up to 80% of the performance of a system trained on ground-truth data can be recovered.
arXiv Detail & Related papers (2020-11-26T18:51:26Z) - Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU)
We show that the error rates of off the shelf ASR and following LU systems can be reduced significantly by 14% relative with joint models trained using small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.