Contextual Adapters for Personalized Speech Recognition in Neural
Transducers
- URL: http://arxiv.org/abs/2205.13660v1
- Date: Thu, 26 May 2022 22:46:28 GMT
- Title: Contextual Adapters for Personalized Speech Recognition in Neural
Transducers
- Authors: Kanthashree Mysore Sathyendra, Thejaswi Muniyappa, Feng-Ju Chang, Jing
Liu, Jinru Su, Grant P. Strimel, Athanasios Mouchtaris, Siegfried Kunzmann
- Abstract summary: We propose training neural contextual adapters for personalization in neural transducer based ASR models.
Our approach can not only bias towards user-defined words, but also has the flexibility to work with pretrained ASR models.
- Score: 16.628830937429388
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Personal rare word recognition in end-to-end Automatic Speech Recognition
(E2E ASR) models is a challenge due to the lack of training data. A standard
way to address this issue is with shallow fusion methods at inference time.
However, due to their dependence on external language models and the
deterministic approach to weight boosting, their performance is limited. In
this paper, we propose training neural contextual adapters for personalization
in neural transducer based ASR models. Our approach can not only bias towards
user-defined words, but also has the flexibility to work with pretrained ASR
models. Using an in-house dataset, we demonstrate that contextual adapters can
be applied to any general purpose pretrained ASR model to improve
personalization. Our method outperforms shallow fusion, while retaining
functionality of the pretrained models by not altering any of the model
weights. We further show that adapter-style training is superior to
full fine-tuning of the ASR models on datasets with user-defined content.
Related papers
- Boosting Continual Learning of Vision-Language Models via Mixture-of-Experts Adapters [65.15700861265432]
We present a parameter-efficient continual learning framework to alleviate long-term forgetting in incremental learning with vision-language models.
Our approach involves the dynamic expansion of a pre-trained CLIP model, through the integration of Mixture-of-Experts (MoE) adapters.
To preserve the zero-shot recognition capability of vision-language models, we introduce a Distribution Discriminative Auto-Selector.
arXiv Detail & Related papers (2024-03-18T08:00:23Z)
- Secrets of RLHF in Large Language Models Part II: Reward Modeling [134.97964938009588]
We introduce a series of novel methods to mitigate the influence of incorrect and ambiguous preferences in the dataset.
We also introduce contrastive learning to enhance the ability of reward models to distinguish between chosen and rejected responses.
arXiv Detail & Related papers (2024-01-11T17:56:59Z)
- Training dynamic models using early exits for automatic speech recognition on resource-constrained devices [15.879328412777008]
Early-exit architectures enable the development of dynamic models capable of adapting their size and architecture to varying levels of computational resources and ASR performance demands.
We show that early-exit models trained from scratch not only preserve performance when using fewer encoder layers but also exhibit enhanced task accuracy compared to single-exit or pre-trained models.
Results provide insights into the training dynamics of early-exit architectures for ASR models.
arXiv Detail & Related papers (2023-09-18T07:45:16Z)
- Adapting an Unadaptable ASR System [40.402050390096456]
We consider the recently released OpenAI Whisper ASR as an example of a large-scale ASR system to assess adaptation methods.
An error correction based approach is adopted, as this does not require access to the model.
The generalization ability of the system in two distinct dimensions is then evaluated.
arXiv Detail & Related papers (2023-06-01T23:54:11Z)
- Continual Learning for On-Device Speech Recognition using Disentangled Conformers [54.32320258055716]
We introduce a continual learning benchmark for speaker-specific domain adaptation derived from LibriVox audiobooks.
We propose a novel compute-efficient continual learning algorithm called DisentangledCL.
Our experiments show that the DisConformer models significantly outperform baselines on general ASR.
arXiv Detail & Related papers (2022-12-02T18:58:51Z)
- Factorized Neural Transducer for Efficient Language Model Adaptation [51.81097243306204]
We propose a novel model, factorized neural Transducer, by factorizing the blank and vocabulary prediction.
It is expected that this factorization can transfer the improvement of the standalone language model to the Transducer for speech recognition.
We demonstrate that the proposed factorized neural Transducer yields 15% to 20% WER improvements when out-of-domain text data is used for language model adaptation.
arXiv Detail & Related papers (2021-09-27T15:04:00Z)
- A Light-weight contextual spelling correction model for customizing transducer-based speech recognition systems [42.05399301143457]
We introduce a light-weight contextual spelling correction model to correct context-related recognition errors.
Experiments show that the model improves baseline ASR model performance with about 50% relative word error rate reduction.
The model also shows excellent performance for out-of-vocabulary terms not seen during training.
arXiv Detail & Related papers (2021-08-17T08:14:37Z)
- Momentum Pseudo-Labeling for Semi-Supervised Speech Recognition [55.362258027878966]
We present momentum pseudo-labeling (MPL) as a simple yet effective strategy for semi-supervised speech recognition.
MPL consists of a pair of online and offline models that interact and learn from each other, inspired by the mean teacher method.
The experimental results demonstrate that MPL effectively improves over the base model and is scalable to different semi-supervised scenarios.
arXiv Detail & Related papers (2021-06-16T16:24:55Z)
- Fine-tuning of Pre-trained End-to-end Speech Recognition with Generative Adversarial Networks [10.723935272906461]
Adversarial training of end-to-end (E2E) ASR systems using generative adversarial networks (GAN) has recently been explored.
We introduce a novel framework for fine-tuning a pre-trained ASR model using the GAN objective.
Our proposed approach outperforms baselines and conventional GAN-based adversarial models.
arXiv Detail & Related papers (2021-03-10T17:40:48Z)
- Efficiently Fusing Pretrained Acoustic and Linguistic Encoders for Low-resource Speech Recognition [9.732767611907068]
In this work, we fuse a pre-trained acoustic encoder (wav2vec2.0) and a pre-trained linguistic encoder (BERT) into an end-to-end ASR model.
Our model achieves better recognition performance on the CALLHOME corpus (15 hours) than other end-to-end models.
arXiv Detail & Related papers (2021-01-17T16:12:44Z)
- Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and subsequent LU systems can be reduced significantly, by 14% relative, with joint models trained using small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.