Using External Off-Policy Speech-To-Text Mappings in Contextual
End-To-End Automated Speech Recognition
- URL: http://arxiv.org/abs/2301.02736v1
- Date: Fri, 6 Jan 2023 22:32:50 GMT
- Title: Using External Off-Policy Speech-To-Text Mappings in Contextual
End-To-End Automated Speech Recognition
- Authors: David M. Chan, Shalini Ghosh, Ariya Rastrow, Björn Hoffmeister
- Abstract summary: We investigate the potential of leveraging external knowledge, particularly through off-policy key-value stores generated with text-to-speech methods.
In our approach, audio embeddings captured from text-to-speech, along with semantic text embeddings, are used to bias ASR.
Experiments on LibriSpeech and in-house voice assistant/search datasets show that the proposed approach can reduce domain adaptation time by up to 1K GPU-hours.
- Score: 19.489794740679024
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite improvements to the generalization performance of automated speech
recognition (ASR) models, specializing ASR models for downstream tasks remains
a challenging task, primarily due to reduced data availability (necessitating
increased data collection), and rapidly shifting data distributions (requiring
more frequent model fine-tuning). In this work, we investigate the potential of
leveraging external knowledge, particularly through off-policy key-value stores
generated with text-to-speech methods, to allow for flexible post-training
adaptation to new data distributions. In our approach, audio embeddings
captured from text-to-speech, along with semantic text embeddings, are used to
bias ASR via an approximate k-nearest-neighbor (KNN) based attentive fusion
step. Our experiments on LibriSpeech and in-house voice assistant/search
datasets show that the proposed approach can reduce domain adaptation time by
up to 1K GPU-hours while providing up to 3% WER improvement compared to a
fine-tuning baseline, suggesting a promising approach for adapting production
ASR systems in challenging zero- and few-shot scenarios.
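As a concrete illustration of the retrieval-and-fusion idea in the abstract, below is a minimal sketch of kNN-based attentive fusion over an off-policy key-value store. The store layout, embedding width, and residual fusion form are assumptions for illustration, not the authors' exact recipe.

```python
# Minimal sketch of kNN-based attentive fusion over an off-policy
# key-value store (assumed layout; not the paper's exact method).
import torch
import torch.nn.functional as F

torch.manual_seed(0)

D = 256          # shared embedding width (assumption)
N = 10_000       # number of TTS-generated store entries
K = 8            # neighbors retrieved per query

# Off-policy store: keys are TTS audio embeddings, values are the
# matching semantic text embeddings (both random placeholders here).
store_keys = F.normalize(torch.randn(N, D), dim=-1)
store_vals = torch.randn(N, D)

def knn_attentive_fusion(enc_state: torch.Tensor) -> torch.Tensor:
    """Bias one ASR encoder frame with retrieved TTS context."""
    q = F.normalize(enc_state, dim=-1)
    # Exact search for clarity; at this scale a real system would use
    # an approximate index (e.g. FAISS), as "approximate KNN" implies.
    sims = store_keys @ q                       # (N,)
    top_sims, top_idx = sims.topk(K)
    # Attentive fusion: softmax weights over neighbor similarities.
    weights = torch.softmax(top_sims, dim=-1)   # (K,)
    context = weights @ store_vals[top_idx]     # (D,)
    return enc_state + context                  # residual bias (assumption)

biased = knn_attentive_fusion(torch.randn(D))
print(biased.shape)  # torch.Size([256])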
Related papers
- Predictive Speech Recognition and End-of-Utterance Detection Towards Spoken Dialog Systems [55.99999020778169]
We study a function that can predict the forthcoming words and estimate the time remaining until the end of an utterance.
We develop a cross-attention-based algorithm that incorporates both acoustic and linguistic information.
Results demonstrate the proposed model's ability to predict upcoming words and estimate future EOU events up to 300ms prior to the actual EOU.
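A minimal sketch of the kind of cross-attention fusion this summary describes, with linguistic token states attending over acoustic frames; the dimensions, head count, and the time-to-EOU regression head are assumptions, not the paper's architecture.

```python
# Hedged sketch: cross-attention fusing acoustic and linguistic
# streams for EOU prediction (all sizes are assumptions).
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)

acoustic = torch.randn(1, 200, 256)    # encoder frames (B, T_a, D)
linguistic = torch.randn(1, 12, 256)   # token states   (B, T_l, D)

# Linguistic states query the acoustic frames.
fused, _ = attn(query=linguistic, key=acoustic, value=acoustic)

# A small head could then regress time remaining until EOU per token
# (hypothetical head, added for illustration).
eou_head = nn.Linear(256, 1)
time_to_eou = eou_head(fused).squeeze(-1)  # (B, T_l)
print(time_to_eou.shape)
```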
arXiv Detail & Related papers (2024-09-30T06:29:58Z)
- Extending Whisper with prompt tuning to target-speaker ASR [18.31992429200396]
Target-speaker automatic speech recognition (ASR) aims to transcribe the desired speech of a target speaker from overlapped utterances.
Most of the existing target-speaker ASR (TS-ASR) methods involve either training from scratch or fully fine-tuning a pre-trained model.
This work leverages prompt tuning, a parameter-efficient fine-tuning approach, to extend Whisper, a large-scale single-talker ASR model, to TS-ASR.
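A minimal sketch of prompt tuning as this summary describes it: a frozen backbone with a small trainable soft prompt prepended to its input. The backbone below is a generic Transformer stand-in, not Whisper, and all sizes are assumptions.

```python
# Hedged sketch of prompt tuning: only the soft prompt trains,
# the backbone stays frozen (stand-in model, assumed sizes).
import torch
import torch.nn as nn

D, P = 256, 16  # model width and prompt length (assumptions)

backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True),
    num_layers=2,
)
for p in backbone.parameters():
    p.requires_grad = False            # backbone stays frozen

soft_prompt = nn.Parameter(torch.randn(1, P, D) * 0.02)  # trainable

def forward(features: torch.Tensor) -> torch.Tensor:
    b = features.size(0)
    prompted = torch.cat([soft_prompt.expand(b, -1, -1), features], dim=1)
    return backbone(prompted)[:, P:]   # drop prompt positions

out = forward(torch.randn(2, 100, D))
print(out.shape)  # torch.Size([2, 100, 256])
```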
arXiv Detail & Related papers (2023-12-13T11:49:16Z)
- STOP: A dataset for Spoken Task Oriented Semantic Parsing [66.14615249745448]
End-to-end spoken language understanding (SLU) predicts intent directly from audio using a single model.
We release the Spoken Task-Oriented semantic Parsing (STOP) dataset, the largest and most complex SLU dataset to be publicly available.
In addition to the human-recorded audio, we are releasing a TTS-generated version to benchmark the performance for low-resource domain adaptation of end-to-end SLU systems.
arXiv Detail & Related papers (2022-06-29T00:36:34Z)
- Deliberation Model for On-Device Spoken Language Understanding [69.5587671262691]
We propose a novel deliberation-based approach to end-to-end (E2E) spoken language understanding (SLU).
We show that our approach can significantly reduce the degradation when moving from natural speech to synthetic speech training.
arXiv Detail & Related papers (2022-04-04T23:48:01Z)
- A Likelihood Ratio based Domain Adaptation Method for E2E Models [10.510472957585646]
End-to-end (E2E) automatic speech recognition models like the Recurrent Neural Network Transducer (RNN-T) are becoming a popular choice for streaming ASR applications like voice assistants.
While E2E models are very effective at learning representations of the data they are trained on, their accuracy on unseen domains remains a challenging problem.
In this work, we explore a contextual biasing approach using likelihood ratios that leverages text data sources to adapt the RNN-T model to new domains and entities.
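A minimal sketch of likelihood-ratio rescoring in its usual formulation: the E2E hypothesis score is shifted by a weighted log-ratio of an in-domain LM to a background LM. The weight and the toy probabilities are assumptions, not the paper's values.

```python
# Hedged sketch of likelihood-ratio biasing for hypothesis rescoring
# (lambda and the toy LM scores are illustrative assumptions).
import math

def biased_score(e2e_logp: float,
                 domain_lm_logp: float,
                 background_lm_logp: float,
                 lam: float = 0.3) -> float:
    """Shift the E2E log-probability by a weighted log-likelihood ratio."""
    return e2e_logp + lam * (domain_lm_logp - background_lm_logp)

# Toy example: a domain entity that the in-domain LM strongly prefers
# gets its score boosted relative to the background LM.
print(biased_score(e2e_logp=math.log(0.02),
                   domain_lm_logp=math.log(0.10),
                   background_lm_logp=math.log(0.01)))
```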
arXiv Detail & Related papers (2022-01-10T21:22:39Z)
- Advanced Long-context End-to-end Speech Recognition Using Context-expanded Transformers [56.56220390953412]
We extend our prior work by introducing the Conformer architecture to further improve the accuracy.
We demonstrate that the extended Transformer provides state-of-the-art end-to-end ASR performance.
arXiv Detail & Related papers (2021-04-19T16:18:00Z)
- ATCSpeechNet: A multilingual end-to-end speech recognition framework for air traffic control systems [15.527854608553824]
ATCSpeechNet is proposed to tackle the issue of translating communication speech into human-readable text in air traffic control systems.
An end-to-end paradigm is developed to convert speech waveform into text directly, without any feature engineering or lexicon.
Experimental results on the ATCSpeech corpus demonstrate that the proposed approach achieves high performance with a very small labeled corpus.
arXiv Detail & Related papers (2021-02-17T02:27:09Z)
- Robust Prediction of Punctuation and Truecasing for Medical ASR [18.08508027663331]
This paper proposes a conditional joint modeling framework for prediction of punctuation and truecasing.
We also present techniques for domain and task specific adaptation by fine-tuning masked language models with medical domain data.
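A minimal sketch of joint punctuation and truecasing prediction framed as per-token tagging; the label sets, the shared-encoder design, and all sizes are assumptions rather than the paper's conditional joint model.

```python
# Hedged sketch: per-token punctuation and truecasing tagging with a
# shared encoder (label sets and sizes are assumptions).
import torch
import torch.nn as nn

PUNCT = ["", ",", ".", "?"]
CASE = ["lower", "capitalize", "upper"]

D = 128
encoder = nn.LSTM(input_size=D, hidden_size=D, batch_first=True)
punct_head = nn.Linear(D, len(PUNCT))
case_head = nn.Linear(D, len(CASE))

tokens = torch.randn(1, 6, D)          # embedded ASR tokens (B, T, D)
hidden, _ = encoder(tokens)
# A conditional joint model could feed punctuation predictions into
# the casing head; here the heads merely share the encoder.
punct_logits = punct_head(hidden)      # (B, T, 4)
case_logits = case_head(hidden)        # (B, T, 3)
print(punct_logits.shape, case_logits.shape)
```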
arXiv Detail & Related papers (2020-07-04T07:15:13Z)
- You Do Not Need More Data: Improving End-To-End Speech Recognition by Text-To-Speech Data Augmentation [59.31769998728787]
We build our TTS system on an ASR training database and then extend the data with synthesized speech to train a recognition model.
Our system establishes a competitive result for end-to-end ASR trained on the LibriSpeech train-clean-100 set, with a WER of 4.3% on test-clean and 13.5% on test-other.
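A minimal sketch of the augmentation loop this summary describes: speech synthesized from additional text is pooled with real utterances for ASR training. The synthesize() helper is a hypothetical stand-in for the paper's TTS system, which is built on the ASR training database.

```python
# Hedged sketch of TTS data augmentation: pool real and synthesized
# utterances before training (synthesize() is a hypothetical stand-in).
import random
from dataclasses import dataclass
from typing import List

@dataclass
class Utterance:
    audio: List[float]   # waveform samples (placeholder)
    text: str

def synthesize(text: str) -> List[float]:
    """Hypothetical TTS front end; returns a dummy waveform here."""
    return [random.random() for _ in range(16_000)]

real = [Utterance([0.0] * 16_000, "turn on the lights")]
extra_texts = ["play jazz music", "set a timer for ten minutes"]

# Extend the training pool with synthesized utterances, then shuffle
# so real and synthetic examples are mixed within each epoch.
augmented = real + [Utterance(synthesize(t), t) for t in extra_texts]
random.shuffle(augmented)
print(len(augmented))  # 3
```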
arXiv Detail & Related papers (2020-05-14T17:24:57Z)
- Improving Readability for Automatic Speech Recognition Transcription [50.86019112545596]
We propose a novel NLP task called ASR post-processing for readability (APR).
APR aims to transform the noisy ASR output into a readable text for humans and downstream tasks while maintaining the semantic meaning of the speaker.
We compare fine-tuned models based on several open-sourced and adapted pre-trained models with the traditional pipeline method.
arXiv Detail & Related papers (2020-04-09T09:26:42Z)