Improving RNN-T ASR Performance with Date-Time and Location Awareness
- URL: http://arxiv.org/abs/2106.06183v1
- Date: Fri, 11 Jun 2021 05:57:30 GMT
- Title: Improving RNN-T ASR Performance with Date-Time and Location Awareness
- Authors: Swayambhu Nath Ray, Soumyajit Mitra, Raghavendra Bilgi, Sri Garimella
- Abstract summary: We show that date-time and location context, when used individually, each improve overall performance by as much as 3.48% relative to the baseline.
On specific domains, these contextual signals show improvements as high as 11.5%, without any significant degradation on others.
Our results indicate that with limited data to train the ASR model, contextual signals can improve the performance significantly.
- Score: 6.308539010172309
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In this paper, we explore the benefits of incorporating context into a
Recurrent Neural Network Transducer (RNN-T) based Automatic Speech Recognition (ASR)
model to improve speech recognition for virtual assistants. Specifically, we use
meta information extracted from the time at which the utterance is spoken and
the approximate location information to make ASR context aware. We show that
these contextual signals, when used individually, improve overall
performance by as much as 3.48% relative to the baseline and when the contexts
are combined, the model learns complementary features and the recognition
improves by 4.62%. On specific domains, these contextual signals show
improvements as high as 11.5%, without any significant degradation on others.
We ran experiments with models trained on datasets of 30K hours and 10K
hours. We show that the scale of improvement with the 10K hours dataset is much
higher than that obtained with the 30K hours dataset. Our results indicate that
with limited data to train the ASR model, contextual signals can improve the
performance significantly.
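The paper does not include an implementation, but the approach described above (deriving coarse date-time and location signals from utterance metadata and feeding them to the RNN-T acoustic encoder) can be sketched as follows. This is a minimal, hypothetical PyTorch illustration: the module names, embedding sizes, and the exact injection point (concatenating a context vector with every encoder input frame) are assumptions, not the authors' architecture.

```python
# Hedged sketch: one common way to condition an RNN-T encoder on
# date-time and location context. Names, dimensions, and the injection
# point are illustrative assumptions, not the paper's published code.
import torch
import torch.nn as nn

class ContextEmbedder(nn.Module):
    """Embeds coarse date-time and location signals into one vector."""
    def __init__(self, num_hour_buckets=24, num_days=7,
                 num_regions=500, dim=64):
        super().__init__()
        self.hour_emb = nn.Embedding(num_hour_buckets, dim)
        self.day_emb = nn.Embedding(num_days, dim)
        self.region_emb = nn.Embedding(num_regions, dim)
        self.proj = nn.Linear(3 * dim, dim)

    def forward(self, hour, day, region):
        # hour, day, region: (batch,) integer IDs derived from metadata
        ctx = torch.cat(
            [self.hour_emb(hour), self.day_emb(day), self.region_emb(region)],
            dim=-1)
        return torch.tanh(self.proj(ctx))  # (batch, dim)

class ContextAwareEncoder(nn.Module):
    """RNN-T acoustic encoder with the context vector appended per frame."""
    def __init__(self, feat_dim=80, ctx_dim=64, hidden=512, layers=2):
        super().__init__()
        self.context = ContextEmbedder(dim=ctx_dim)
        self.lstm = nn.LSTM(feat_dim + ctx_dim, hidden,
                            num_layers=layers, batch_first=True)

    def forward(self, feats, hour, day, region):
        # feats: (batch, time, feat_dim) acoustic features
        ctx = self.context(hour, day, region)                # (batch, ctx_dim)
        ctx = ctx.unsqueeze(1).expand(-1, feats.size(1), -1) # repeat per frame
        out, _ = self.lstm(torch.cat([feats, ctx], dim=-1))
        return out  # fed to the RNN-T joint network downstream

# Usage: a single 3-second utterance spoken on a Friday at 07:00 in region 42.
enc = ContextAwareEncoder()
feats = torch.randn(1, 300, 80)
hour, day, region = torch.tensor([7]), torch.tensor([4]), torch.tensor([42])
print(enc(feats, hour, day, region).shape)  # torch.Size([1, 300, 512])
```

Concatenating the context at every encoder frame is only one plausible choice; injecting it into the prediction network or the joint network would be equally reasonable variants of the same idea.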
Related papers
- Anatomy of Industrial Scale Multilingual ASR [13.491861238522421]
This paper describes AssemblyAI's industrial-scale automatic speech recognition (ASR) system.
Our system leverages a diverse training dataset comprising unsupervised (12.5M hours), supervised (188k hours), and pseudo-labeled (1.6M hours) data across four languages.
arXiv Detail & Related papers (2024-04-15T14:48:43Z) - Conformer-1: Robust ASR via Large-Scale Semisupervised Bootstrapping [1.7593130415737603]
This paper presents an end-to-end Automatic Speech Recognition (ASR) model trained on an extensive dataset of 570k hours of speech audio data.
We generate pseudo-labels for the unlabeled public data using a strong Conformer RNN-T baseline model.
Adding these pseudo-labeled data yields relative Word Error Rate (WER) improvements of 11.5% and 24.3% for our asynchronous and realtime models, respectively.
arXiv Detail & Related papers (2024-04-10T20:40:24Z) - BRAVEn: Improving Self-Supervised Pre-training for Visual and Auditory Speech Recognition [72.51848069125822]
We propose BRAVEn, an extension to the RAVEn method, which learns speech representations entirely from raw audio-visual data.
Our modifications to RAVEn enable BRAVEn to achieve state-of-the-art results among self-supervised methods.
Our results suggest that readily available unlabelled audio-visual data can largely replace costly transcribed data.
arXiv Detail & Related papers (2024-04-02T16:48:20Z) - Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels [100.43280310123784]
We investigate the use of automatically-generated transcriptions of unlabelled datasets to increase the training set size.
We demonstrate that increasing the size of the training set, a recent trend in the literature, leads to reduced WER despite using noisy transcriptions.
The proposed model achieves new state-of-the-art performance on AV-ASR on LRS2 and LRS3.
arXiv Detail & Related papers (2023-03-25T00:37:34Z) - Analyzing And Improving Neural Speaker Embeddings for ASR [54.30093015525726]
We present our work on integrating neural speaker embeddings into a Conformer-based hybrid HMM ASR system.
Our best Conformer-based hybrid ASR system with speaker embeddings achieves 9.0% WER on Hub5'00 and Hub5'01 with training on SWB 300h.
arXiv Detail & Related papers (2023-01-11T16:56:03Z) - Data Augmentation for Low-Resource Quechua ASR Improvement [2.260916274164351]
Deep learning methods have made it possible to deploy English ASR systems with word error rates below 5%.
For so-called low-resource languages, methods of creating new resources on the basis of existing ones are being investigated.
We describe our data augmentation approach to improve the results of ASR models for low-resource and agglutinative languages.
arXiv Detail & Related papers (2022-07-14T12:49:15Z) - CI-AVSR: A Cantonese Audio-Visual Speech Dataset for In-car Command Recognition [91.33781557979819]
We introduce a new dataset, Cantonese In-car Audio-Visual Speech Recognition (CI-AVSR)
It consists of 4,984 samples (8.3 hours) of 200 in-car commands recorded by 30 native Cantonese speakers.
We provide detailed statistics of both the clean and the augmented versions of our dataset.
arXiv Detail & Related papers (2022-01-11T06:32:12Z) - Robust Self-Supervised Audio-Visual Speech Recognition [29.526786921769613]
We present a self-supervised audio-visual speech recognition framework built upon Audio-Visual HuBERT (AV-HuBERT).
On the largest available AVSR benchmark dataset LRS3, our approach outperforms prior state-of-the-art by 50% (28.0% vs. 14.1%) using less than 10% of labeled data.
Our approach reduces the WER of an audio-based model by over 75% (25.8% vs. 5.8%) on average.
arXiv Detail & Related papers (2022-01-05T18:50:50Z) - Advanced Long-context End-to-end Speech Recognition Using Context-expanded Transformers [56.56220390953412]
We extend our prior work by introducing the Conformer architecture to further improve the accuracy.
We demonstrate that the extended Transformer provides state-of-the-art end-to-end ASR performance.
arXiv Detail & Related papers (2021-04-19T16:18:00Z) - Data Augmenting Contrastive Learning of Speech Representations in the Time Domain [92.50459322938528]
We introduce WavAugment, a time-domain data augmentation library.
We find that a combination of pitch modification, additive noise and reverberation substantially increases the performance of CPC.
We also show that time-domain data augmentation consistently improves downstream limited-supervision phoneme classification by 12-15% relative; a generic illustration of such time-domain augmentation is sketched below.
arXiv Detail & Related papers (2020-07-02T09:59:51Z)
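For reference, the kind of time-domain augmentation mentioned in the WavAugment entry above (additive noise and reverberation) can be illustrated with plain NumPy. This is not the WavAugment API; the function names and parameters here are purely illustrative, and pitch modification would typically require a dedicated audio library.

```python
# Hedged sketch: generic time-domain augmentation (additive noise plus a
# crude synthetic reverberation). Illustrative only; not the WavAugment API.
import numpy as np

def add_noise(wav: np.ndarray, snr_db: float, rng=np.random) -> np.ndarray:
    """Mix in white noise at the requested signal-to-noise ratio (dB)."""
    noise = rng.standard_normal(len(wav))
    sig_pow = np.mean(wav ** 2) + 1e-12
    noise_pow = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(sig_pow / (noise_pow * 10 ** (snr_db / 10.0)))
    return wav + scale * noise

def add_reverb(wav: np.ndarray, sr: int, rt60: float = 0.3) -> np.ndarray:
    """Convolve with a simple exponentially decaying impulse response."""
    ir_len = int(sr * rt60)
    t = np.arange(ir_len) / sr
    ir = np.random.standard_normal(ir_len) * np.exp(-6.9 * t / rt60)
    ir[0] = 1.0                        # keep the direct path dominant
    wet = np.convolve(wav, ir)[: len(wav)]
    return wet / (np.max(np.abs(wet)) + 1e-12)

# Usage on a synthetic 1-second, 16 kHz sine tone.
sr = 16000
wav = np.sin(2 * np.pi * 440 * np.arange(sr) / sr).astype(np.float32)
augmented = add_reverb(add_noise(wav, snr_db=10.0), sr)
print(augmented.shape)  # (16000,)
```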