Contextual RNN-T For Open Domain ASR
- URL: http://arxiv.org/abs/2006.03411v2
- Date: Wed, 12 Aug 2020 23:17:13 GMT
- Title: Contextual RNN-T For Open Domain ASR
- Authors: Mahaveer Jain, Gil Keren, Jay Mahadeokar, Geoffrey Zweig, Florian
Metze, Yatharth Saraf
- Abstract summary: End-to-end (E2E) systems for automatic speech recognition (ASR) blend the individual components of a traditional hybrid ASR system into a single neural network.
While this has some nice advantages, it limits the system to being trained on only paired audio and text.
Because of this, E2E models tend to have difficulties with correctly recognizing rare words that are not frequently seen during training, such as entity names.
We propose modifications to the RNN-T model that allow the model to utilize additional metadata text with the objective of improving performance on these named entity words.
- Score: 41.83409885125617
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: End-to-end (E2E) systems for automatic speech recognition (ASR), such as RNN
Transducer (RNN-T) and Listen-Attend-Spell (LAS), blend the individual
components of a traditional hybrid ASR system - acoustic model, language model,
pronunciation model - into a single neural network. While this has some nice
advantages, it limits the system to be trained using only paired audio and
text. Because of this, E2E models tend to have difficulties with correctly
recognizing rare words that are not frequently seen during training, such as
entity names. In this paper, we propose modifications to the RNN-T model that
allow the model to utilize additional metadata text with the objective of
improving performance on these named entity words. We evaluate our approach on
an in-house dataset sampled from de-identified public social media videos,
which represent an open domain ASR task. By using an attention model and a
biasing model to leverage the contextual metadata that accompanies a video, we
observe a relative improvement of about 16% in Word Error Rate on Named
Entities (WER-NE) for videos with related metadata.
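The approach described above can be pictured as a small attention module over the metadata text whose output is fused into the RNN-T network. Below is a minimal PyTorch sketch of that idea, assuming the context vector is concatenated into the joiner input; the module names, dimensions, and fusion point are illustrative assumptions, not the authors' actual implementation.

```python
# Minimal sketch (assumed design, not the paper's code): attend over metadata
# token embeddings using the prediction-network state as the query, then feed
# the resulting context vector into the RNN-T joiner alongside the encoder and
# prediction-network outputs.
import torch
import torch.nn as nn


class MetadataAttention(nn.Module):
    """Attends over metadata token embeddings given a prediction-network state."""

    def __init__(self, meta_vocab: int, embed_dim: int, query_dim: int):
        super().__init__()
        self.embed = nn.Embedding(meta_vocab, embed_dim)
        self.query_proj = nn.Linear(query_dim, embed_dim)

    def forward(self, meta_tokens: torch.Tensor, decoder_state: torch.Tensor):
        # meta_tokens: (B, M) token ids; decoder_state: (B, query_dim)
        keys = self.embed(meta_tokens)                       # (B, M, E)
        query = self.query_proj(decoder_state).unsqueeze(1)  # (B, 1, E)
        scores = torch.bmm(query, keys.transpose(1, 2))      # (B, 1, M)
        weights = torch.softmax(scores, dim=-1)
        context = torch.bmm(weights, keys).squeeze(1)        # (B, E)
        return context


class ContextualJoiner(nn.Module):
    """RNN-T joiner that additionally consumes the metadata context vector."""

    def __init__(self, enc_dim: int, pred_dim: int, ctx_dim: int, vocab: int):
        super().__init__()
        self.proj = nn.Linear(enc_dim + pred_dim + ctx_dim, vocab)

    def forward(self, enc_out, pred_out, context):
        # enc_out: (B, T, enc_dim), pred_out: (B, U, pred_dim), context: (B, ctx_dim)
        T, U = enc_out.size(1), pred_out.size(1)
        enc = enc_out.unsqueeze(2).expand(-1, -1, U, -1)
        pred = pred_out.unsqueeze(1).expand(-1, T, -1, -1)
        ctx = context[:, None, None, :].expand(-1, T, U, -1)
        return self.proj(torch.cat([enc, pred, ctx], dim=-1))  # (B, T, U, vocab)
```

Recomputing the context vector from the prediction-network state lets the bias shift as the hypothesis grows; the paper's separate biasing model may combine the metadata differently.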
Related papers
- Improved Contextual Recognition In Automatic Speech Recognition Systems
By Semantic Lattice Rescoring [4.819085609772069]
We propose a novel approach for enhancing contextual recognition within ASR systems via semantic lattice processing.
Our solution combines Hidden Markov Models and Gaussian Mixture Models (HMM-GMM) with deep neural network (DNN) models for better accuracy.
We demonstrate the effectiveness of our proposed framework on the LibriSpeech dataset with empirical analyses.
arXiv Detail & Related papers (2023-10-14T23:16:05Z) - End-to-End Speech Recognition: A Survey [68.35707678386949]
The goal of this survey is to provide a taxonomy of E2E ASR models and corresponding improvements.
All relevant aspects of E2E ASR are covered in this work, accompanied by discussions of performance and deployment opportunities.
arXiv Detail & Related papers (2023-03-03T01:46:41Z) - AV-data2vec: Self-supervised Learning of Audio-Visual Speech
Representations with Contextualized Target Representations [88.30635799280923]
We introduce AV-data2vec, which builds audio-visual representations based on predicting contextualized representations.
Results on LRS3 show that AV-data2vec consistently outperforms existing methods with the same amount of data and model size.
arXiv Detail & Related papers (2023-02-10T02:55:52Z) - Integrating Text Inputs For Training and Adapting RNN Transducer ASR
Models [29.256853083988634]
We propose a novel text representation and training framework for E2E ASR models.
We show that a trained RNN Transducer (RNN-T) model's internal LM component can be effectively adapted with text-only data.
arXiv Detail & Related papers (2022-02-26T15:03:09Z) - Streaming Multi-Talker ASR with Token-Level Serialized Output Training [53.11450530896623]
t-SOT is a novel framework for streaming multi-talker automatic speech recognition.
The t-SOT model has the advantages of lower inference cost and a simpler model architecture.
For non-overlapping speech, the t-SOT model is on par with a single-talker ASR model in terms of both accuracy and computational cost.
arXiv Detail & Related papers (2022-02-02T01:27:21Z) - A Likelihood Ratio based Domain Adaptation Method for E2E Models [10.510472957585646]
End-to-end (E2E) automatic speech recognition models like the Recurrent Neural Network Transducer (RNN-T) are becoming a popular choice for streaming ASR applications like voice assistants.
While E2E models are very effective at learning representations of the data they are trained on, their accuracy on unseen domains remains a challenging problem.
In this work, we explore a contextual biasing approach using likelihood ratios that leverages text data sources to adapt the RNN-T model to new domains and entities (a minimal decode-time sketch of this biasing idea appears after this list).
arXiv Detail & Related papers (2022-01-10T21:22:39Z) - Fast Text-Only Domain Adaptation of RNN-Transducer Prediction Network [0.0]
We show that RNN-transducer models can be effectively adapted to new domains using only small amounts of textual data.
We show with multiple ASR evaluation tasks how this method can provide relative gains of 10-45% in target task WER.
arXiv Detail & Related papers (2021-04-22T15:21:41Z) - End-to-end Named Entity Recognition from English Speech [51.22888702264816]
We introduce the first publicly available NER-annotated dataset for English speech and present an E2E approach that jointly optimizes the ASR and NER tagger components.
We also discuss how NER from speech can be used to handle out-of-vocabulary (OOV) words in an ASR system.
arXiv Detail & Related papers (2020-05-22T13:39:14Z) - Contextualizing ASR Lattice Rescoring with Hybrid Pointer Network
Language Model [26.78064626111014]
In building automatic speech recognition systems, we can exploit the contextual information provided by video metadata.
We first use an attention-based method to extract contextual vector representations of video metadata and use these representations as part of the inputs to a neural language model.
Second, we propose a hybrid pointer network approach to explicitly interpolate the probabilities of words that occur in the metadata.
arXiv Detail & Related papers (2020-05-15T07:47:33Z) - Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and downstream LU systems can be reduced significantly, by 14% relative, with joint models trained using small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)
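A recurring mechanism in the biasing-oriented entries above (and in the biasing model of the main paper) is a decode-time score boost for hypotheses that match contextual text such as video metadata. The snippet below is a self-contained sketch of that idea using a prefix trie over biasing phrases; the bonus value, trie layout, and function names are assumptions for illustration, not any single paper's method.

```python
# Hedged sketch of decode-time contextual biasing: candidate words that
# continue a phrase found in the contextual metadata get a score bonus
# during beam search. All names and values here are illustrative.
from collections import defaultdict


def build_prefix_trie(phrases):
    """Map every proper prefix of a biasing phrase to the words that may follow it."""
    trie = defaultdict(set)
    for phrase in phrases:
        words = phrase.lower().split()
        for i, word in enumerate(words):
            trie[tuple(words[:i])].add(word)
    return trie


def biased_score(base_log_prob, hypothesis_words, candidate_word, trie, bonus=2.0):
    """Boost the score of a candidate word that continues a biasing phrase."""
    # Find any suffix of the hypothesis that is a prefix of some biasing phrase.
    for start in range(len(hypothesis_words) + 1):
        suffix = tuple(w.lower() for w in hypothesis_words[start:])
        if candidate_word.lower() in trie.get(suffix, set()):
            return base_log_prob + bonus
    return base_log_prob


# Example: bias toward an entity name taken from (hypothetical) video metadata.
trie = build_prefix_trie(["yosemite national park"])
print(biased_score(-3.2, ["visiting", "yosemite"], "national", trie))  # boosted
print(biased_score(-3.2, ["visiting", "the"], "national", trie))       # unchanged
```

A fixed bonus is the simplest choice; likelihood-ratio approaches, roughly speaking, derive the adjustment from the ratio of an in-domain language model's score to a general one's rather than from a constant.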