Contextual RNN-T For Open Domain ASR
- URL: http://arxiv.org/abs/2006.03411v2
- Date: Wed, 12 Aug 2020 23:17:13 GMT
- Title: Contextual RNN-T For Open Domain ASR
- Authors: Mahaveer Jain, Gil Keren, Jay Mahadeokar, Geoffrey Zweig, Florian
Metze, Yatharth Saraf
- Abstract summary: End-to-end (E2E) systems for automatic speech recognition (ASR) blend the individual components of a traditional hybrid ASR system into a single neural network.
While this has some nice advantages, it limits the system to being trained on only paired audio and text.
Because of this, E2E models tend to have difficulties with correctly recognizing rare words that are not frequently seen during training, such as entity names.
We propose modifications to the RNN-T model that allow the model to utilize additional metadata text with the objective of improving performance on these named entity words.
- Score: 41.83409885125617
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: End-to-end (E2E) systems for automatic speech recognition (ASR), such as RNN
Transducer (RNN-T) and Listen-Attend-Spell (LAS), blend the individual
components of a traditional hybrid ASR system - acoustic model, language model,
pronunciation model - into a single neural network. While this has some nice
advantages, it limits the system to be trained using only paired audio and
text. Because of this, E2E models tend to have difficulties with correctly
recognizing rare words that are not frequently seen during training, such as
entity names. In this paper, we propose modifications to the RNN-T model that
allow the model to utilize additional metadata text with the objective of
improving performance on these named entity words. We evaluate our approach on
an in-house dataset sampled from de-identified public social media videos,
which represent an open domain ASR task. By using an attention model and a
biasing model to leverage the contextual metadata that accompanies a video, we
observe a relative improvement of about 16% in Word Error Rate on Named
Entities (WER-NE) for videos with related metadata.
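The approach described above can be pictured as a small attention module over the metadata text whose output is fused into the RNN-T network. Below is a minimal PyTorch sketch of that idea, assuming the context vector is concatenated into the joiner input; the module names, dimensions, and fusion point are illustrative assumptions, not the authors' actual implementation.

```python
# Minimal sketch (assumed design, not the paper's code): attend over metadata
# token embeddings using the prediction-network state as the query, then feed
# the resulting context vector into the RNN-T joiner alongside the encoder and
# prediction-network outputs.
import torch
import torch.nn as nn


class MetadataAttention(nn.Module):
    """Attends over metadata token embeddings given a prediction-network state."""

    def __init__(self, meta_vocab: int, embed_dim: int, query_dim: int):
        super().__init__()
        self.embed = nn.Embedding(meta_vocab, embed_dim)
        self.query_proj = nn.Linear(query_dim, embed_dim)

    def forward(self, meta_tokens: torch.Tensor, decoder_state: torch.Tensor):
        # meta_tokens: (B, M) token ids; decoder_state: (B, query_dim)
        keys = self.embed(meta_tokens)                       # (B, M, E)
        query = self.query_proj(decoder_state).unsqueeze(1)  # (B, 1, E)
        scores = torch.bmm(query, keys.transpose(1, 2))      # (B, 1, M)
        weights = torch.softmax(scores, dim=-1)
        context = torch.bmm(weights, keys).squeeze(1)        # (B, E)
        return context


class ContextualJoiner(nn.Module):
    """RNN-T joiner that additionally consumes the metadata context vector."""

    def __init__(self, enc_dim: int, pred_dim: int, ctx_dim: int, vocab: int):
        super().__init__()
        self.proj = nn.Linear(enc_dim + pred_dim + ctx_dim, vocab)

    def forward(self, enc_out, pred_out, context):
        # enc_out: (B, T, enc_dim), pred_out: (B, U, pred_dim), context: (B, ctx_dim)
        T, U = enc_out.size(1), pred_out.size(1)
        enc = enc_out.unsqueeze(2).expand(-1, -1, U, -1)
        pred = pred_out.unsqueeze(1).expand(-1, T, -1, -1)
        ctx = context[:, None, None, :].expand(-1, T, U, -1)
        return self.proj(torch.cat([enc, pred, ctx], dim=-1))  # (B, T, U, vocab)
```

Recomputing the context vector from the prediction-network state lets the bias shift as the hypothesis grows; the paper's separate biasing model may combine the metadata differently.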
Related papers
- Improved Contextual Recognition In Automatic Speech Recognition Systems
By Semantic Lattice Rescoring [4.819085609772069]
We propose a novel approach for enhancing contextual recognition within ASR systems via semantic lattice processing.
Our solution combines Hidden Markov Models and Gaussian Mixture Models (HMM-GMM) with deep neural network (DNN) models for better accuracy.
We demonstrate the effectiveness of our proposed framework on the LibriSpeech dataset with empirical analyses.
arXiv Detail & Related papers (2023-10-14T23:16:05Z) - End-to-End Speech Recognition: A Survey [68.35707678386949]
The goal of this survey is to provide a taxonomy of E2E ASR models and corresponding improvements.
All relevant aspects of E2E ASR are covered in this work, accompanied by discussions of performance and deployment opportunities.
arXiv Detail & Related papers (2023-03-03T01:46:41Z) - AV-data2vec: Self-supervised Learning of Audio-Visual Speech
Representations with Contextualized Target Representations [88.30635799280923]
We introduce AV-data2vec, which builds audio-visual representations based on predicting contextualized representations.
Results on LRS3 show that AV-data2vec consistently outperforms existing methods with the same amount of data and model size.
arXiv Detail & Related papers (2023-02-10T02:55:52Z) - Integrating Text Inputs For Training and Adapting RNN Transducer ASR
Models [29.256853083988634]
We propose a novel text representation and training framework for E2E ASR models.
We show that a trained RNN Transducer (RNN-T) model's internal LM component can be effectively adapted with text-only data.
arXiv Detail & Related papers (2022-02-26T15:03:09Z) - Streaming Multi-Talker ASR with Token-Level Serialized Output Training [53.11450530896623]
t-SOT is a novel framework for streaming multi-talker automatic speech recognition.
The t-SOT model has the advantages of lower inference cost and a simpler model architecture.
For non-overlapping speech, the t-SOT model is on par with a single-talker ASR model in terms of both accuracy and computational cost.
arXiv Detail & Related papers (2022-02-02T01:27:21Z) - A Likelihood Ratio based Domain Adaptation Method for E2E Models [10.510472957585646]
End-to-end (E2E) automatic speech recognition models like the Recurrent Neural Network Transducer (RNN-T) are becoming a popular choice for streaming ASR applications like voice assistants.
While E2E models are very effective at learning representations of the data they are trained on, their accuracy on unseen domains remains a challenging problem.
In this work, we explore a contextual biasing approach using likelihood ratios that leverages text data sources to adapt the RNN-T model to new domains and entities (a minimal decode-time sketch of this biasing idea appears after this list).
arXiv Detail & Related papers (2022-01-10T21:22:39Z) - Fast Text-Only Domain Adaptation of RNN-Transducer Prediction Network [0.0]
We show that RNN-transducer models can be effectively adapted to new domains using only small amounts of textual data.
We show with multiple ASR evaluation tasks how this method can provide relative gains of 10-45% in target task WER.
arXiv Detail & Related papers (2021-04-22T15:21:41Z) - End-to-end Named Entity Recognition from English Speech [51.22888702264816]
We introduce the first publicly available NER-annotated dataset for English speech and present an E2E approach that jointly optimizes the ASR and NER tagger components.
We also discuss how NER from speech can be used to handle out-of-vocabulary (OOV) words in an ASR system.
arXiv Detail & Related papers (2020-05-22T13:39:14Z) - Contextualizing ASR Lattice Rescoring with Hybrid Pointer Network
Language Model [26.78064626111014]
In building automatic speech recognition systems, we can exploit the contextual information provided by video metadata.
We first use an attention-based method to extract contextual vector representations of video metadata and use these representations as part of the inputs to a neural language model.
Second, we propose a hybrid pointer network approach to explicitly interpolate the probabilities of words that occur in the metadata.
arXiv Detail & Related papers (2020-05-15T07:47:33Z) - Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and downstream LU systems can be reduced significantly, by 14% relative, with joint models trained using small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)
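A recurring mechanism in the biasing-oriented entries above (and in the biasing model of the main paper) is a decode-time score boost for hypotheses that match contextual text such as video metadata. The snippet below is a self-contained sketch of that idea using a prefix trie over biasing phrases; the bonus value, trie layout, and function names are assumptions for illustration, not any single paper's method.

```python
# Hedged sketch of decode-time contextual biasing: candidate words that
# continue a phrase found in the contextual metadata get a score bonus
# during beam search. All names and values here are illustrative.
from collections import defaultdict


def build_prefix_trie(phrases):
    """Map every proper prefix of a biasing phrase to the words that may follow it."""
    trie = defaultdict(set)
    for phrase in phrases:
        words = phrase.lower().split()
        for i, word in enumerate(words):
            trie[tuple(words[:i])].add(word)
    return trie


def biased_score(base_log_prob, hypothesis_words, candidate_word, trie, bonus=2.0):
    """Boost the score of a candidate word that continues a biasing phrase."""
    # Find any suffix of the hypothesis that is a prefix of some biasing phrase.
    for start in range(len(hypothesis_words) + 1):
        suffix = tuple(w.lower() for w in hypothesis_words[start:])
        if candidate_word.lower() in trie.get(suffix, set()):
            return base_log_prob + bonus
    return base_log_prob


# Example: bias toward an entity name taken from (hypothetical) video metadata.
trie = build_prefix_trie(["yosemite national park"])
print(biased_score(-3.2, ["visiting", "yosemite"], "national", trie))  # boosted
print(biased_score(-3.2, ["visiting", "the"], "national", trie))       # unchanged
```

A fixed bonus is the simplest choice; likelihood-ratio approaches, roughly speaking, derive the adjustment from the ratio of an in-domain language model's score to a general one's rather than from a constant.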