Two Stage Contextual Word Filtering for Context bias in Unified Streaming and Non-streaming Transducer
- URL: http://arxiv.org/abs/2301.06735v3
- Date: Thu, 8 Jun 2023 13:29:38 GMT
- Title: Two Stage Contextual Word Filtering for Context bias in Unified Streaming and Non-streaming Transducer
- Authors: Zhanheng Yang, Sining Sun, Xiong Wang, Yike Zhang, Long Ma, Lei Xie
- Abstract summary: It is difficult for an E2E ASR system to recognize words such as entities appearing infrequently in the training data.
We propose an efficient approach to obtain a high quality contextual list for a unified streaming/non-streaming based E2E model.
- Score: 17.835882045443896
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: It is difficult for an E2E ASR system to recognize words such as entities
appearing infrequently in the training data. A widely used method to mitigate
this issue is feeding contextual information into the acoustic model. Previous
works have proven that a compact and accurate contextual list can boost the
performance significantly. In this paper, we propose an efficient approach to
obtain a high quality contextual list for a unified streaming/non-streaming
based E2E model. Specifically, we make use of the phone-level streaming output
to first filter the predefined contextual word list and then fuse the filtered
list into the non-causal encoder and decoder to generate the final recognition
results. Our approach improves the accuracy of the contextual ASR system and
speeds up the inference process. Experiments on two datasets demonstrate over
20% CER reduction compared to the baseline system. Meanwhile, the real-time
factor (RTF) of our system stays within 0.15 even when the size of the
contextual word list grows beyond 6,000.
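A minimal sketch of the first, phone-level filtering stage under stated assumptions: `phone_lexicon` (a map from each context word to its phone sequence), the longest-match similarity score, and the 0.8 threshold are illustrative stand-ins, not details taken from the paper.

```python
from difflib import SequenceMatcher

def filter_context_list(stream_phones, phone_lexicon, threshold=0.8):
    """Stage 1: keep only context words whose phone sequence approximately
    occurs in the phone hypothesis emitted by the streaming (causal) pass.

    stream_phones -- list of phones from the streaming output
    phone_lexicon -- dict mapping each context word to its phone sequence
    threshold     -- fraction of the word's phones that must match (assumed)
    """
    stream = " ".join(stream_phones)
    kept = []
    for word, phones in phone_lexicon.items():
        target = " ".join(phones)
        match = SequenceMatcher(None, target, stream).find_longest_match(
            0, len(target), 0, len(stream))
        # Longest contiguous overlap as a crude phone-level similarity.
        if match.size / max(len(target), 1) >= threshold:
            kept.append(word)
    return kept

# Stage 2 (not sketched here): only the surviving words in `kept` are fused
# into the non-causal encoder/decoder pass that produces the final result.
```

Because the second pass only attends over the surviving words, the biasing cost stops scaling with the full list, which is consistent with the reported flat RTF once the list grows past 6,000 entries.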
Related papers
- Improving Neural Biasing for Contextual Speech Recognition by Early Context Injection and Text Perturbation [27.057810339120664]
We propose two techniques to improve context-aware ASR models.
On LibriSpeech, our techniques together reduce the rare word error rate by 60% and 25% relative to no biasing and shallow fusion, respectively.
On SPGISpeech and a real-world dataset ConEC, our techniques also yield good improvements over the baselines.
arXiv Detail & Related papers (2024-07-14T19:32:33Z) - Written Term Detection Improves Spoken Term Detection [9.961529254621432]
We propose a multitask training objective which allows unpaired text to be integrated into E2E KWS without complicating indexing and search.
In addition to training an E2E KWS model to retrieve text queries from spoken documents, we jointly train it to retrieve text queries from masked written documents.
We show that this approach can effectively leverage unpaired text for KWS, with significant improvements in search performance across a wide variety of languages.
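A hedged sketch of what such a joint objective could look like; `embed_query`, `embed_speech`, `embed_text`, `mask_token_id`, the 15% mask rate, and the cosine loss are all assumptions for illustration, not the paper's actual training recipe.

```python
import torch
import torch.nn.functional as F

def multitask_kws_loss(model, query_ids, speech_doc, text_doc, alpha=0.5):
    """Joint retrieval objective over spoken and masked written documents."""
    q = model.embed_query(query_ids)                 # (B, D) text query
    s = model.embed_speech(speech_doc)               # (B, D) spoken document
    masked = text_doc.clone()
    masked[torch.rand(masked.shape) < 0.15] = model.mask_token_id
    w = model.embed_text(masked)                     # (B, D) masked written doc
    target = torch.ones(q.size(0))                   # all pairs are positives here
    # The same retrieval loss on both views lets unpaired text help KWS.
    return (alpha * F.cosine_embedding_loss(q, s, target)
            + (1 - alpha) * F.cosine_embedding_loss(q, w, target))
```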
arXiv Detail & Related papers (2024-07-05T15:50:47Z) - UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units [64.61596752343837]
We present a novel two-pass direct S2ST architecture, UnitY, which first generates textual representations and then predicts discrete acoustic units.
We enhance model performance with subword prediction in the first-pass decoder.
We show that the proposed methods boost performance even when predicting spectrograms in the second pass.
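A rough sketch of the two-pass structure described above; the three callables are hypothetical stand-ins for the actual modules, not UnitY's API.

```python
def unity_two_pass(encoder, text_decoder, unit_decoder, speech):
    """Two-pass decoding per the summary: text first, discrete units second."""
    enc = encoder(speech)
    subwords, text_states = text_decoder(enc)      # first pass: subword text
    units = unit_decoder(enc, text_states)         # second pass: acoustic units
    return subwords, units
```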
arXiv Detail & Related papers (2022-12-15T18:58:28Z) - Contextual-Utterance Training for Automatic Speech Recognition [65.4571135368178]
We propose a contextual-utterance training technique which makes use of the previous and future contextual utterances.
Also, we propose a dual-mode contextual-utterance training technique for streaming automatic speech recognition (ASR) systems.
The proposed technique reduces the WER by more than 6% relative and the average last-token emission latency by more than 40 ms.
arXiv Detail & Related papers (2022-10-27T08:10:44Z) - End-to-end contextual asr based on posterior distribution adaptation for
hybrid ctc/attention system [61.148549738631814]
End-to-end (E2E) speech recognition architectures assemble all components of a traditional speech recognition system into a single model.
Although this simplifies the ASR system, it introduces a drawback for contextual ASR: the E2E model performs worse on utterances containing infrequent proper nouns.
We propose adding a contextual bias attention (CBA) module to an attention-based encoder-decoder (AED) model to improve its ability to recognize contextual phrases.
arXiv Detail & Related papers (2022-02-18T03:26:02Z) - Advanced Long-context End-to-end Speech Recognition Using
Context-expanded Transformers [56.56220390953412]
We extend our prior work by introducing the Conformer architecture to further improve the accuracy.
We demonstrate that the extended Transformer provides state-of-the-art end-to-end ASR performance.
arXiv Detail & Related papers (2021-04-19T16:18:00Z) - CIF-based Collaborative Decoding for End-to-end Contextual Speech
Recognition [14.815422751109061]
We propose a continuous integrate-and-fire (CIF) based model that supports contextual biasing in a more controllable fashion.
An extra context processing network is introduced to extract contextual embeddings, integrate acoustically relevant context information and decode the contextual output distribution.
Our method brings relative character error rate (CER) reduction of 8.83%/21.13% and relative named entity character error rate (NE-CER) reduction of 40.14%/51.50% when compared with a strong baseline.
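A rough sketch of the context-integration idea the summary describes, assuming single-head dot-product attention; the projection matrices, shapes, and residual combination are illustrative, not the paper's architecture.

```python
import torch

def integrate_context(decoder_state, context_emb, w_q, w_k, w_v):
    """Attention over contextual phrase embeddings.

    decoder_state -- (B, D) current decoding representation
    context_emb   -- (N, D) one embedding per phrase in the bias list
    w_q/w_k/w_v   -- (D, D) projection matrices (all shapes are assumptions)
    """
    q = decoder_state @ w_q
    k = context_emb @ w_k
    v = context_emb @ w_v
    att = torch.softmax(q @ k.T / k.size(-1) ** 0.5, dim=-1)  # (B, N)
    # The integrated context biases the state that feeds the contextual
    # output distribution decoded alongside the CIF model's own.
    return decoder_state + att @ v
```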
arXiv Detail & Related papers (2020-12-17T09:40:11Z) - Orthros: Non-autoregressive End-to-end Speech Translation with
Dual-decoder [64.55176104620848]
We propose a novel NAR E2E-ST framework, Orthros, in which both NAR and autoregressive (AR) decoders are jointly trained on the shared speech encoder.
The latter is used to select the best translation among candidates of various lengths generated by the former, which dramatically improves the effectiveness of a large length beam with negligible overhead.
Experiments on four benchmarks show the effectiveness of the proposed method in improving inference speed while maintaining competitive translation quality.
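A minimal sketch of the dual-decoder selection described above; `nar_decode` and `ar_score` are hypothetical callables standing in for the two jointly trained decoders.

```python
def decode_with_length_beam(nar_decode, ar_score, enc_out, length_beam):
    """Dual-decoder inference: both decoders share the encoder output."""
    # NAR decoder: one complete hypothesis per candidate length, in parallel.
    candidates = [nar_decode(enc_out, n) for n in length_beam]
    # AR decoder: a scoring pass only (no autoregressive generation loop),
    # used to pick the best candidate, so the overhead stays small.
    return max(candidates, key=lambda hyp: ar_score(enc_out, hyp))
```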
arXiv Detail & Related papers (2020-10-25T06:35:30Z) - You Do Not Need More Data: Improving End-To-End Speech Recognition by
Text-To-Speech Data Augmentation [59.31769998728787]
We build our TTS system on an ASR training database and then extend the data with synthesized speech to train a recognition model.
Our system establishes a competitive result for end-to-end ASR trained on the LibriSpeech train-clean-100 set, with a WER of 4.3% on test-clean and 13.5% on test-other.
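A minimal sketch of the augmentation loop the summary describes; `synthesize` is a hypothetical TTS callable, not the paper's system.

```python
def extend_with_tts(asr_pairs, synthesize, extra_texts):
    """Mix synthetic (audio, text) pairs from a TTS model trained on the
    ASR corpus into the recognition training data."""
    synthetic = [(synthesize(text), text) for text in extra_texts]
    return list(asr_pairs) + synthetic
```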
arXiv Detail & Related papers (2020-05-14T17:24:57Z)