End-to-End Contextual ASR Based on Posterior Distribution Adaptation for
Hybrid CTC/Attention System
- URL: http://arxiv.org/abs/2202.09003v1
- Date: Fri, 18 Feb 2022 03:26:02 GMT
- Title: End-to-End Contextual ASR Based on Posterior Distribution Adaptation for
Hybrid CTC/Attention System
- Authors: Zhengyi Zhang, Pan Zhou
- Abstract summary: End-to-end (E2E) speech recognition architectures assemble all components of a traditional speech recognition system into a single model.
Although this simplifies the ASR system, it introduces a contextual ASR drawback: the E2E model performs worse on utterances containing infrequent proper nouns.
We propose adding a contextual bias attention (CBA) module to an attention-based encoder-decoder (AED) model to improve its ability to recognize contextual phrases.
- Score: 61.148549738631814
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: End-to-end (E2E) speech recognition architectures assemble all components of
a traditional speech recognition system into a single model. Although this
simplifies the ASR system, it introduces a contextual ASR drawback: the E2E
model performs worse on utterances containing infrequent proper nouns. In this
work, we propose adding a contextual bias attention (CBA) module to an
attention-based encoder-decoder (AED) model to improve its ability to recognize
contextual phrases. Specifically, CBA uses the context vector of the decoder's
source attention to attend to a specific bias embedding. Jointly learned
with the basic AED parameters, CBA can tell the model when and where to bias
its output probability distribution. At inference time, a list of bias phrases
is preloaded, and we adapt the posterior distributions of both the CTC and
attention decoders according to the bias phrase attended to by CBA. We evaluate
the proposed method on GigaSpeech and achieve a consistent relative improvement
in the recall rate of bias phrases, ranging from 15% to 28% over the baseline
model. Meanwhile, our method shows strong anti-bias ability: performance on
general test sets degrades by only 1.7% even when 2,000 bias phrases are present.
Related papers
- XCB: an effective contextual biasing approach to bias cross-lingual phrases in speech recognition [9.03519622415822]
This study introduces a Cross-lingual Contextual Biasing (XCB) module.
We augment a pre-trained ASR model for the dominant language by integrating an auxiliary language biasing module and a language-specific loss.
Experimental results conducted on our in-house code-switching dataset have validated the efficacy of our approach.
arXiv Detail & Related papers (2024-08-20T04:00:19Z)
- Error Correction by Paying Attention to Both Acoustic and Confidence References for Automatic Speech Recognition [52.624909026294105]
We propose a non-autoregressive speech error correction method.
A Confidence Module measures the uncertainty of each word of the N-best ASR hypotheses.
The proposed system reduces the error rate by 21% compared with the ASR model.
arXiv Detail & Related papers (2024-06-29T17:56:28Z)
- Robust Acoustic and Semantic Contextual Biasing in Neural Transducers for Speech Recognition [14.744220870243932]
We propose to use lightweight character representations to encode fine-grained pronunciation features to improve contextual biasing.
We further integrate pretrained neural language model (NLM) based encoders to encode the utterance's semantic context.
Experiments using a Conformer Transducer model on the Librispeech dataset show a 4.62% - 9.26% relative WER improvement on different biasing list sizes.
arXiv Detail & Related papers (2023-05-09T08:51:44Z)
- Improving Contextual Spelling Correction by External Acoustics Attention and Semantic Aware Data Augmentation [31.408074817254732]
We propose an improved non-autoregressive spelling correction model for contextual biasing in E2E neural transducer-based ASR systems.
We incorporate acoustic information via an external attention, along with text hypotheses, into CSC to better distinguish the target phrase from dissimilar or irrelevant phrases.
Experiments show that the improved method outperforms the baseline ASR+Biasing system by as much as 20.3% relative name recall gain.
arXiv Detail & Related papers (2023-02-22T08:00:08Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- Towards Contextual Spelling Correction for Customization of End-to-end Speech Recognition Systems [27.483603895258437]
We introduce a novel approach to do contextual biasing by adding a contextual spelling correction model on top of the end-to-end ASR system.
We propose filtering algorithms to handle large-size context lists, and performance balancing mechanisms to control the biasing degree of the model.
Experiments show that the proposed method achieves as much as 51% relative word error rate (WER) reduction over ASR system and outperforms traditional biasing methods.
arXiv Detail & Related papers (2022-03-02T06:00:48Z)
- Visualizing Classifier Adjacency Relations: A Case Study in Speaker Verification and Voice Anti-Spoofing [72.4445825335561]
We propose a simple method to derive 2D representation from detection scores produced by an arbitrary set of binary classifiers.
Based upon rank correlations, our method facilitates a visual comparison of classifiers with arbitrary scores.
While the approach is fully versatile and can be applied to any detection task, we demonstrate the method using scores produced by automatic speaker verification and voice anti-spoofing systems.
arXiv Detail & Related papers (2021-06-11T13:03:33Z)
- CIF-based Collaborative Decoding for End-to-end Contextual Speech Recognition [14.815422751109061]
We propose a continuous integrate-and-fire (CIF) based model that supports contextual biasing in a more controllable fashion.
An extra context processing network is introduced to extract contextual embeddings, integrate acoustically relevant context information and decode the contextual output distribution.
Our method brings relative character error rate (CER) reduction of 8.83%/21.13% and relative named entity character error rate (NE-CER) reduction of 40.14%/51.50% when compared with a strong baseline.
arXiv Detail & Related papers (2020-12-17T09:40:11Z)
- Deliberation Model Based Two-Pass End-to-End Speech Recognition [52.45841282906516]
A two-pass model has been proposed to rescore streamed hypotheses using the non-streaming Listen, Attend and Spell (LAS) model.
The model attends to acoustics to rescore hypotheses, as opposed to a class of neural correction models that use only first-pass text hypotheses.
A bidirectional encoder is used to extract context information from first-pass hypotheses.
arXiv Detail & Related papers (2020-03-17T22:01:12Z)
- Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and downstream LU systems can be reduced significantly, by 14% relative, with joint models trained using small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.