Cross-Modal ASR Post-Processing System for Error Correction and
Utterance Rejection
- URL: http://arxiv.org/abs/2201.03313v1
- Date: Mon, 10 Jan 2022 12:29:55 GMT
- Authors: Jing Du, Shiliang Pu, Qinbo Dong, Chao Jin, Xin Qi, Dian Gu, Ru Wu,
Hongwei Zhou
- Abstract summary: We propose a cross-modal post-processing system for speech recognizers.
It fuses acoustic features and textual features from different modalities.
It jointly trains a confidence estimator and an error corrector in a multi-task learning fashion.
- Score: 25.940199825317073
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although modern automatic speech recognition (ASR) systems can achieve high
performance, they may produce errors that weaken readers' experience and do
harm to downstream tasks. To improve the accuracy and reliability of ASR
hypotheses, we propose a cross-modal post-processing system for speech
recognizers, which 1) fuses acoustic features and textual features from
different modalities, 2) jointly trains a confidence estimator and an error
corrector in a multi-task learning fashion, and 3) unifies error correction and utterance
rejection modules. Compared with single-modal or single-task models, our
proposed system proves to be more effective and efficient. Experimental results
show that our post-processing system leads to a relative reduction of more than 10%
in character error rate (CER) for both single-speaker and multi-speaker speech
on our industrial ASR system, with about 1.7ms latency for each token, which
ensures that extra latency introduced by post-processing is acceptable in
streaming speech recognition.
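
As a rough illustration of the metrics quoted in the abstract, the sketch below computes CER as character-level edit distance divided by reference length, and the relative reduction between a baseline CER and a post-processed CER. The function names and the example numbers are hypothetical and are not taken from the paper; only the figure of about 1.7 ms per token comes from the abstract.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference character
                            curr[j - 1] + 1,      # insert a hypothesis character
                            prev[j - 1] + cost))  # substitute, or match at no cost
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalised by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Relative reduction of an error rate, e.g. 0.10 -> 0.09 is about 0.1 (10%)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline


# Hypothetical numbers, for illustration only (not results from the paper).
baseline_cer = 0.10
post_processed_cer = 0.09
gain = relative_reduction(baseline_cer, post_processed_cer)

# Added latency for an utterance, assuming the per-token cost is constant.
# 1.7 ms per token is the approximate figure from the abstract; 20 tokens is made up.
added_latency_ms = 1.7 * 20
```

Under these assumptions, a drop from 10% to 9% CER is about a 10% relative reduction, which is the scale of improvement the abstract reports, and a 20-token utterance would incur roughly 34 ms of extra post-processing latency.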
Related papers
- Towards interfacing large language models with ASR systems using confidence measures and prompting [54.39667883394458]
This work investigates post-hoc correction of ASR transcripts with large language models (LLMs).
To avoid introducing errors into likely accurate transcripts, we propose a range of confidence-based filtering methods.
Our results indicate that this can improve the performance of less competitive ASR systems.
arXiv Detail & Related papers (2024-07-31T08:00:41Z)
- Error Correction by Paying Attention to Both Acoustic and Confidence References for Automatic Speech Recognition [52.624909026294105]
We propose a non-autoregressive speech error correction method.
A Confidence Module measures the uncertainty of each word of the N-best ASR hypotheses.
The proposed system reduces the error rate by 21% compared with the ASR model.
arXiv Detail & Related papers (2024-06-29T17:56:28Z)
- MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that promotes representation of each modality by fusing them at different levels of audio/visual encoders.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
arXiv Detail & Related papers (2024-01-07T08:59:32Z)
- PATCorrect: Non-autoregressive Phoneme-augmented Transformer for ASR Error Correction [0.9502148118198473]
We propose PATCorrect, a novel non-autoregressive (NAR) approach to reduce word error rate (WER).
We demonstrate that PATCorrect consistently outperforms the state-of-the-art NAR method on an English corpus across different upstream ASR systems.
arXiv Detail & Related papers (2023-02-10T04:05:24Z)
- The RoyalFlush System of Speech Recognition for M2MeT Challenge [5.863625637354342]
This paper describes our RoyalFlush system for the track of multi-speaker automatic speech recognition (ASR) in the M2MeT challenge.
We adopted a serialized output training (SOT) based multi-speaker ASR system with large-scale simulation data.
Our system got a 12.22% absolute Character Error Rate (CER) reduction on the validation set and 12.11% on the test set.
arXiv Detail & Related papers (2022-02-03T14:38:26Z)
- Improving Distinction between ASR Errors and Speech Disfluencies with Feature Space Interpolation [0.0]
Fine-tuning pretrained language models (LMs) is a popular approach to automatic speech recognition (ASR) error detection during post-processing.
This paper proposes a scheme to improve existing LM-based ASR error detection systems.
arXiv Detail & Related papers (2021-08-04T02:11:37Z)
- Advanced Long-context End-to-end Speech Recognition Using Context-expanded Transformers [56.56220390953412]
We extend our prior work by introducing the Conformer architecture to further improve the accuracy.
We demonstrate that the extended Transformer provides state-of-the-art end-to-end ASR performance.
arXiv Detail & Related papers (2021-04-19T16:18:00Z)
- An Approach to Improve Robustness of NLP Systems against ASR Errors [39.57253455717825]
Speech-enabled systems typically first convert audio to text through an automatic speech recognition model and then feed the text to downstream natural language processing modules.
The errors of the ASR system can seriously downgrade the performance of the NLP modules.
Previous work has shown it is effective to employ data augmentation methods to solve this problem by injecting ASR noise during the training process.
arXiv Detail & Related papers (2021-03-25T05:15:43Z)
- Multi-task Language Modeling for Improving Speech Recognition of Rare Words [14.745696312889763]
We propose a second-pass system with multi-task learning, utilizing semantic targets (such as intent and slot prediction) to improve speech recognition performance.
Our best ASR system with a multi-task LM shows a 4.6% WERR compared with an RNN Transducer-only ASR baseline for rare-word recognition.
arXiv Detail & Related papers (2020-11-23T20:40:44Z)
- Improving Readability for Automatic Speech Recognition Transcription [50.86019112545596]
We propose a novel NLP task called ASR post-processing for readability (APR).
APR aims to transform the noisy ASR output into a readable text for humans and downstream tasks while maintaining the semantic meaning of the speaker.
We compare fine-tuned models based on several open-sourced and adapted pre-trained models with the traditional pipeline method.
arXiv Detail & Related papers (2020-04-09T09:26:42Z)
- Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and the LU systems that follow it can be reduced significantly, by 14% relative, with joint models trained using small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.