ASR Error Detection via Audio-Transcript Entailment
- URL: http://arxiv.org/abs/2207.10849v1
- Date: Fri, 22 Jul 2022 02:47:15 GMT
- Title: ASR Error Detection via Audio-Transcript Entailment
- Authors: Nimshi Venkat Meripo, Sandeep Konam
- Abstract summary: We propose an end-to-end approach for ASR error detection using audio-transcript entailment.
The proposed model utilizes an acoustic encoder and a linguistic encoder to model the speech and transcript respectively.
Our proposed model achieves classification error rates (CER) of 26.2% on all transcription errors and 23% on medical errors specifically, leading to improvements upon a strong baseline by 12% and 15.4%, respectively.
- Score: 1.3750624267664155
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite the improved performance of the latest Automatic Speech Recognition
(ASR) systems, transcription errors remain unavoidable. These errors can
have a considerable impact in critical domains such as healthcare, where ASR output
is used to help with clinical documentation. Therefore, detecting ASR errors is a
critical first step in preventing further error propagation to downstream applications.
To this end, we propose a novel end-to-end approach for ASR error detection
using audio-transcript entailment. To the best of our knowledge, we are the
first to frame this problem as an end-to-end entailment task between the audio
segment and its corresponding transcript segment. Our intuition is that audio and
transcript should bidirectionally entail each other when there is no recognition error,
and that this entailment should break down when an error is present. The proposed
model utilizes an acoustic
encoder and a linguistic encoder to model the speech and transcript
respectively. The encoded representations of both modalities are fused to
predict the entailment. Since doctor-patient conversations are used in our
experiments, a particular emphasis is placed on medical terms. Our proposed
model achieves classification error rates (CER) of 26.2% on all transcription
errors and 23% on medical errors specifically, leading to improvements upon a
strong baseline by 12% and 15.4%, respectively.
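Below is a minimal sketch of the dual-encoder design the abstract describes: an acoustic encoder for the audio segment, a linguistic encoder for the ASR transcript segment, and a fusion layer that predicts whether the transcript is entailed by the audio (no recognition error) or not (error present). The specific encoder architectures, mean pooling, concatenation fusion, and all hyperparameters here are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn


class AudioTranscriptEntailmentModel(nn.Module):
    """Minimal sketch of a dual-encoder audio-transcript entailment classifier.

    The simple Transformer encoders below are placeholders; in practice,
    pretrained speech and text encoders could be used instead.
    """

    def __init__(self, n_mels=80, vocab_size=30_000, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        # Acoustic encoder: projects log-mel frames and contextualizes them.
        self.audio_proj = nn.Linear(n_mels, d_model)
        self.audio_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads,
                                       dim_feedforward=4 * d_model,
                                       batch_first=True),
            num_layers=n_layers,
        )
        # Linguistic encoder: embeds transcript tokens and contextualizes them.
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads,
                                       dim_feedforward=4 * d_model,
                                       batch_first=True),
            num_layers=n_layers,
        )
        # Fusion + binary entailment head:
        # class 1 = transcript entailed by audio (no error), class 0 = not entailed (error).
        self.classifier = nn.Sequential(
            nn.Linear(2 * d_model, d_model),
            nn.ReLU(),
            nn.Linear(d_model, 2),
        )

    def forward(self, audio_feats, token_ids):
        # audio_feats: (batch, frames, n_mels); token_ids: (batch, tokens)
        a = self.audio_encoder(self.audio_proj(audio_feats)).mean(dim=1)
        t = self.text_encoder(self.token_emb(token_ids)).mean(dim=1)
        fused = torch.cat([a, t], dim=-1)   # simple concatenation fusion
        return self.classifier(fused)       # entailment logits


if __name__ == "__main__":
    model = AudioTranscriptEntailmentModel()
    audio = torch.randn(2, 120, 80)              # two audio segments, 120 mel frames each
    tokens = torch.randint(0, 30_000, (2, 16))   # their ASR transcript segments
    logits = model(audio, tokens)
    print(logits.shape)                          # torch.Size([2, 2])
```

In a deployment pipeline, segments whose transcript is predicted as not entailed would be flagged for review before the transcript is passed to downstream applications such as clinical documentation.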
Related papers
- A Coin Has Two Sides: A Novel Detector-Corrector Framework for Chinese Spelling Correction [79.52464132360618]
Chinese Spelling Correction (CSC) stands as a foundational Natural Language Processing (NLP) task.
We introduce a novel approach based on error detector-corrector framework.
Our detector is designed to yield two error detection results, each characterized by high precision and recall.
arXiv Detail & Related papers (2024-09-06T09:26:45Z)
- Speaker Tagging Correction With Non-Autoregressive Language Models [0.0]
We propose a speaker tagging correction system based on a non-autoregressive language model.
We show that the employed error correction approach leads to reductions in word diarization error rate (WDER) on two datasets.
arXiv Detail & Related papers (2024-08-30T11:02:17Z)
- Error Correction by Paying Attention to Both Acoustic and Confidence References for Automatic Speech Recognition [52.624909026294105]
We propose a non-autoregressive speech error correction method.
A Confidence Module measures the uncertainty of each word of the N-best ASR hypotheses.
The proposed system reduces the error rate by 21% compared with the ASR model.
arXiv Detail & Related papers (2024-06-29T17:56:28Z)
- Improving Audio Caption Fluency with Automatic Error Correction [23.157732462075547]
We propose a new task of automated audio captioning (AAC) error correction for post-processing AAC outputs.
We use observation-based rules to corrupt error-free captions, generating pseudo grammatically erroneous sentences.
We train a neural network-based model on the synthetic error dataset and apply the model to correct real errors in AAC outputs.
arXiv Detail & Related papers (2023-06-16T13:37:01Z)
- SoftCorrect: Error Correction with Soft Detection for Automatic Speech Recognition [116.31926128970585]
We propose SoftCorrect with a soft error detection mechanism to avoid the limitations of both explicit and implicit error detection.
Compared with implicit error detection with CTC loss, SoftCorrect provides an explicit signal about which words are incorrect.
Experiments on AISHELL-1 and Aidatatang datasets show that SoftCorrect achieves 26.1% and 9.4% CER reduction respectively.
arXiv Detail & Related papers (2022-12-02T09:11:32Z)
- End-to-end contextual asr based on posterior distribution adaptation for hybrid ctc/attention system [61.148549738631814]
End-to-end (E2E) speech recognition architectures assemble all components of a traditional speech recognition system into a single model.
Although this simplifies the ASR system, it introduces a contextual ASR drawback: the E2E model performs worse on utterances containing infrequent proper nouns.
We propose to add a contextual bias attention (CBA) module to attention based encoder decoder (AED) model to improve its ability of recognizing the contextual phrases.
arXiv Detail & Related papers (2022-02-18T03:26:02Z)
- FastCorrect 2: Fast Error Correction on Multiple Candidates for Automatic Speech Recognition [92.12910821300034]
We propose FastCorrect 2, an error correction model that takes multiple ASR candidates as input for better correction accuracy.
FastCorrect 2 achieves better performance than the cascaded re-scoring and correction pipeline.
arXiv Detail & Related papers (2021-09-29T13:48:03Z)
- Improving Distinction between ASR Errors and Speech Disfluencies with Feature Space Interpolation [0.0]
Fine-tuning pretrained language models (LMs) is a popular approach to automatic speech recognition (ASR) error detection during post-processing.
This paper proposes a scheme to improve existing LM-based ASR error detection systems.
arXiv Detail & Related papers (2021-08-04T02:11:37Z)
- Advanced Long-context End-to-end Speech Recognition Using Context-expanded Transformers [56.56220390953412]
We extend our prior work by introducing the Conformer architecture to further improve the accuracy.
We demonstrate that the extended Transformer provides state-of-the-art end-to-end ASR performance.
arXiv Detail & Related papers (2021-04-19T16:18:00Z)
- Hallucination of speech recognition errors with sequence to sequence learning [16.39332236910586]
When plain text data is to be used to train systems for spoken language understanding or ASR, a proven strategy is to hallucinate what the ASR outputs would be given a gold transcription.
We present novel end-to-end models to directly predict hallucinated ASR word sequence outputs, conditioning on an input word sequence as well as a corresponding phoneme sequence.
This improves prior published results for recall of errors from an in-domain ASR system's transcription of unseen data, as well as an out-of-domain ASR system's transcriptions of audio from an unrelated task.
arXiv Detail & Related papers (2021-03-23T02:09:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.