Explanations for Automatic Speech Recognition
- URL: http://arxiv.org/abs/2302.14062v1
- Date: Mon, 27 Feb 2023 11:09:19 GMT
- Title: Explanations for Automatic Speech Recognition
- Authors: Xiaoliang Wu, Peter Bell, Ajitha Rajan
- Abstract summary: We provide an explanation for an ASR transcription as a subset of audio frames.
We adapt existing explainable AI techniques from image classification: Statistical Fault Localisation (SFL) and causal analysis.
We evaluate the quality of the explanations generated by the proposed techniques over three different ASR systems (Google API, the baseline Sphinx model, and Deepspeech) using 100 audio samples from the Commonvoice dataset.
- Score: 9.810810252231812
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We address quality assessment for neural network based ASR by providing
explanations that help increase our understanding of the system and ultimately
help build trust in the system. Compared to simple classification labels,
explaining transcriptions is more challenging: judging their correctness is
not straightforward, and transcriptions, being variable-length sequences, are
not handled by existing interpretable machine learning models. We provide an
explanation for an ASR transcription as a subset of audio frames that is both a
minimal and sufficient cause of the transcription. To do this, we adapt
existing explainable AI (XAI) techniques from image classification:
Statistical Fault Localisation (SFL) and causal analysis. Additionally, we use
an adapted version of
Local Interpretable Model-Agnostic Explanations (LIME) for ASR as a baseline in
our experiments. We evaluate the quality of the explanations generated by the
proposed techniques over three different ASR systems (Google API, the baseline
Sphinx model, and Deepspeech) using 100 audio samples from the Commonvoice
dataset.
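The search for such an explanation is easiest to see as a perturbation loop: silence audio frames and keep only those needed to reproduce the transcription. Below is a minimal Python sketch of that idea under stated assumptions; the `transcribe` callable stands in for any of the ASR systems above, and the 25 ms frame size and silencing-based masking are illustrative choices, not the authors' exact SFL/causal machinery.

```python
# A minimal sketch: find a subset of audio frames that is a sufficient cause
# of the transcription, then shrink it greedily toward minimality.
# Assumptions: `transcribe` stands in for any ASR system; 25 ms frames at
# 16 kHz; masking a frame means silencing it.
import numpy as np

FRAME_LEN = 400  # samples per frame (25 ms at 16 kHz) -- assumed

def split_frames(audio: np.ndarray) -> list[np.ndarray]:
    """Partition the waveform into fixed-length frames."""
    return [audio[i:i + FRAME_LEN] for i in range(0, len(audio), FRAME_LEN)]

def mask_complement(frames: list[np.ndarray], keep: set[int]) -> np.ndarray:
    """Silence every frame whose index is not in `keep`."""
    kept = [f if i in keep else np.zeros_like(f) for i, f in enumerate(frames)]
    return np.concatenate(kept)

def sufficient_subset(audio: np.ndarray, target: str, transcribe) -> set[int]:
    frames = split_frames(audio)
    keep = set(range(len(frames)))      # all frames: trivially sufficient
    for i in range(len(frames)):        # one greedy deletion pass
        trial = keep - {i}
        if transcribe(mask_complement(frames, trial)) == target:
            keep = trial                # frame i was not needed
    return keep                         # sufficient, approximately minimal
```

The surviving indices are the explanation: the audio the recognizer actually relied on. A single greedy pass only approximates minimality; the SFL and causal techniques in the paper score frames more systematically.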
Related papers
- It's Never Too Late: Fusing Acoustic Information into Large Language
Models for Automatic Speech Recognition [70.77292069313154]
Large language models (LLMs) can be successfully used for generative error correction (GER) on top of the automatic speech recognition (ASR) output.
In this work, we aim to overcome such a limitation by infusing acoustic information before generating the predicted transcription through a novel late fusion solution termed Uncertainty-Aware Dynamic Fusion (UADF).
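As a rough illustration of the late-fusion idea, one can blend the LLM's next-token distribution with the acoustic model's, deferring to the acoustics when the LLM is uncertain. The normalized-entropy weight below is an assumption made for illustration, not UADF's actual fusion rule.

```python
import numpy as np

def fuse_step(p_llm: np.ndarray, p_asr: np.ndarray) -> np.ndarray:
    """Blend LLM and acoustic token distributions for one decoding step."""
    entropy = -np.sum(p_llm * np.log(p_llm + 1e-12))
    lam = entropy / np.log(len(p_llm))   # 0 = confident LLM, 1 = max uncertainty
    fused = (1.0 - lam) * p_llm + lam * p_asr  # assumed weighting, not UADF's
    return fused / fused.sum()
```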
arXiv Detail & Related papers (2024-02-08T07:21:45Z)
- HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models [81.56455625624041]
We introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction.
The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses and corresponding accurate transcriptions.
LLMs with a reasonable prompt and their generative capability can even correct tokens that are missing from the N-best list.
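In that spirit, generative error correction can be sketched as prompting an instruction-tuned LLM with the N-best list. The prompt wording and the `generate` callable below are assumptions, not the benchmark's official recipe.

```python
def build_ger_prompt(nbest: list[str]) -> str:
    """Format an N-best list as a generative error-correction prompt."""
    hyps = "\n".join(f"{i + 1}. {h}" for i, h in enumerate(nbest))
    return (
        "Below are N-best hypotheses from a speech recognizer. Write the "
        "most likely true transcription, fixing errors and recovering words "
        "missing from every hypothesis.\n" + hyps + "\nTranscription:"
    )

def correct(nbest: list[str], generate) -> str:
    """`generate` is any text-in/text-out LLM call (assumed interface)."""
    return generate(build_ger_prompt(nbest)).strip()
```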
arXiv Detail & Related papers (2023-09-27T14:44:10Z)
- Can We Trust Explainable AI Methods on ASR? An Evaluation on Phoneme Recognition [9.810810252231812]
Interest in using XAI techniques to explain deep learning-based automatic speech recognition (ASR) is emerging.
We adapt a state-of-the-art XAI technique from the image classification domain, Local Interpretable Model-Agnostic Explanations (LIME) to a model trained for a TIMIT-based phoneme recognition task.
We find that a variant of LIME based on time-partitioned audio segments, which we propose in this paper, produces the most reliable explanations.
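A hedged sketch of the time-partitioned variant: silence random subsets of equal-length segments, measure how much of the original output survives, and fit a linear surrogate whose weights rank the segments. The segment count, silencing perturbation, and word-overlap fidelity score are illustrative assumptions.

```python
import numpy as np

def lime_audio(audio, transcribe, n_segments=10, n_samples=200, seed=0):
    """LIME over time-partitioned audio; returns per-segment importances."""
    rng = np.random.default_rng(seed)
    ref = set(transcribe(audio).split())
    bounds = np.linspace(0, len(audio), n_segments + 1, dtype=int)
    masks = rng.integers(0, 2, size=(n_samples, n_segments))
    scores = np.empty(n_samples)
    for k, mask in enumerate(masks):
        perturbed = audio.copy()
        for j in range(n_segments):
            if not mask[j]:
                perturbed[bounds[j]:bounds[j + 1]] = 0.0  # silence segment j
        hyp = set(transcribe(perturbed).split())
        scores[k] = len(hyp & ref) / max(len(ref), 1)  # crude fidelity score
    X = np.hstack([masks.astype(float), np.ones((n_samples, 1))])  # +intercept
    coef, *_ = np.linalg.lstsq(X, scores, rcond=None)
    return coef[:-1]  # higher weight = segment matters more to the output
```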
arXiv Detail & Related papers (2023-05-29T11:04:13Z)
- Sequence-level self-learning with multiple hypotheses [53.04725240411895]
We develop new self-learning techniques with an attention-based sequence-to-sequence (seq2seq) model for automatic speech recognition (ASR).
In contrast to conventional unsupervised learning approaches, we adopt the multi-task learning (MTL) framework.
Our experiment results show that our method can reduce the WER on the British speech data from 14.55% to 10.36% compared to the baseline model trained with the US English data only.
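One way to read the multi-hypothesis idea: rather than committing to the 1-best pseudo-label, treat each N-best hypothesis as a task and combine their training losses, weighted by the decoder's own confidence. The softmax-posterior weighting below is an assumption about how such an MTL combination could look, not the paper's exact objective.

```python
import numpy as np

def multi_hypothesis_loss(hyp_losses: np.ndarray,
                          hyp_log_scores: np.ndarray) -> float:
    """Combine per-hypothesis seq2seq losses, MTL-style.
    hyp_losses: training loss against each N-best pseudo-label;
    hyp_log_scores: decoder log-likelihood of each hypothesis."""
    w = np.exp(hyp_log_scores - hyp_log_scores.max())
    w /= w.sum()                       # posterior over hypotheses
    return float(np.dot(w, hyp_losses))
```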
arXiv Detail & Related papers (2021-12-10T20:47:58Z)
- Attention-based Multi-hypothesis Fusion for Speech Summarization [83.04957603852571]
Speech summarization can be achieved by combining automatic speech recognition (ASR) and text summarization (TS).
ASR errors directly affect the quality of the output summary in the cascade approach.
We propose a cascade speech summarization model that is robust to ASR errors and that exploits multiple hypotheses generated by ASR to attenuate the effect of ASR errors on the summary.
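One simple way to let a summarizer's attention arbitrate among hypotheses is to serialize the N-best list into a single input with separator tokens; the `<hyp>` separator and best-first ordering are illustrative assumptions, not the paper's exact fusion architecture.

```python
SEP = " <hyp> "

def serialize_nbest(nbest: list[tuple[str, float]]) -> str:
    """Join (hypothesis, score) pairs best-first for a single encoder pass."""
    ranked = sorted(nbest, key=lambda pair: pair[1], reverse=True)
    return SEP.join(text for text, _ in ranked)

# serialize_nbest([("the cat sat", -1.2), ("the cap sat", -2.5)])
# -> "the cat sat <hyp> the cap sat"
```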
arXiv Detail & Related papers (2021-11-16T03:00:29Z)
- Hallucination of speech recognition errors with sequence to sequence learning [16.39332236910586]
When plain text data is to be used to train systems for spoken language understanding or ASR, a proven strategy is to hallucinate what the ASR outputs would be given a gold transcription.
We present novel end-to-end models to directly predict hallucinated ASR word sequence outputs, conditioning on an input word sequence as well as a corresponding phoneme sequence.
This improves prior published results for recall of errors from an in-domain ASR system's transcription of unseen data, as well as an out-of-domain ASR system's transcriptions of audio from an unrelated task.
arXiv Detail & Related papers (2021-03-23T02:09:39Z)
- WER-BERT: Automatic WER Estimation with BERT in a Balanced Ordinal Classification Paradigm [0.0]
We propose a new balanced paradigm for e-WER (automatic WER estimation) in a classification setting.
Within this paradigm, we also propose WER-BERT, a BERT-based architecture with speech features for e-WER.
The results and experiments demonstrate that WER-BERT establishes a new state-of-the-art in automatic WER estimation.
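The balanced ordinal framing can be sketched as quantile-binning: cut the continuous WER range at training-set quantiles so each class is equally populated, classify into bins, and decode an estimate as the probability-weighted bin centre. The bin count and decoding rule are assumptions; the BERT encoder itself is omitted.

```python
import numpy as np

def balanced_bins(train_wers: np.ndarray, n_bins: int = 10) -> np.ndarray:
    """Bin edges at WER quantiles, so each ordinal class has equal mass."""
    return np.quantile(train_wers, np.linspace(0.0, 1.0, n_bins + 1))

def wer_to_class(wer: float, edges: np.ndarray) -> int:
    """Map a WER value to its ordinal class index."""
    i = np.searchsorted(edges, wer, side="right") - 1
    return int(np.clip(i, 0, len(edges) - 2))

def decode_wer(class_probs: np.ndarray, edges: np.ndarray) -> float:
    """Expected-value decoding: probability-weighted bin centres."""
    centres = (edges[:-1] + edges[1:]) / 2.0
    return float(np.dot(class_probs, centres))
```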
arXiv Detail & Related papers (2021-01-14T07:26:28Z)
- Adapting End-to-End Speech Recognition for Readable Subtitles [15.525314212209562]
In some use cases such as subtitling, verbatim transcription would reduce output readability given limited screen size and reading time.
We first investigate a cascaded system, where an unsupervised compression model is used to post-edit the transcribed speech.
Experiments show that with limited data, far less than needed to train a model from scratch, we can adapt a Transformer-based ASR model to incorporate both transcription and compression capabilities.
arXiv Detail & Related papers (2020-05-25T14:42:26Z)
- Improving Readability for Automatic Speech Recognition Transcription [50.86019112545596]
We propose a novel NLP task called ASR post-processing for readability (APR).
APR aims to transform the noisy ASR output into a readable text for humans and downstream tasks while maintaining the semantic meaning of the speaker.
We compare fine-tuned models based on several open-sourced and adapted pre-trained models with the traditional pipeline method.
arXiv Detail & Related papers (2020-04-09T09:26:42Z)
- Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and downstream LU systems can be reduced significantly, by 14% relative, with joint models trained using small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information and is not responsible for any consequences.