Measuring the Accuracy of Automatic Speech Recognition Solutions
- URL: http://arxiv.org/abs/2408.16287v1
- Date: Thu, 29 Aug 2024 06:38:55 GMT
- Title: Measuring the Accuracy of Automatic Speech Recognition Solutions
- Authors: Korbinian Kuhn, Verena Kersken, Benedikt Reuter, Niklas Egger, Gottfried Zimmermann
- Abstract summary: Automatic Speech Recognition (ASR) is now a part of many popular applications.
We measured the performance of eleven common ASR services with recordings of Higher Education lectures.
Our results show that accuracy ranges widely between vendors and across individual audio samples.
We also measured significantly lower quality for streaming ASR, which is used for live events.
- Score: 4.99320937849508
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: For d/Deaf and hard of hearing (DHH) people, captioning is an essential accessibility tool. Significant developments in artificial intelligence (AI) mean that Automatic Speech Recognition (ASR) is now a part of many popular applications. This makes creating captions easy and broadly available, but transcription needs high levels of accuracy to be accessible. Scientific publications and industry report very low error rates, claiming AI has reached human parity or even outperforms manual transcription. At the same time, the DHH community reports serious issues with the accuracy and reliability of ASR. There seems to be a mismatch between technical innovations and the real-life experience of people who depend on transcription. Independent and comprehensive data is needed to capture the state of ASR. We measured the performance of eleven common ASR services with recordings of Higher Education lectures. We evaluated the influence of technical conditions like streaming, the use of vocabularies, and differences between languages. Our results show that accuracy ranges widely between vendors and across individual audio samples. We also measured significantly lower quality for streaming ASR, which is used for live events. Our study shows that despite recent improvements in ASR, common services lack reliability in accuracy.
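Accuracy in evaluations like this one is typically reported as Word Error Rate (WER): the number of substitutions, deletions, and insertions divided by the number of reference words. The function below is a minimal illustrative sketch of computing WER via Levenshtein distance, not the evaluation code used in the paper.

```python
# Minimal Word Error Rate (WER) computation via Levenshtein distance.
# WER = (substitutions + deletions + insertions) / number of reference words.

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between first i reference and first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("the lecture starts at nine", "the lecture start at nine"))  # 0.2
```

Note that published numbers usually normalize text first (casing, punctuation, number formats), and the choice of normalization can shift WER noticeably between studies.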
Related papers
- Enhancing AAC Software for Dysarthric Speakers in e-Health Settings: An Evaluation Using TORGO [0.13108652488669734]
Individuals with cerebral palsy (CP) and amyotrophic lateral sclerosis (ALS) frequently face challenges with articulation, leading to dysarthria and resulting in atypical speech patterns.
We found that state-of-the-art (SOTA) automatic speech recognition (ASR) technology like Whisper and Wav2vec2.0 marginalizes atypical speakers largely due to the lack of training data.
Our work looks to leverage SOTA ASR followed by domain-specific error correction.
arXiv Detail & Related papers (2024-11-01T19:11:54Z)
- CLAIR-A: Leveraging Large Language Models to Judge Audio Captions [73.51087998971418]
Evaluating machine-generated audio captions is a complex task that requires considering diverse factors.
We propose CLAIR-A, a simple and flexible method that leverages the zero-shot capabilities of large language models.
In our evaluations, CLAIR-A better predicts human judgements of quality compared to traditional metrics.
arXiv Detail & Related papers (2024-09-19T17:59:52Z)
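A minimal sketch of the LLM-as-judge idea behind approaches like CLAIR-A: prompt a language model to score a candidate caption against a reference. The prompt wording and JSON format below are assumptions rather than the authors' exact method, and `complete` is a placeholder for any LLM completion call.

```python
import json
from typing import Callable

def judge_caption(candidate: str, reference: str,
                  complete: Callable[[str], str]) -> dict:
    """Score a machine-generated audio caption against a reference.

    `complete` stands in for any LLM text-completion function; the
    prompt below is illustrative, not the authors' exact wording.
    """
    prompt = (
        "Rate how well a candidate audio caption matches a reference caption.\n"
        f"Candidate: {candidate}\n"
        f"Reference: {reference}\n"
        'Reply with JSON: {"score": <integer 0-100>, "reason": "<one sentence>"}'
    )
    return json.loads(complete(prompt))
```

The score-plus-reason output is what lets such methods both correlate with human judgments and explain them, in contrast to opaque n-gram metrics.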
- Reexamining Racial Disparities in Automatic Speech Recognition Performance: The Role of Confounding by Provenance [7.882996636086014]
It is important that automatic speech recognition (ASR) models and their use are fair and equitable.
The current study seeks to understand the factors underlying these disparities by examining the performance of a current state-of-the-art neural-network-based ASR system.
arXiv Detail & Related papers (2024-07-19T02:14:17Z)
- Speech Emotion Recognition with ASR Transcripts: A Comprehensive Study on Word Error Rate and Fusion Techniques [17.166092544686553]
This study benchmarks Speech Emotion Recognition using ASR transcripts with varying Word Error Rates (WERs) from eleven models on three well-known corpora.
We propose a unified ASR error-robust framework integrating ASR error correction and modality-gated fusion, achieving lower WER and better SER results than the best-performing ASR transcript.
arXiv Detail & Related papers (2024-06-12T15:59:25Z)
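A minimal sketch of what modality-gated fusion can look like: a learned sigmoid gate blends audio and text features per dimension. The layer sizes and PyTorch layout below are assumptions; the paper's actual architecture may differ.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Illustrative modality-gated fusion: a learned gate decides, per
    feature dimension, how much to trust audio vs. text embeddings."""

    def __init__(self, dim: int = 256):  # dim is an arbitrary choice
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, audio: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # Gate computed from both modalities, then a convex combination.
        g = torch.sigmoid(self.gate(torch.cat([audio, text], dim=-1)))
        return g * audio + (1 - g) * text

fused = GatedFusion()(torch.randn(8, 256), torch.randn(8, 256))
print(fused.shape)  # torch.Size([8, 256])
```

The appeal of gating for SER is that when the transcript is noisy (high WER), the gate can learn to lean on the acoustic features instead.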
- You don't understand me!: Comparing ASR results for L1 and L2 speakers of Swedish [0.5249805590164903]
We focus on the gap in performance between recognition results for native and non-native, read and spontaneous, Swedish utterances transcribed by different ASR services.
We compare the recognition results using Word Error Rate and analyze the linguistic factors that may generate the observed transcription errors.
arXiv Detail & Related papers (2024-05-22T06:24:55Z)
- Advanced Long-context End-to-end Speech Recognition Using Context-expanded Transformers [56.56220390953412]
We extend our prior work by introducing the Conformer architecture to further improve the accuracy.
We demonstrate that the extended Transformer provides state-of-the-art end-to-end ASR performance.
arXiv Detail & Related papers (2021-04-19T16:18:00Z)
- Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource End-to-End Speech Recognition [62.94773371761236]
We consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate.
We propose a method of dynamic acoustic unit augmentation based on the BPE-dropout technique.
Our monolingual Turkish Conformer established a competitive result with 22.2% character error rate (CER) and 38.9% word error rate (WER).
arXiv Detail & Related papers (2021-03-12T10:10:13Z)
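A toy illustration of BPE-dropout: during segmentation, each applicable merge is skipped with probability p, so the same word yields varied subword sequences across training steps. The merge table below is invented for illustration; real tables are learned from corpus statistics.

```python
import random

# Toy BPE merge table (value = merge priority, lower merges first).
MERGES = {("l", "o"): 0, ("lo", "w"): 1, ("e", "r"): 2, ("low", "er"): 3}

def bpe_dropout_encode(word: str, p: float = 0.1) -> list[str]:
    """Segment `word` with BPE, skipping each merge with probability p.
    With p=0 this is ordinary deterministic BPE."""
    pieces = list(word)
    while True:
        # Applicable merges that survive dropout at this step.
        candidates = [(MERGES[(a, b)], i)
                      for i, (a, b) in enumerate(zip(pieces, pieces[1:]))
                      if (a, b) in MERGES and random.random() >= p]
        if not candidates:
            return pieces
        _, i = min(candidates)           # apply the highest-priority merge
        pieces[i:i + 2] = [pieces[i] + pieces[i + 1]]

random.seed(0)
for _ in range(3):
    print(bpe_dropout_encode("lower", p=0.3))
```

The varied segmentations act as data augmentation at the acoustic-unit level, which is what helps in low-resource, high-OOV setups.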
- Knowledge Distillation for Improved Accuracy in Spoken Question Answering [63.72278693825945]
We devise a training strategy to perform knowledge distillation from spoken documents and their written counterparts.
Our work takes a step towards distilling knowledge from a language model as a supervision signal.
Experiments demonstrate that our approach outperforms several state-of-the-art language models on the Spoken-SQuAD dataset.
arXiv Detail & Related papers (2020-10-21T15:18:01Z)
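For context, a generic temperature-scaled knowledge-distillation objective (in the style of Hinton et al.) is sketched below; the paper's exact loss over spoken and written documents may differ.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      T: float = 2.0, alpha: float = 0.5):
    """Standard KD objective: soften both distributions with temperature T,
    match them with KL divergence, and mix in cross-entropy on gold labels."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * T * T  # T^2 rescales gradients
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

loss = distillation_loss(torch.randn(4, 10), torch.randn(4, 10),
                         torch.randint(0, 10, (4,)))
print(loss.item())
```

Here the teacher trained on clean written text supervises a student reading noisy ASR output, which is the general shape of the approach described above.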
- WER we are and WER we think we are [11.819335591315316]
We express skepticism towards the recent reports of very low Word Error Rates (WERs) achieved by modern Automatic Speech Recognition (ASR) systems on benchmark datasets.
We compare three state-of-the-art commercial ASR systems on an internal dataset of real-life spontaneous human conversations and the HUB'05 public benchmark.
We formulate a set of guidelines which may aid in the creation of real-life, multi-domain datasets with high quality annotations for training and testing of robust ASR systems.
arXiv Detail & Related papers (2020-10-07T14:20:31Z)
- LRSpeech: Extremely Low-Resource Speech Synthesis and Recognition [148.43282526983637]
We develop LRSpeech, a TTS and ASR system for languages with low data cost.
We conduct experiments on an experimental language (English) and a truly low-resource language (Lithuanian) to verify the effectiveness of LRSpeech.
We are currently deploying LRSpeech into a commercialized cloud speech service to support TTS on more rare languages.
arXiv Detail & Related papers (2020-08-09T08:16:33Z)
- Improving Readability for Automatic Speech Recognition Transcription [50.86019112545596]
We propose a novel NLP task called ASR post-processing for readability (APR).
APR aims to transform the noisy ASR output into a readable text for humans and downstream tasks while maintaining the semantic meaning of the speaker.
We compare fine-tuned models based on several open-sourced and adapted pre-trained models with the traditional pipeline method.
arXiv Detail & Related papers (2020-04-09T09:26:42Z)
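As a rough sketch of what an APR system could look like, a pre-trained seq2seq model can rewrite noisy ASR output into readable text. The checkpoint and prompt prefix below are placeholders, and the model would need fine-tuning on (noisy transcript, readable reference) pairs before it actually performs APR.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# "t5-small" is a placeholder checkpoint: an off-the-shelf model is not
# trained for APR and would first need fine-tuning on transcript pairs.
tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

def postprocess(asr_output: str) -> str:
    """Rewrite noisy ASR text into readable form with a seq2seq model."""
    inputs = tokenizer("fix transcript: " + asr_output, return_tensors="pt")
    ids = model.generate(**inputs, max_new_tokens=128)
    return tokenizer.decode(ids[0], skip_special_tokens=True)

print(postprocess("uh so the the results show um a clear trend"))
```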