WER we are and WER we think we are
- URL: http://arxiv.org/abs/2010.03432v1
- Date: Wed, 7 Oct 2020 14:20:31 GMT
- Title: WER we are and WER we think we are
- Authors: Piotr Szymański, Piotr Żelasko, Mikolaj Morzy, Adrian Szymczak,
Marzena Żyła-Hoppe, Joanna Banaszczak, Lukasz Augustyniak, Jan Mizgajski
and Yishay Carmiel
- Abstract summary: We express skepticism towards the recent reports of very low Word Error Rates (WERs) achieved by modern Automatic Speech Recognition (ASR) systems on benchmark datasets.
We compare three state-of-the-art commercial ASR systems on an internal dataset of real-life spontaneous human conversations and the HUB'05 public benchmark.
We formulate a set of guidelines which may aid in the creation of real-life, multi-domain datasets with high-quality annotations for training and testing of robust ASR systems.
- Score: 11.819335591315316
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Natural language processing of conversational speech requires the
availability of high-quality transcripts. In this paper, we express our
skepticism towards the recent reports of very low Word Error Rates (WERs)
achieved by modern Automatic Speech Recognition (ASR) systems on benchmark
datasets. We outline several problems with popular benchmarks and compare three
state-of-the-art commercial ASR systems on an internal dataset of real-life
spontaneous human conversations and the HUB'05 public benchmark. We show that WERs
are significantly higher than the best reported results. We formulate a set of
guidelines which may aid in the creation of real-life, multi-domain datasets
with high-quality annotations for training and testing of robust ASR systems.
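Since the discussion turns entirely on WER, a brief refresher may help: WER is the word-level edit (Levenshtein) distance between hypothesis and reference, normalized by reference length, i.e. WER = (S + D + I) / N for S substitutions, D deletions, and I insertions against N reference words. The sketch below is illustrative only, not the authors' evaluation code:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = word-level edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("we express our skepticism", "we expressed skepticism"))  # 0.5
```

Note that WER can exceed 100% when the hypothesis contains many insertions, which is one reason single-number comparisons across differently annotated test sets can mislead.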
Related papers
- ASR Benchmarking: Need for a More Representative Conversational Dataset [3.017953715883516]
We introduce a multilingual conversational dataset, derived from TalkBank, consisting of unstructured phone conversations between adults.
Our results show a significant performance drop across various state-of-the-art ASR models when tested in conversational settings.
arXiv Detail & Related papers (2024-09-18T15:03:04Z)
- WER We Stand: Benchmarking Urdu ASR Models [3.5001789247699535]
This paper presents a comprehensive evaluation of Urdu Automatic Speech Recognition (ASR) models.
We analyze the performance of three ASR model families: Whisper, MMS, and Seamless-M4T, using Word Error Rate (WER).
We find that seamless-large outperforms other ASR models on the read speech dataset, while whisper-large performs best on the conversational speech dataset.
arXiv Detail & Related papers (2024-09-17T15:00:31Z)
- Towards interfacing large language models with ASR systems using confidence measures and prompting [54.39667883394458]
This work investigates post-hoc correction of ASR transcripts with large language models (LLMs).
To avoid introducing errors into likely accurate transcripts, we propose a range of confidence-based filtering methods.
Our results indicate that this can improve the performance of less competitive ASR systems.
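A minimal sketch of the confidence-based filtering idea, assuming per-segment ASR confidences and some LLM correction function (the segment structure and `llm_correct` are placeholders, not the paper's API):

```python
from typing import Callable

def correct_low_confidence(
    segments: list[dict],               # each: {"text": str, "confidence": float}
    llm_correct: Callable[[str], str],  # e.g. a prompted LLM call
    threshold: float = 0.9,
) -> list[str]:
    out = []
    for seg in segments:
        if seg["confidence"] >= threshold:
            out.append(seg["text"])               # likely accurate: leave as-is
        else:
            out.append(llm_correct(seg["text"]))  # low confidence: ask the LLM
    return out
```

The threshold trades off correction coverage against the risk of the LLM rewriting transcripts that were already correct.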
arXiv Detail & Related papers (2024-07-31T08:00:41Z) - Speech Emotion Recognition with ASR Transcripts: A Comprehensive Study on Word Error Rate and Fusion Techniques [17.166092544686553]
This study benchmarks Speech Emotion Recognition using ASR transcripts with varying Word Error Rates (WERs) from eleven models on three well-known corpora.
We propose a unified ASR error-robust framework integrating ASR error correction and modality-gated fusion, achieving lower WER and better SER results compared to the best-performing ASR transcript.
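A minimal sketch of a modality-gated fusion layer (assumed shapes, in PyTorch; not the authors' exact architecture): a learned sigmoid gate decides, per utterance, how much to trust acoustic versus transcript features, so high-WER transcripts can be down-weighted.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, audio_dim: int, text_dim: int, hidden: int):
        super().__init__()
        self.proj_a = nn.Linear(audio_dim, hidden)
        self.proj_t = nn.Linear(text_dim, hidden)
        self.gate = nn.Linear(2 * hidden, hidden)

    def forward(self, a: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        a, t = self.proj_a(a), self.proj_t(t)
        g = torch.sigmoid(self.gate(torch.cat([a, t], dim=-1)))  # elementwise gate in (0, 1)
        return g * a + (1 - g) * t  # convex combination of the two modalities
```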
arXiv Detail & Related papers (2024-06-12T15:59:25Z)
- A Reference-less Quality Metric for Automatic Speech Recognition via Contrastive-Learning of a Multi-Language Model with Self-Supervision [0.20999222360659603]
This work proposes a referenceless quality metric, which allows comparing the performance of different ASR models on a speech dataset without ground truth transcriptions.
To estimate the quality of ASR hypotheses, a pre-trained language model (LM) is fine-tuned with contrastive learning in a self-supervised learning manner.
The proposed referenceless metric obtains a much higher correlation with WER scores and their ranks than the perplexity metric from the state-of-the-art multi-lingual LM in all experiments.
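A minimal sketch of the contrastive objective under common assumptions (PyTorch, a scorer module mapping a hypothesis to a scalar; the pairing and model details here are illustrative, not the paper's):

```python
import torch
import torch.nn as nn

margin_loss = nn.MarginRankingLoss(margin=0.1)

def pairwise_step(scorer: nn.Module,
                  better: torch.Tensor,
                  worse: torch.Tensor) -> torch.Tensor:
    """better/worse: batched token ids for two hypotheses of the same audio,
    whose relative quality is known without a reference transcript."""
    s_better = scorer(better)           # scalar quality score per hypothesis
    s_worse = scorer(worse)
    target = torch.ones_like(s_better)  # +1 means: first argument should score higher
    return margin_loss(s_better, s_worse, target)
```

Training on such known-ordered pairs lets the fine-tuned LM rank hypotheses by quality without any ground-truth transcript.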
arXiv Detail & Related papers (2023-06-21T21:33:39Z)
- NoRefER: a Referenceless Quality Metric for Automatic Speech Recognition via Semi-Supervised Language Model Fine-Tuning with Contrastive Learning [0.20999222360659603]
NoRefER is a novel referenceless quality metric for automatic speech recognition (ASR) systems.
NoRefER exploits the known quality relationships between hypotheses from multiple compression levels of an ASR for learning to rank intra-sample hypotheses by quality.
The results show that NoRefER correlates highly with reference-based metrics and their intra-sample ranks, indicating high potential for referenceless ASR evaluation and A/B testing.
arXiv Detail & Related papers (2023-06-21T21:26:19Z)
- A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech [94.64927912924087]
We train TTS systems using real-world speech from YouTube and podcasts.
The recent Text-to-Speech architecture is designed for multiple code generation and monotonic alignment.
We show that this architecture outperforms existing TTS systems in several objective and subjective measures.
arXiv Detail & Related papers (2023-02-08T17:34:32Z)
- An Exploration of Self-Supervised Pretrained Representations for End-to-End Speech Recognition [98.70304981174748]
We focus on general applications of pretrained speech representations to advanced end-to-end automatic speech recognition (E2E-ASR) models.
We select several pretrained speech representations and present the experimental results on various open-source and publicly available corpora for E2E-ASR.
arXiv Detail & Related papers (2021-10-09T15:06:09Z)
- You Do Not Need More Data: Improving End-To-End Speech Recognition by Text-To-Speech Data Augmentation [59.31769998728787]
We build our TTS system on an ASR training database and then extend the data with synthesized speech to train a recognition model.
Our system establishes a competitive result for end-to-end ASR trained on the LibriSpeech train-clean-100 set, with a WER of 4.3% on test-clean and 13.5% on test-other.
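A minimal sketch of the augmentation recipe (all names here, `tts.synthesize` included, are placeholders rather than a specific library): synthesize speech for additional text with a TTS model trained on the ASR corpus, then train the recognizer on a mix of real and synthetic utterances.

```python
import random

def build_training_set(real_utts, extra_texts, tts, synth_ratio=1.0):
    """Mix real utterances with TTS output, capping the synthetic share."""
    synthetic = [tts.synthesize(text) for text in extra_texts]
    k = min(len(synthetic), int(synth_ratio * len(real_utts)))
    mixed = real_utts + random.sample(synthetic, k)
    random.shuffle(mixed)
    return mixed
```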
arXiv Detail & Related papers (2020-05-14T17:24:57Z)
- Improving Readability for Automatic Speech Recognition Transcription [50.86019112545596]
We propose a novel NLP task called ASR post-processing for readability (APR).
APR aims to transform the noisy ASR output into a readable text for humans and downstream tasks while maintaining the semantic meaning of the speaker.
We compare fine-tuned models based on several open-source and adapted pre-trained models with the traditional pipeline method.
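As a rough sketch of the task shape (not the paper's models or data), APR can be framed as text-to-text rewriting with a pretrained encoder-decoder; t5-small and the prompt below are stand-ins, and the model would only produce readable output after fine-tuning on (noisy ASR output, readable transcript) pairs:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# Hypothetical input; a real APR model is fine-tuned on (ASR output, clean text) pairs.
asr_output = "um so we uh we met on on tuesday right"
inputs = tokenizer("make readable: " + asr_output, return_tensors="pt")
ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(ids[0], skip_special_tokens=True))
```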
arXiv Detail & Related papers (2020-04-09T09:26:42Z)
- Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and downstream LU systems can be reduced significantly, by 14% relative, with joint models trained on small amounts of in-domain data.
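A minimal sketch of the joint objective (the weights and heads are assumptions, not the paper's exact setup): a shared encoder feeds a correction head and an LU head, trained with a weighted sum of the two losses so that each task regularizes the other.

```python
import torch
import torch.nn as nn

ce = nn.CrossEntropyLoss()

def joint_loss(corr_logits: torch.Tensor, corr_targets: torch.Tensor,
               lu_logits: torch.Tensor, lu_targets: torch.Tensor,
               alpha: float = 0.5) -> torch.Tensor:
    # correction head: token-level logits of shape (num_tokens, vocab);
    # LU head: utterance-level intent logits of shape (batch, num_intents)
    return alpha * ce(corr_logits, corr_targets) + (1 - alpha) * ce(lu_logits, lu_targets)
```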
arXiv Detail & Related papers (2020-01-28T22:09:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.