WER We Stand: Benchmarking Urdu ASR Models
- URL: http://arxiv.org/abs/2409.11252v2
- Date: Mon, 4 Nov 2024 07:27:37 GMT
- Title: WER We Stand: Benchmarking Urdu ASR Models
- Authors: Samee Arif, Sualeha Farid, Aamina Jamal Khan, Mustafa Abbas, Agha Ali Raza, Awais Athar
- Abstract summary: This paper presents a comprehensive evaluation of Urdu Automatic Speech Recognition (ASR) models.
We analyze the performance of three ASR model families: Whisper, MMS, and Seamless-M4T, using Word Error Rate (WER).
We find that seamless-large outperforms other ASR models on the read speech dataset, while whisper-large performs best on the conversational speech dataset.
- Score: 3.5001789247699535
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents a comprehensive evaluation of Urdu Automatic Speech Recognition (ASR) models. We analyze the performance of three ASR model families: Whisper, MMS, and Seamless-M4T using Word Error Rate (WER), along with a detailed examination of the most frequent wrong words and error types including insertions, deletions, and substitutions. Our analysis is conducted using two types of datasets, read speech and conversational speech. Notably, we present the first conversational speech dataset designed for benchmarking Urdu ASR models. We find that seamless-large outperforms other ASR models on the read speech dataset, while whisper-large performs best on the conversational speech dataset. Furthermore, this evaluation highlights the complexities of assessing ASR models for low-resource languages like Urdu using quantitative metrics alone and emphasizes the need for a robust Urdu text normalization system. Our findings contribute valuable insights for developing robust ASR systems for low-resource languages like Urdu.
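WER is computed over a word-level Levenshtein alignment between reference and hypothesis: WER = (S + D + I) / N, where S, D, and I are substitution, deletion, and insertion counts and N is the number of reference words. The sketch below is a minimal illustration of that computation, not the paper's evaluation code; its normalize step is a deliberately crude stand-in for the robust Urdu text normalization system the authors call for.

```python
import re

def normalize(text: str) -> str:
    """Crude placeholder normalization: strip punctuation and collapse
    whitespace. A robust Urdu normalizer would also have to unify
    orthographic variants and optional diacritics."""
    text = re.sub(r"[^\w\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()

def wer_breakdown(reference: str, hypothesis: str) -> tuple[float, int, int, int]:
    """Return (wer, substitutions, deletions, insertions) from a
    word-level Levenshtein alignment."""
    ref, hyp = normalize(reference).split(), normalize(hypothesis).split()
    # d[i][j] = minimum edit cost to turn ref[:i] into hyp[:j].
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + cost,  # match / substitute
                          d[i - 1][j] + 1,         # delete a reference word
                          d[i][j - 1] + 1)         # insert a hypothesis word
    # Backtrace the alignment to count each error type.
    subs = dels = ins = 0
    i, j = len(ref), len(hyp)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and ref[i - 1] == hyp[j - 1] and d[i][j] == d[i - 1][j - 1]:
            i, j = i - 1, j - 1
        elif i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + 1:
            subs += 1; i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            dels += 1; i -= 1
        else:
            ins += 1; j -= 1
    return (subs + dels + ins) / max(len(ref), 1), subs, dels, ins
```

Aggregating the per-type counts over a test set enables the kind of insertion/deletion/substitution analysis the paper reports; in practice a library such as jiwer is typically used instead of hand-rolled dynamic programming.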
Related papers
- Improving Multilingual ASR in the Wild Using Simple N-best Re-ranking [68.77659513993507]
We present a simple and effective N-best re-ranking approach to improve multilingual ASR accuracy.
Our results show spoken language identification accuracy improvements of 8.7% and 6.1% on two benchmarks, along with word error rates that are 3.3% and 2.0% lower.
arXiv Detail & Related papers (2024-09-27T03:31:32Z)
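The snippet above does not spell out which scores the re-ranker combines, so the following is only a generic log-linear N-best re-ranking sketch with hypothetical field names, not the paper's method:

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    text: str
    asr_score: float  # log-probability from the ASR decoder
    lm_score: float   # log-probability from an external language model

def rerank(nbest: list[Hypothesis], lm_weight: float = 0.5) -> str:
    """Pick the hypothesis with the best weighted combination of ASR
    and LM scores; lm_weight would be tuned on a development set."""
    return max(nbest, key=lambda h: h.asr_score + lm_weight * h.lm_score).text
```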
- IndicVoices-R: Unlocking a Massive Multilingual Multi-speaker Speech Corpus for Scaling Indian TTS [0.9092013845117769]
IndicVoices-R (IV-R) is the largest multilingual Indian TTS dataset derived from an ASR dataset.
IV-R matches the quality of gold-standard TTS datasets like LJSpeech, LibriTTS, and IndicTTS.
We release the first TTS model for all 22 official Indian languages.
arXiv Detail & Related papers (2024-09-09T06:28:47Z)
- Enabling ASR for Low-Resource Languages: A Comprehensive Dataset Creation Approach [0.6445605125467574]
This study introduces a novel pipeline designed to generate ASR training datasets from audiobooks.
The common structure of these audiobooks poses a unique challenge due to the extensive length of audio segments.
We propose a method for effectively aligning audio with its corresponding text and segmenting it into lengths suitable for ASR training.
arXiv Detail & Related papers (2024-06-03T15:38:40Z)
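The summary does not give the alignment method itself; assuming word-level timestamps are already available from some forced aligner, a minimal segmentation pass might look like this (all names hypothetical):

```python
def segment_words(words: list[tuple[str, float, float]],
                  max_dur: float = 20.0) -> list[tuple[str, float, float]]:
    """Group (word, start, end) tuples into contiguous (text, start, end)
    segments no longer than max_dur seconds, a typical length budget
    for ASR training utterances."""
    segments, current, seg_start = [], [], 0.0
    for word, start, end in words:
        if not current:
            seg_start = start
        elif end - seg_start > max_dur:
            # Adding this word would overrun the budget: flush the segment.
            segments.append((" ".join(w for w, _, _ in current),
                             seg_start, current[-1][2]))
            current, seg_start = [], start
        current.append((word, start, end))
    if current:
        segments.append((" ".join(w for w, _, _ in current),
                         seg_start, current[-1][2]))
    return segments
```

A real pipeline would prefer to cut at silences or sentence boundaries rather than at an arbitrary word boundary, but the length-budget logic is the same.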
- Miipher: A Robust Speech Restoration Model Integrating Self-Supervised Speech and Text Representations [51.89856133895233]
Speech restoration (SR) is the task of converting degraded speech signals into high-quality ones.
In this study, we propose a robust SR model called Miipher, and apply Miipher to a new SR application.
To make our SR model robust against various forms of degradation, we use (i) a speech representation extracted from w2v-BERT as the input feature, and (ii) a text representation extracted from transcripts via PnG-BERT as a linguistic conditioning feature.
arXiv Detail & Related papers (2023-03-03T01:57:16Z)
- Snow Mountain: Dataset of Audio Recordings of The Bible in Low Resource Languages [0.6193838300896449]
We release an open-licensed and formatted dataset of audio recordings of the Bible in low-resource northern Indian languages.
We set up multiple experimental splits, and train and analyze two competitive ASR models to serve as baselines for future research using this data.
arXiv Detail & Related papers (2022-06-01T18:22:01Z)
- Attention-based Multi-hypothesis Fusion for Speech Summarization [83.04957603852571]
Speech summarization can be achieved by combining automatic speech recognition (ASR) and text summarization (TS).
ASR errors directly affect the quality of the output summary in the cascade approach.
We propose a cascade speech summarization model that is robust to ASR errors, exploiting multiple ASR hypotheses to attenuate their effect on the summary.
arXiv Detail & Related papers (2021-11-16T03:00:29Z)
- Semi-Supervised Spoken Language Understanding via Self-Supervised Speech and Language Model Pretraining [64.35907499990455]
We propose a framework to learn semantics directly from speech with semi-supervision from transcribed or untranscribed speech.
Our framework is built upon pretrained end-to-end (E2E) ASR and self-supervised language models, such as BERT.
In parallel, we identify two essential criteria for evaluating SLU models: environmental noise-robustness and E2E semantics evaluation.
arXiv Detail & Related papers (2020-10-26T18:21:27Z)
- WER we are and WER we think we are [11.819335591315316]
We express skepticism towards the recent reports of very low Word Error Rates (WERs) achieved by modern Automatic Speech Recognition (ASR) systems on benchmark datasets.
We compare three state-of-the-art commercial ASR systems on an internal dataset of real-life spontaneous human conversations and on the HUB'05 public benchmark.
We formulate a set of guidelines which may aid in the creation of real-life, multi-domain datasets with high quality annotations for training and testing of robust ASR systems.
arXiv Detail & Related papers (2020-10-07T14:20:31Z)
- LRSpeech: Extremely Low-Resource Speech Synthesis and Recognition [148.43282526983637]
We develop LRSpeech, a TTS and ASR system that supports rare languages at low data cost.
We conduct experiments on an experimental language (English) and a truly low-resource language (Lithuanian) to verify the effectiveness of LRSpeech.
We are currently deploying LRSpeech into a commercialized cloud speech service to support TTS for more rare languages.
arXiv Detail & Related papers (2020-08-09T08:16:33Z)
- Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and downstream LU systems can be reduced significantly, by 14% relative, with joint models trained on small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)