A Deep Dive into the Disparity of Word Error Rates Across Thousands of
NPTEL MOOC Videos
- URL: http://arxiv.org/abs/2307.10587v1
- Date: Thu, 20 Jul 2023 05:03:00 GMT
- Title: A Deep Dive into the Disparity of Word Error Rates Across Thousands of
NPTEL MOOC Videos
- Authors: Anand Kumar Rai, Siddharth D Jaiswal, Animesh Mukherjee
- Abstract summary: We describe the curation of a massive speech dataset of 8740 hours consisting of $\sim9.8$K technical lectures in the English language along with their transcripts delivered by instructors representing various parts of Indian demography.
We use the curated dataset to measure the existing disparity in YouTube Automatic Captions and OpenAI Whisper model performance across the diverse demographic traits of speakers in India.
- Score: 4.809236881780707
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic speech recognition (ASR) systems are designed to transcribe spoken
language into written text and find utility in a variety of applications
including voice assistants and transcription services. However, it has been
observed that state-of-the-art ASR systems that deliver impressive benchmark
results struggle with speakers of certain regions or demographics due to
variation in their speech properties. In this work, we describe the curation of
a massive speech dataset of 8740 hours consisting of $\sim9.8$K technical
lectures in the English language along with their transcripts delivered by
instructors representing various parts of Indian demography. The dataset is
sourced from the very popular NPTEL MOOC platform. We use the curated dataset
to measure the existing disparity in YouTube Automatic Captions and OpenAI
Whisper model performance across the diverse demographic traits of speakers in
India. While there exists disparity due to gender, native region, age and
speech rate of speakers, disparity based on caste is non-existent. We also
observe statistically significant disparity across the disciplines of the
lectures. These results indicate the need for more inclusive and robust ASR
systems and more representative datasets for evaluating disparity in them.
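For context on the metric compared throughout, WER is the ratio of word-level substitutions, deletions and insertions to the number of reference words, WER $=(S+D+I)/N$. Below is a minimal, illustrative sketch of such a measurement: it transcribes one lecture clip with an off-the-shelf Whisper checkpoint and scores it against a reference transcript using the open-source jiwer package. The file names, model size and simple text normalization are assumptions for illustration only, not the authors' exact evaluation pipeline.

```python
# Minimal sketch of a WER measurement: transcribe a lecture clip with an
# off-the-shelf Whisper checkpoint and score it against the reference
# transcript. File names, model size, and the simple normalization are
# illustrative assumptions, not the paper's exact pipeline.
import string

import jiwer    # pip install jiwer
import whisper  # pip install openai-whisper


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace before scoring."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


# Hypothetical inputs: one lecture audio clip and its ground-truth transcript.
model = whisper.load_model("small")
hypothesis = model.transcribe("lecture_clip.wav")["text"]

with open("lecture_clip_reference.txt", encoding="utf-8") as f:
    reference = f.read()

# WER = (substitutions + deletions + insertions) / number of reference words.
wer = jiwer.wer(normalize(reference), normalize(hypothesis))
print(f"WER: {wer:.3f}")
```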
Related papers
- Enhancing Indonesian Automatic Speech Recognition: Evaluating Multilingual Models with Diverse Speech Variabilities [9.473861847584843]
We present our research on state-of-the-art speech recognition models, namely Massively Multilingual Speech (MMS) and Whisper.
We investigate the models' predictive ability to transcribe Indonesian speech data across different variability groups.
arXiv Detail & Related papers (2024-10-11T14:07:07Z)
- LAHAJA: A Robust Multi-accent Benchmark for Evaluating Hindi ASR Systems [16.143694951047024]
We create a benchmark, LAHAJA, which contains read and extempore speech on a diverse set of topics and use cases.
We evaluate existing open-source and commercial models on LAHAJA and find their performance to be poor.
We train models using different datasets and find that our model trained on multilingual data with good speaker diversity outperforms existing models by a significant margin.
arXiv Detail & Related papers (2024-08-21T08:51:00Z)
- An Initial Investigation of Language Adaptation for TTS Systems under Low-resource Scenarios [76.11409260727459]
This paper explores the language adaptation capability of ZMM-TTS, a recent SSL-based multilingual TTS system.
We demonstrate that the similarity in phonetics between the pre-training and target languages, as well as the language category, affects the target language's adaptation performance.
arXiv Detail & Related papers (2024-06-13T08:16:52Z)
- Sonos Voice Control Bias Assessment Dataset: A Methodology for Demographic Bias Assessment in Voice Assistants [10.227469020901232]
This paper introduces the Sonos Voice Control Bias Assessment dataset.
It comprises 1,038 speakers, 166 hours of audio and 170k samples, with 9,040 unique labelled transcripts.
Results show statistically significant differences in performance across age, dialectal region and ethnicity.
arXiv Detail & Related papers (2024-05-14T12:53:32Z)
- Accented Speech Recognition With Accent-specific Codebooks [53.288874858671576]
Speech accents pose a significant challenge to state-of-the-art automatic speech recognition (ASR) systems.
Degradation in performance across underrepresented accents is a severe deterrent to the inclusive adoption of ASR.
We propose a novel accent adaptation approach for end-to-end ASR systems using cross-attention with a trainable set of codebooks.
arXiv Detail & Related papers (2023-10-24T16:10:58Z)
- Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z)
- Deep Speech Based End-to-End Automated Speech Recognition (ASR) for Indian-English Accents [0.0]
We use a transfer learning approach to develop an end-to-end speech recognition system for Indian-English accents.
Indic TTS data of Indian-English accents is used for transfer learning and fine-tuning the pre-trained Deep Speech model.
arXiv Detail & Related papers (2022-04-03T03:11:21Z)
- Automatic Dialect Density Estimation for African American English [74.44807604000967]
We explore automatic prediction of dialect density of the African American English (AAE) dialect.
Dialect density is defined as the percentage of words in an utterance that contain characteristics of the non-standard dialect.
We show a significant correlation between our predicted and ground truth dialect density measures for AAE speech in this database.
arXiv Detail & Related papers (2022-04-03T01:34:48Z)
- A study on native American English speech recognition by Indian listeners with varying word familiarity level [62.14295630922855]
We collect three kinds of responses from each listener while they recognize an utterance.
From these transcriptions, word error rate (WER) is calculated and used as a metric to evaluate the similarity between the recognized and the original sentences.
Speaker-nativity-wise analysis shows that utterances from speakers of some nativities are more difficult for Indian listeners to recognize than those from other nativities.
arXiv Detail & Related papers (2021-12-08T07:43:38Z)
- English Accent Accuracy Analysis in a State-of-the-Art Automatic Speech Recognition System [3.4888132404740797]
We evaluate a state-of-the-art automatic speech recognition model, using unseen data from a corpus with a wide variety of labeled English accents.
We show that there is indeed an accuracy bias in terms of accentual variety, favoring the accents most prevalent in the training corpus.
arXiv Detail & Related papers (2021-05-09T08:24:33Z)
- Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation [125.59372403631006]
We propose a semi-supervised learning approach for multi-speaker text-to-speech (TTS).
A multi-speaker TTS model can learn from the untranscribed audio via the proposed encoder-decoder framework with discrete speech representation.
We found the model can benefit from the proposed semi-supervised learning approach even when part of the unpaired speech data is noisy.
arXiv Detail & Related papers (2020-05-16T15:47:11Z)