Towards measuring fairness in speech recognition: Fair-Speech dataset
- URL: http://arxiv.org/abs/2408.12734v1
- Date: Thu, 22 Aug 2024 20:55:17 GMT
- Title: Towards measuring fairness in speech recognition: Fair-Speech dataset
- Authors: Irina-Elena Veliche, Zhuangqun Huang, Vineeth Ayyat Kochaniyan, Fuchun Peng, Ozlem Kalinli, Michael L. Seltzer
- Abstract summary: This paper introduces a novel dataset, Fair-Speech, a publicly released corpus to help researchers evaluate their ASR models for accuracy across a diverse set of self-reported demographic information.
Our dataset includes approximately 26.5K utterances of recorded speech from 593 people in the United States, who were paid to record and submit audio of themselves saying voice commands.
- Score: 14.703638352216132
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current public datasets for automatic speech recognition (ASR) tend not to focus specifically on fairness, such as performance across different demographic groups. This paper introduces Fair-Speech, a novel, publicly released corpus that helps researchers evaluate their ASR models for accuracy across a diverse set of self-reported demographic attributes: age, gender, ethnicity, geographic variation and whether the participants consider themselves native English speakers. Our dataset includes approximately 26.5K utterances of recorded speech from 593 people in the United States, who were paid to record and submit audio of themselves saying voice commands. We also provide ASR baselines, including models trained on transcribed and untranscribed social media videos, as well as open-source models.
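One natural use of such a corpus is to compute word error rate (WER) separately per demographic group and compare the results. Below is a minimal sketch of that evaluation loop using the jiwer library; the field names (`group`, `reference`, `hypothesis`) are hypothetical placeholders, not Fair-Speech's actual schema.

```python
from collections import defaultdict

import jiwer  # pip install jiwer

def wer_by_group(utterances):
    """Compute word error rate separately for each demographic group.

    `utterances` is a list of dicts with hypothetical keys:
    'group' (e.g. a self-reported age band), 'reference' (the true
    transcript) and 'hypothesis' (the ASR output).
    """
    refs, hyps = defaultdict(list), defaultdict(list)
    for utt in utterances:
        refs[utt["group"]].append(utt["reference"])
        hyps[utt["group"]].append(utt["hypothesis"])
    # jiwer pools errors over all utterances in a group, so groups are
    # weighted by their total number of reference words.
    return {g: jiwer.wer(refs[g], hyps[g]) for g in refs}

# Toy usage with made-up data:
sample = [
    {"group": "18-25", "reference": "turn on the lights", "hypothesis": "turn on the light"},
    {"group": "66+", "reference": "call my daughter", "hypothesis": "call my door"},
]
print(wer_by_group(sample))  # {'18-25': 0.25, '66+': 0.333...}
```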
Related papers
- Spoken Stereoset: On Evaluating Social Bias Toward Speaker in Speech Large Language Models [50.40276881893513]
This study introduces Spoken Stereoset, a dataset specifically designed to evaluate social biases in Speech Large Language Models (SLLMs).
By examining how different models respond to speech from diverse demographic groups, we aim to identify these biases.
The findings indicate that while most models show minimal bias, some still exhibit slightly stereotypical or anti-stereotypical tendencies.
arXiv Detail & Related papers (2024-08-14T16:55:06Z)
- EARS: An Anechoic Fullband Speech Dataset Benchmarked for Speech Enhancement and Dereverberation [83.29199726650899]
The EARS dataset comprises 107 speakers from diverse backgrounds, totaling 100 hours of clean, anechoic speech data.
The dataset covers a large range of different speaking styles, including emotional speech, different reading styles, non-verbal sounds, and conversational freeform speech.
We benchmark various methods for speech enhancement and dereverberation on the dataset and evaluate their performance through a set of instrumental metrics.
arXiv Detail & Related papers (2024-06-10T11:28:29Z)
- Sonos Voice Control Bias Assessment Dataset: A Methodology for Demographic Bias Assessment in Voice Assistants [10.227469020901232]
This paper introduces the Sonos Voice Control Bias Assessment dataset.
It covers 1,038 speakers, 166 hours of audio and 170k audio samples, with 9,040 unique labelled transcripts.
Results show statistically significant differences in performance across age, dialectal region and ethnicity (a generic significance-testing sketch follows this entry).
arXiv Detail & Related papers (2024-05-14T12:53:32Z)
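The Sonos entry above reports statistically significant performance gaps across demographic groups. One generic way to check whether a WER gap between two groups is significant is a permutation test over pooled per-utterance error counts, sketched below; this is an illustration, not necessarily the statistical methodology used in that paper.

```python
import numpy as np

def permutation_test_wer(errors_a, words_a, errors_b, words_b,
                         n_perm=10_000, seed=0):
    """Two-sided permutation test for a WER gap between two speaker groups.

    `errors_*` and `words_*` are numpy arrays of per-utterance word-error
    counts and reference word counts.
    """
    rng = np.random.default_rng(seed)
    errors = np.concatenate([errors_a, errors_b])
    words = np.concatenate([words_a, words_b])
    n_a = len(errors_a)

    def gap(idx_a, idx_b):
        # Pooled WER of each side of the split, then their difference.
        return errors[idx_a].sum() / words[idx_a].sum() \
             - errors[idx_b].sum() / words[idx_b].sum()

    observed = errors_a.sum() / words_a.sum() - errors_b.sum() / words_b.sum()
    count = 0
    for _ in range(n_perm):
        # Shuffle group labels and recompute the gap under the null.
        perm = rng.permutation(len(errors))
        if abs(gap(perm[:n_a], perm[n_a:])) >= abs(observed):
            count += 1
    return observed, count / n_perm  # (WER gap, p-value)

# Toy usage with made-up error/word counts:
obs, p = permutation_test_wer(np.array([2, 0, 1]), np.array([5, 4, 6]),
                              np.array([0, 1]), np.array([5, 5]),
                              n_perm=1000)
```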
- A Deep Dive into the Disparity of Word Error Rates Across Thousands of NPTEL MOOC Videos [4.809236881780707]
We describe the curation of a massive speech dataset of 8,740 hours, consisting of ~9.8K technical lectures in English along with their transcripts, delivered by instructors representing diverse parts of Indian demography.
We use the curated dataset to measure the existing disparity in YouTube Automatic Captions and OpenAI Whisper model performance across the diverse demographic traits of speakers in India.
arXiv Detail & Related papers (2023-07-20T05:03:00Z)
- The Casual Conversations v2 Dataset [6.439761523935614]
The dataset includes 26,467 videos of 5,567 unique paid participants, with an average of almost 5 videos per person.
The participants agreed for their data to be used in assessing fairness of AI models and provided self-reported age, gender, language/dialect, disability status, physical adornments, physical attributes and geo-location information.
arXiv Detail & Related papers (2023-03-08T19:17:05Z)
- ASR Bundestag: A Large-Scale political debate dataset in German [0.0]
We present ASR Bundestag, a dataset for automatic speech recognition in German.
The dataset consists of 610 hours of aligned audio-transcript pairs for supervised training as well as 1,038 hours of unlabeled audio snippets for self-supervised learning.
arXiv Detail & Related papers (2023-02-12T21:45:18Z)
- ASR2K: Speech Recognition for Around 2000 Languages without Audio [100.41158814934802]
We present a speech recognition pipeline that does not require any audio for the target language.
Our pipeline consists of three components: acoustic, pronunciation, and language models.
We build speech recognition for 1,909 languages by combining the pipeline with Crubadan, a large n-gram database of endangered languages (a toy language-model sketch follows this entry).
arXiv Detail & Related papers (2022-09-06T22:48:29Z)
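For the ASR2K pipeline above, the language-model component can in principle be estimated from n-gram statistics alone. The toy sketch below builds an add-one-smoothed bigram model from raw (bigram, count) pairs, in the spirit of consuming a resource like Crubadan; the input format is hypothetical and this is not the paper's implementation.

```python
import math
from collections import defaultdict

class BigramLM:
    """Toy add-one-smoothed bigram language model built from raw counts."""

    def __init__(self, bigram_counts):
        # bigram_counts: dict mapping (w1, w2) -> count, e.g. derived from
        # an n-gram database such as Crubadan (format here is hypothetical).
        self.bigrams = dict(bigram_counts)
        self.unigrams = defaultdict(int)
        self.vocab = set()
        for (w1, w2), c in bigram_counts.items():
            self.unigrams[w1] += c
            self.vocab.update((w1, w2))

    def logprob(self, w1, w2):
        # Add-one (Laplace) smoothing gives unseen bigrams nonzero mass.
        num = self.bigrams.get((w1, w2), 0) + 1
        den = self.unigrams[w1] + len(self.vocab)
        return math.log(num / den)

    def score(self, words):
        # Log-probability of a word sequence under the bigram model.
        return sum(self.logprob(a, b) for a, b in zip(words, words[1:]))

# Toy usage:
lm = BigramLM({("i", "am"): 3, ("am", "here"): 1, ("i", "was"): 2})
print(lm.score(["i", "am", "here"]))
```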
- Automatic Dialect Density Estimation for African American English [74.44807604000967]
We explore automatic prediction of dialect density of the African American English (AAE) dialect.
Dialect density is defined as the percentage of words in an utterance that carry characteristics of the non-standard dialect (see the sketch after this entry).
We show a significant correlation between our predicted and ground truth dialect density measures for AAE speech in this database.
arXiv Detail & Related papers (2022-04-03T01:34:48Z)
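Taken literally, the dialect-density definition above reduces to a simple ratio. A minimal sketch follows, assuming a hypothetical precomputed set of flagged word forms; real AAE dialect-density annotation considers morphosyntactic and phonological features, not just surface word identity.

```python
def dialect_density(utterance, flagged_words):
    """Fraction of words in the utterance that carry dialect features.

    `flagged_words` is a hypothetical precomputed set of word forms;
    actual annotation schemes mark features in context rather than
    matching a static word list.
    """
    words = utterance.lower().split()
    if not words:
        return 0.0
    return sum(w in flagged_words for w in words) / len(words)

# Toy usage:
print(dialect_density("he finna go to the store", {"finna"}))  # 1/6 ~ 0.167
```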
- ASR data augmentation in low-resource settings using cross-lingual multi-speaker TTS and cross-lingual voice conversion [49.617722668505834]
We show that our approach permits the application of speech synthesis and voice conversion to improve ASR systems, and that promising training results can be obtained with our data augmentation method using only a single real speaker in the target language.
arXiv Detail & Related papers (2022-03-29T11:55:30Z)
- Personalized Automatic Speech Recognition Trained on Small Disordered Speech Datasets [0.0]
We trained personalized models for 195 individuals with different types and severities of speech impairment.
For the home automation scenario, 79% of speakers reached the target WER with 18-20 minutes of speech; even with only 3-4 minutes, 63% of speakers reached it.
arXiv Detail & Related papers (2021-10-09T17:11:17Z)
- Unsupervised Cross-lingual Representation Learning for Speech Recognition [63.85924123692923]
XLSR learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.
We build on wav2vec 2.0, which is trained by solving a contrastive task over masked latent speech representations (see the sketch after this list).
Experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.
arXiv Detail & Related papers (2020-06-24T18:25:05Z)
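To make the contrastive task mentioned in the XLSR entry concrete, here is a simplified PyTorch sketch of a wav2vec 2.0-style InfoNCE objective over masked latent representations. The real model additionally quantizes the targets, adds a codebook-diversity loss and samples distractors in batched form; this is an illustration, not the released implementation.

```python
import torch
import torch.nn.functional as F

def masked_contrastive_loss(context, targets, num_distractors=10,
                            temperature=0.1):
    """Simplified wav2vec 2.0-style InfoNCE loss.

    context: (T, D) context-network outputs at masked time steps.
    targets: (T, D) latent representations of the true speech at the
             same steps (quantized in the real model).
    For each masked step, the model must pick the true latent out of
    distractors drawn from other masked steps.
    """
    T = context.shape[0]
    losses = []
    for t in range(T):
        # Sample distractor latents from the other masked time steps.
        others = torch.tensor([i for i in range(T) if i != t])
        idx = others[torch.randperm(T - 1)[:num_distractors]]
        candidates = torch.cat([targets[t:t + 1], targets[idx]])  # (K+1, D)
        # Cosine similarity between the context vector and each candidate.
        logits = F.cosine_similarity(context[t:t + 1], candidates) / temperature
        # The true target sits at index 0 of the candidate list.
        losses.append(F.cross_entropy(logits.unsqueeze(0),
                                      torch.zeros(1, dtype=torch.long)))
    return torch.stack(losses).mean()

# Toy usage with random features:
loss = masked_contrastive_loss(torch.randn(32, 256), torch.randn(32, 256))
```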
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of the information and is not responsible for any consequences arising from its use.