Fairness of Automatic Speech Recognition: Looking Through a Philosophical Lens
- URL: http://arxiv.org/abs/2508.07143v2
- Date: Wed, 13 Aug 2025 02:20:09 GMT
- Title: Fairness of Automatic Speech Recognition: Looking Through a Philosophical Lens
- Authors: Anna Seo Gyeong Choi, Hoon Choi,
- Abstract summary: We argue that systematic misrecognition of certain speech varieties constitutes more than a technical limitation.<n>We identify three unique ethical dimensions of speech technologies that differentiate ASR bias from other algorithmic fairness concerns.
- Score: 0.42970700836450487
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Automatic Speech Recognition (ASR) systems now mediate countless human-technology interactions, yet research on their fairness implications remains surprisingly limited. This paper examines ASR bias through a philosophical lens, arguing that systematic misrecognition of certain speech varieties constitutes more than a technical limitation -- it represents a form of disrespect that compounds historical injustices against marginalized linguistic communities. We distinguish between morally neutral classification (discriminate1) and harmful discrimination (discriminate2), demonstrating how ASR systems can inadvertently transform the former into the latter when they consistently misrecognize non-standard dialects. We identify three unique ethical dimensions of speech technologies that differentiate ASR bias from other algorithmic fairness concerns: the temporal burden placed on speakers of non-standard varieties ("temporal taxation"), the disruption of conversational flow when systems misrecognize speech, and the fundamental connection between speech patterns and personal/cultural identity. These factors create asymmetric power relationships that existing technical fairness metrics fail to capture. The paper analyzes the tension between linguistic standardization and pluralism in ASR development, arguing that current approaches often embed and reinforce problematic language ideologies. We conclude that addressing ASR bias requires more than technical interventions; it demands recognition of diverse speech varieties as legitimate forms of expression worthy of technological accommodation. This philosophical reframing offers new pathways for developing ASR systems that respect linguistic diversity and speaker autonomy.
Related papers
- Unheard in the Digital Age: Rethinking AI Bias and Speech Diversity [0.0]
Speech remains one of the most visible yet overlooked vectors of inclusion and exclusion in contemporary society.<n>This article focuses on the structural biases that shape perceptions of atypical speech and are now being encoded into artificial intelligence.
arXiv Detail & Related papers (2026-01-26T16:12:25Z) - WESR: Scaling and Evaluating Word-level Event-Speech Recognition [59.21814194620928]
Speech conveys not only linguistic information but also rich non-verbal vocal events such as laughing and crying.<n>We develop a refined taxonomy of 21 vocal events, with a new categorization into discrete (standalone) versus continuous (mixed with speech) types.<n>Based on the refined taxonomy, we introduce WESR-Bench, an expert-annotated evaluation set (900+ utterances) with a novel position-aware protocol.
arXiv Detail & Related papers (2026-01-08T02:23:21Z) - A Sociophonetic Analysis of Racial Bias in Commercial ASR Systems Using the Pacific Northwest English Corpus [6.361208877327219]
This paper presents a systematic evaluation of racial bias in four major commercial automatic speech recognition (ASR) systems.<n>We analyze transcription accuracy across speakers from four ethnic backgrounds (African American, Caucasian American, ChicanX, and Yakama) and examine how sociophonetic variation contributes to differential system performance.
arXiv Detail & Related papers (2025-10-26T02:19:40Z) - Towards Inclusive Communication: A Unified Framework for Generating Spoken Language from Sign, Lip, and Audio [52.859261069569165]
We propose the first unified framework capable of handling diverse combinations of sign language, lip movements, and audio for spoken-language text generation.<n>We focus on three main objectives: (i) designing a unified, modality-agnostic architecture capable of effectively processing heterogeneous inputs; (ii) exploring the underexamined synergy among modalities, particularly the role of lip movements as non-manual cues in sign language comprehension; and (iii) achieving performance on par with or better than state-of-the-art models specialized for individual tasks.
arXiv Detail & Related papers (2025-08-28T06:51:42Z) - Cross-linguistic disagreement as a conflict of semantic alignment norms in multilingual AI~Linguistic Diversity as a Problem for Philosophy, Cognitive Science, and AI~ [0.2443066828522608]
Cross-linguistic consistency (CL-consistency) seeks universal concepts across languages.<n>Folk-consistency, which respects language-specific semantic norms.<n>Findings challenge assumption that universal representations and cross-linguistic transfer capabilities are inherently desirable.
arXiv Detail & Related papers (2025-03-01T03:31:40Z) - Towards Unsupervised Speech Recognition Without Pronunciation Models [57.222729245842054]
In this article, we tackle the challenge of developing ASR systems without paired speech and text corpora.<n>We experimentally demonstrate that an unsupervised speech recognizer can emerge from joint speech-to-speech and text-to-text masked token-infilling.<n>This innovative model surpasses the performance of previous unsupervised ASR models under the lexicon-free setting.
arXiv Detail & Related papers (2024-06-12T16:30:58Z) - A Theory of Unsupervised Speech Recognition [60.12287608968879]
Unsupervised speech recognition (ASR-U) is the problem of learning automatic speech recognition systems from unpaired speech-only and text-only corpora.
We propose a general theoretical framework to study the properties of ASR-U systems based on random matrix theory and the theory of neural tangent kernels.
arXiv Detail & Related papers (2023-06-09T08:12:27Z) - A Vector Quantized Approach for Text to Speech Synthesis on Real-World
Spontaneous Speech [94.64927912924087]
We train TTS systems using real-world speech from YouTube and podcasts.
Recent Text-to-Speech architecture is designed for multiple code generation and monotonic alignment.
We show thatRecent Text-to-Speech architecture outperforms existing TTS systems in several objective and subjective measures.
arXiv Detail & Related papers (2023-02-08T17:34:32Z) - Hey ASR System! Why Aren't You More Inclusive? Automatic Speech
Recognition Systems' Bias and Proposed Bias Mitigation Techniques. A
Literature Review [0.0]
We present research that addresses ASR biases against gender, race, and the sick and disabled.
We also discuss techniques for designing a more accessible and inclusive ASR technology.
arXiv Detail & Related papers (2022-11-17T13:15:58Z) - Language technology practitioners as language managers: arbitrating data
bias and predictive bias in ASR [0.0]
We use the lens of language policy to analyse how current practices in training and testing ASR systems in industry lead to the data bias giving rise to these systematic error differences.
We propose a re-framing of language resources as (public) infrastructure which should not solely be designed for markets, but for, and with meaningful cooperation of, speech communities.
arXiv Detail & Related papers (2022-02-25T10:37:52Z) - Spectro-Temporal Deep Features for Disordered Speech Assessment and
Recognition [65.25325641528701]
Motivated by the spectro-temporal level differences between disordered and normal speech that systematically manifest in articulatory imprecision, decreased volume and clarity, slower speaking rates and increased dysfluencies, novel spectro-temporal subspace basis embedding deep features derived by SVD decomposition of speech spectrum are proposed.
Experiments conducted on the UASpeech corpus suggest the proposed spectro-temporal deep feature adapted systems consistently outperformed baseline i- adaptation by up to 263% absolute (8.6% relative) reduction in word error rate (WER) with or without data augmentation.
arXiv Detail & Related papers (2022-01-14T16:56:43Z) - Quantifying Bias in Automatic Speech Recognition [28.301997555189462]
This paper quantifies the bias of a Dutch SotA ASR system against gender, age, regional accents and non-native accents.
Based on our findings, we suggest bias mitigation strategies for ASR development.
arXiv Detail & Related papers (2021-03-28T12:52:03Z) - My Teacher Thinks The World Is Flat! Interpreting Automatic Essay
Scoring Mechanism [71.34160809068996]
Recent work shows that automated scoring systems are prone to even common-sense adversarial samples.
We utilize recent advances in interpretability to find the extent to which features such as coherence, content and relevance are important for automated scoring mechanisms.
We also find that since the models are not semantically grounded with world-knowledge and common sense, adding false facts such as the world is flat'' actually increases the score instead of decreasing it.
arXiv Detail & Related papers (2020-12-27T06:19:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.