"It's not a representation of me": Examining Accent Bias and Digital Exclusion in Synthetic AI Voice Services
- URL: http://arxiv.org/abs/2504.09346v1
- Date: Sat, 12 Apr 2025 21:31:22 GMT
- Title: "It's not a representation of me": Examining Accent Bias and Digital Exclusion in Synthetic AI Voice Services
- Authors: Shira Michel, Sufi Kaur, Sarah Elizabeth Gillespie, Jeffrey Gleason, Christo Wilson, Avijit Ghosh,
- Abstract summary: This study evaluates two synthetic AI voice services (Speechify and ElevenLabs) through a mixed methods approach.<n>Our findings reveal technical performance disparities across five regional, English-language accents.<n>Current speech generation technologies may inadvertently reinforce linguistic privilege and accent-based discrimination.
- Score: 3.8931913630405393
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in artificial intelligence (AI) speech generation and voice cloning technologies have produced naturalistic speech and accurate voice replication, yet their influence on sociotechnical systems across diverse accents and linguistic traits is not fully understood. This study evaluates two synthetic AI voice services (Speechify and ElevenLabs) through a mixed methods approach using surveys and interviews to assess technical performance and uncover how users' lived experiences influence their perceptions of accent variations in these speech technologies. Our findings reveal technical performance disparities across five regional, English-language accents and demonstrate how current speech generation technologies may inadvertently reinforce linguistic privilege and accent-based discrimination, potentially creating new forms of digital exclusion. Overall, our study highlights the need for inclusive design and regulation by providing actionable insights for developers, policymakers, and organizations to ensure equitable and socially responsible AI speech technologies.
Related papers
- Generative Adversarial Network based Voice Conversion: Techniques, Challenges, and Recent Advancements [12.716872085463887]
generative adversarial network (GAN)-based approaches have drawn considerable attention for their powerful feature-mapping capabilities and potential to produce highly realistic speech.
This systematic review presents a comprehensive analysis of the voice conversion landscape, highlighting key techniques, key challenges, and the transformative impact of GANs in the field.
Overall, this work serves as an essential resource for researchers, developers, and practitioners aiming to advance the state-of-the-art (SOTA) in voice conversion technology.
arXiv Detail & Related papers (2025-04-27T11:22:21Z) - Enhancing nonnative speech perception and production through an AI-powered application [0.0]
The aim of this study is to examine the impact of training with an AI-powered mobile application on nonnative sound perception and production.<n>The intervention involved training with the Speakometer mobile application, which incorporated recording tasks featuring the English vowels, along with pronunciation feedback and practice.<n>The results revealed significant improvements in both discrimination accuracy and production of the target contrast following the intervention.
arXiv Detail & Related papers (2025-03-18T10:05:12Z) - Afrispeech-Dialog: A Benchmark Dataset for Spontaneous English Conversations in Healthcare and Beyond [0.0]
We introduce Afrispeech-Dialog, a benchmark dataset of 50 simulated medical and non-medical African-accented English conversations.<n>We assess state-of-the-art (SOTA) speaker diarization and ASR systems on long-form, accented speech, comparing their performance with native accents and discover a 10%+ performance degradation.
arXiv Detail & Related papers (2025-02-06T10:33:07Z) - WavChat: A Survey of Spoken Dialogue Models [66.82775211793547]
Recent advancements in spoken dialogue models, exemplified by systems like GPT-4o, have captured significant attention in the speech domain.
These advanced spoken dialogue models not only comprehend audio, music, and other speech-related features, but also capture stylistic and timbral characteristics in speech.
Despite the progress in spoken dialogue systems, there is a lack of comprehensive surveys that systematically organize and analyze these systems.
arXiv Detail & Related papers (2024-11-15T04:16:45Z) - Where are we in audio deepfake detection? A systematic analysis over generative and detection models [59.09338266364506]
SONAR is a synthetic AI-Audio Detection Framework and Benchmark.<n>It provides a comprehensive evaluation for distinguishing cutting-edge AI-synthesized auditory content.<n>It is the first framework to uniformly benchmark AI-audio detection across both traditional and foundation model-based detection systems.
arXiv Detail & Related papers (2024-10-06T01:03:42Z) - Accent Conversion in Text-To-Speech Using Multi-Level VAE and Adversarial Training [14.323313455208183]
Inclusive speech technology aims to erase any biases towards specific groups, such as people of certain accent.
We propose a TTS model that utilizes a Multi-Level Variational Autoencoder with adversarial learning to address accented speech synthesis and conversion.
arXiv Detail & Related papers (2024-06-03T05:56:02Z) - Contextual-Utterance Training for Automatic Speech Recognition [65.4571135368178]
We propose a contextual-utterance training technique which makes use of the previous and future contextual utterances.
Also, we propose a dual-mode contextual-utterance training technique for streaming automatic speech recognition (ASR) systems.
The proposed technique is able to reduce both the WER and the average last token emission latency by more than 6% and 40ms relative.
arXiv Detail & Related papers (2022-10-27T08:10:44Z) - Building African Voices [125.92214914982753]
This paper focuses on speech synthesis for low-resourced African languages.
We create a set of general-purpose instructions on building speech synthesis systems with minimum technological resources.
We release the speech data, code, and trained voices for 12 African languages to support researchers and developers.
arXiv Detail & Related papers (2022-07-01T23:28:16Z) - State-of-the-art in speaker recognition [0.0]
Recent advances in speech technologies have produced new tools to improve speaker recognition.
Speaker recognition is far away from being a technology where all the possibilities have already been explored.
arXiv Detail & Related papers (2022-02-23T11:49:09Z) - A Review of Speaker Diarization: Recent Advances with Deep Learning [78.20151731627958]
Speaker diarization is a task to label audio or video recordings with classes corresponding to speaker identity.
With the rise of deep learning technology, more rapid advancements have been made for speaker diarization.
We discuss how speaker diarization systems have been integrated with speech recognition applications.
arXiv Detail & Related papers (2021-01-24T01:28:05Z) - An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and
Separation [57.68765353264689]
Speech enhancement and speech separation are two related tasks.
Traditionally, these tasks have been tackled using signal processing and machine learning techniques.
Deep learning has been exploited to achieve strong performance.
arXiv Detail & Related papers (2020-08-21T17:24:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.