Toward Fairness in Speech Recognition: Discovery and mitigation of
performance disparities
- URL: http://arxiv.org/abs/2207.11345v1
- Date: Fri, 22 Jul 2022 21:33:29 GMT
- Title: Toward Fairness in Speech Recognition: Discovery and mitigation of
performance disparities
- Authors: Pranav Dheram, Murugesan Ramakrishnan, Anirudh Raju, I-Fan Chen, Brian
King, Katherine Powell, Melissa Saboowala, Karan Shetty, Andreas Stolcke
- Abstract summary: We report on initial findings with both discovery and mitigation of performance disparities using data from a product-scale AI assistant speech recognition system.
For fairness mitigation, we find that oversampling of underrepresented cohorts, as well as modeling speaker cohort membership by additional input variables, reduces the gap between top- and bottom-performing cohorts.
- Score: 10.917512121301135
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As for other forms of AI, speech recognition has recently been examined with
respect to performance disparities across different user cohorts. One approach
to achieve fairness in speech recognition is to (1) identify speaker cohorts
that suffer from subpar performance and (2) apply fairness mitigation measures
targeting the cohorts discovered. In this paper, we report on initial findings
with both discovery and mitigation of performance disparities using data from a
product-scale AI assistant speech recognition system. We compare cohort
discovery based on geographic and demographic information to a more scalable
method that groups speakers without human labels, using speaker embedding
technology. For fairness mitigation, we find that oversampling of
underrepresented cohorts, as well as modeling speaker cohort membership by
additional input variables, reduces the gap between top- and bottom-performing
cohorts, without deteriorating overall recognition accuracy.
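As a concrete reading of the two ideas above, here is a minimal sketch, assuming speakers are represented by fixed-dimensional embeddings: cohorts are discovered by unsupervised clustering (the scalable, label-free route the abstract describes), and underrepresented cohorts are then oversampled with inverse-frequency weights. The cluster count, embedding size, and choice of k-means are illustrative assumptions, not the production system's configuration.

```python
import numpy as np
from sklearn.cluster import KMeans

def discover_cohorts(speaker_embeddings: np.ndarray, n_cohorts: int = 8) -> np.ndarray:
    """Group speakers into cohorts without human labels by clustering
    fixed-dimensional speaker embeddings."""
    km = KMeans(n_clusters=n_cohorts, n_init=10, random_state=0)
    return km.fit_predict(speaker_embeddings)

def oversampling_weights(cohort_ids: np.ndarray) -> np.ndarray:
    """Inverse-frequency weights so underrepresented cohorts are drawn
    more often when sampling training utterances."""
    counts = np.bincount(cohort_ids)
    w = 1.0 / counts[cohort_ids]
    return w / w.sum()

# Illustrative use: 1000 speakers, 256-dim embeddings (both assumptions).
emb = np.random.randn(1000, 256)
cohorts = discover_cohorts(emb)
p = oversampling_weights(cohorts)
batch = np.random.choice(len(emb), size=32, replace=True, p=p)  # oversampled draw
```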
Related papers
- DENOASR: Debiasing ASRs through Selective Denoising [5.544079217915537]
We introduce DENOASR, a novel framework that selectively denoises speech to reduce the disparity in word error rates between two gender groups.
We find that a combination of two popular speech denoising techniques, DEMUCS and LE, can be effectively used to mitigate ASR disparity without compromising overall recognition performance.
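The summary does not spell out how "selective" denoising decides which utterances to process; a plausible reading is a gate that denoises only inputs judged noisy, since denoising clean speech can itself hurt ASR. The sketch below is a hypothetical illustration: the threshold, the energy-based score, and the pass-through logic are all assumptions, and `denoiser` stands in for a real model such as DEMUCS.

```python
import numpy as np

def noisiness_score(waveform: np.ndarray, frame: int = 400) -> float:
    """Crude stand-in for a noise estimate: fraction of frames whose energy
    is far below the peak (assumes at least one full frame of audio).
    A real system would use a trained SNR or quality estimator."""
    usable = len(waveform) // frame * frame
    energy = (waveform[:usable].reshape(-1, frame) ** 2).mean(axis=1)
    return float((energy < 0.1 * energy.max()).mean())

def selective_denoise(waveform: np.ndarray, denoiser, threshold: float = 0.5):
    """Denoise only utterances that look noisy; pass apparently clean speech
    through untouched, since denoising can itself introduce artefacts."""
    if noisiness_score(waveform) > threshold:
        return denoiser(waveform)
    return waveform
```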
arXiv Detail & Related papers (2024-10-22T05:39:24Z) - Layer-Wise Analysis of Self-Supervised Acoustic Word Embeddings: A Study
on Speech Emotion Recognition [54.952250732643115]
We study Acoustic Word Embeddings (AWEs), fixed-length features derived from continuous frame-level representations, to explore their advantages in specific tasks.
AWEs have previously shown utility in capturing acoustic discriminability.
Our findings underscore the acoustic context conveyed by AWEs and demonstrate highly competitive speech emotion recognition accuracy.
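For context, a common way to construct an AWE is to pool the frame-level features of a word segment from a chosen layer into a fixed-length vector; the mean pooling below is one standard choice and an assumption here, not necessarily the construction the paper analyses.

```python
import numpy as np

def acoustic_word_embedding(frame_features: np.ndarray) -> np.ndarray:
    """Collapse a variable-length (T, D) sequence of frame-level features
    for one word segment into a fixed-length (D,) embedding."""
    return frame_features.mean(axis=0)

# Words of different durations map to same-sized vectors (768 dims assumed):
awe_short = acoustic_word_embedding(np.random.randn(23, 768))  # 23 frames
awe_long = acoustic_word_embedding(np.random.randn(71, 768))   # 71 frames
assert awe_short.shape == awe_long.shape == (768,)
```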
arXiv Detail & Related papers (2024-02-04T21:24:54Z) - A New Benchmark of Aphasia Speech Recognition and Detection Based on
E-Branchformer and Multi-task Learning [29.916793641951507]
This paper presents a new benchmark for aphasia speech recognition and detection using state-of-the-art speech recognition techniques.
We introduce two multi-task learning methods based on the CTC/Attention architecture to perform both tasks simultaneously.
Our system achieves state-of-the-art speaker-level detection accuracy (97.3%) and a relative WER reduction of 11% for moderate aphasia patients.
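For reference, the hybrid CTC/Attention objective this entry builds on interpolates the two losses with a mixing weight. The PyTorch sketch below shows only that standard ASR objective; the weight value and tensor shapes are assumptions, and the paper's detection task would add a further term.

```python
import torch
import torch.nn.functional as F

def joint_ctc_attention_loss(ctc_log_probs,   # (T, N, C), log-softmaxed encoder outputs
                             ctc_targets,     # (N, S) padded label ids
                             input_lens, target_lens,
                             att_logits,      # (N, L, C) decoder logits
                             att_targets,     # (N, L) label ids, -100 = padding
                             ctc_weight: float = 0.3):
    """Hybrid CTC/attention objective: a weighted sum of the alignment-free
    CTC loss and the autoregressive attention (cross-entropy) loss."""
    ctc = F.ctc_loss(ctc_log_probs, ctc_targets, input_lens, target_lens)
    att = F.cross_entropy(att_logits.transpose(1, 2), att_targets)
    return ctc_weight * ctc + (1.0 - ctc_weight) * att
```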
arXiv Detail & Related papers (2023-05-19T15:10:36Z) - In search of strong embedding extractors for speaker diarisation [49.7017388682077]
We tackle two key problems when adopting embedding extractors (EEs) for speaker diarisation.
First, evaluation is not straightforward, because the features required for good performance differ between speaker verification and diarisation; we show that better performance on widely adopted speaker verification evaluation protocols does not lead to better diarisation performance.
Second, EEs are typically not trained on overlapped speech or speaker changes; we propose two data augmentation techniques that make them aware of such input.
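The two augmentations named here, overlapped speech and speaker change, can be sketched directly on waveforms. The mixing level and splicing scheme below are illustrative assumptions rather than the paper's exact recipe.

```python
import numpy as np

def mix_overlap(utt_a: np.ndarray, utt_b: np.ndarray, snr_db: float = 0.0) -> np.ndarray:
    """Simulate overlapped speech: mix a second speaker into an utterance
    at a target level so the extractor sees overlap during training."""
    n = min(len(utt_a), len(utt_b))
    a, b = utt_a[:n], utt_b[:n]
    gain = np.sqrt((a ** 2).mean() / (10 ** (snr_db / 10) * (b ** 2).mean() + 1e-8))
    return a + gain * b

def splice_speaker_change(utt_a: np.ndarray, utt_b: np.ndarray) -> np.ndarray:
    """Simulate a speaker change by concatenating segments from two speakers."""
    return np.concatenate([utt_a, utt_b])
```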
arXiv Detail & Related papers (2022-10-26T13:00:29Z) - Audio-visual multi-channel speech separation, dereverberation and
recognition [70.34433820322323]
This paper proposes an audio-visual multi-channel speech separation, dereverberation and recognition approach.
The advantage of the additional visual modality over using audio only is demonstrated on two neural dereverberation approaches.
Experiments conducted on the LRS2 dataset suggest that the proposed audio-visual multi-channel speech separation, dereverberation and recognition system outperforms the baseline.
arXiv Detail & Related papers (2022-04-05T04:16:03Z) - A Comparative Study on Speaker-attributed Automatic Speech Recognition
in Multi-party Meetings [53.120885867427305]
Three approaches are evaluated for speaker-attributed automatic speech recognition (SA-ASR) in a meeting scenario.
The WD-SOT approach achieves a 10.7% relative reduction in averaged speaker-dependent character error rate (SD-CER) over the FD-SOT approach.
The TS-ASR approach also outperforms the FD-SOT approach, bringing a 16.5% relative average SD-CER reduction.
arXiv Detail & Related papers (2022-03-31T06:39:14Z) - Self-supervised Speaker Recognition Training Using Human-Machine
Dialogues [22.262550043863445]
We investigate how to pretrain speaker recognition models by leveraging dialogues between customers and smart-speaker devices.
We propose an effective rejection mechanism that selectively learns from dialogues based on their acoustic homogeneity.
Experiments demonstrate that the proposed method provides significant performance improvements over earlier work.
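One plausible instantiation of such a rejection mechanism scores a dialogue's acoustic homogeneity as the mean pairwise cosine similarity of its turn-level speaker embeddings and discards dialogues below a threshold; the scoring function and threshold below are assumptions for illustration.

```python
import numpy as np

def dialogue_homogeneity(turn_embeddings: np.ndarray) -> float:
    """Mean pairwise cosine similarity of turn-level speaker embeddings;
    high values suggest the turns come from a single speaker."""
    x = turn_embeddings / np.linalg.norm(turn_embeddings, axis=1, keepdims=True)
    n = len(x)
    if n < 2:
        return 1.0
    sim = x @ x.T
    return float((sim.sum() - n) / (n * (n - 1)))  # mean of off-diagonal entries

def accept_for_training(turn_embeddings: np.ndarray, threshold: float = 0.7) -> bool:
    """Reject acoustically inhomogeneous dialogues (likely multiple speakers)."""
    return dialogue_homogeneity(turn_embeddings) >= threshold
```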
arXiv Detail & Related papers (2022-02-07T19:44:54Z) - Speaker Embedding-aware Neural Diarization for Flexible Number of
Speakers with Textual Information [55.75018546938499]
We propose the speaker embedding-aware neural diarization (SEND) method, which predicts power-set-encoded speaker labels.
Our method achieves a lower diarization error rate than target-speaker voice activity detection.
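Power-set encoding turns multi-speaker activity, including overlap, into a single categorical label per frame: with N speakers there are 2**N possible active-speaker subsets. The sketch below illustrates that general encoding, not SEND's exact label construction.

```python
from itertools import combinations

def power_set_labels(max_speakers: int) -> dict:
    """Enumerate the power set of active-speaker combinations so overlap
    becomes a single-label classification problem per frame."""
    labels = [()]  # empty set: silence, no active speaker
    for k in range(1, max_speakers + 1):
        labels.extend(combinations(range(max_speakers), k))
    return {combo: idx for idx, combo in enumerate(labels)}

# With 3 speakers there are 2**3 = 8 classes:
# (), (0,), (1,), (2,), (0,1), (0,2), (1,2), (0,1,2)
mapping = power_set_labels(3)
assert len(mapping) == 8
```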
arXiv Detail & Related papers (2021-11-28T12:51:04Z) - Improving Fairness in Speaker Recognition [4.94706680113206]
We investigate performance disparities across demographic groups in state-of-the-art deep speaker recognition systems.
We show that models trained with demographically-balanced training sets exhibit a fairer behavior on different groups, while still being accurate.
arXiv Detail & Related papers (2021-04-29T01:08:53Z) - UniSpeech: Unified Speech Representation Learning with Labeled and
Unlabeled Data [54.733889961024445]
We propose a unified pre-training approach called UniSpeech to learn speech representations with both unlabeled and labeled data.
We evaluate the effectiveness of UniSpeech for cross-lingual representation learning on the public CommonVoice corpus.
arXiv Detail & Related papers (2021-01-19T12:53:43Z) - Robust Speaker Recognition Using Speech Enhancement And Attention Model [37.33388614967888]
Instead of processing speech enhancement and speaker recognition separately, the two modules are integrated into one framework and jointly optimised using deep neural networks.
To increase robustness against noise, a multi-stage attention mechanism is employed to highlight speaker-related features learned from contextual information in the time and frequency domains.
The results show that the proposed approach, using speech enhancement and multi-stage attention models, outperforms two strong baselines without these components under most acoustic conditions in our experiments.
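One way to read a multi-stage attention mechanism over time and frequency is as a stack of gating modules that re-weight a time-frequency feature map before the speaker-embedding layers. The PyTorch sketch below is a hypothetical single stage under that reading, not the paper's architecture.

```python
import torch
import torch.nn as nn

class FreqAttentionStage(nn.Module):
    """One attention stage: learn per-bin gates in [0, 1] that emphasise
    speaker-relevant frequency regions of a (batch, time, freq) feature map.
    Stacking several such stages yields a multi-stage attention pipeline."""
    def __init__(self, freq_bins: int):
        super().__init__()
        self.score = nn.Linear(freq_bins, freq_bins)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        gates = torch.sigmoid(self.score(feats))  # per-bin attention gates
        return feats * gates

# Joint optimisation would backpropagate a combined loss
# (enhancement loss + speaker-classification loss) through such stages.
stage = FreqAttentionStage(80)
out = stage(torch.randn(4, 100, 80))  # (batch, time, mel bins)
```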
arXiv Detail & Related papers (2020-01-14T20:03:07Z)