A Study of Gender Impact in Self-supervised Models for Speech-to-Text
Systems
- URL: http://arxiv.org/abs/2204.01397v1
- Date: Mon, 4 Apr 2022 11:28:19 GMT
- Title: A Study of Gender Impact in Self-supervised Models for Speech-to-Text
Systems
- Authors: Marcely Zanon Boito, Laurent Besacier, Natalia Tomashenko, Yannick Estève
- Abstract summary: We train and compare gender-specific wav2vec 2.0 models against models containing different degrees of gender balance in pre-training data.
We observe lower overall performance using gender-specific pre-training before fine-tuning an end-to-end ASR system.
- Score: 25.468558523679363
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Self-supervised models for speech processing have recently emerged
as popular foundation blocks in speech processing pipelines. These models are
pre-trained on unlabeled audio data and then used in downstream speech
processing tasks such as automatic speech recognition (ASR) or speech
translation (ST). Since these models are now used in research and industrial
systems alike, it becomes necessary to understand the impact of properties
such as the gender distribution of the pre-training data. Using French as our
investigation
language, we train and compare gender-specific wav2vec 2.0 models against
models containing different degrees of gender balance in their pre-training
data. The comparison is performed by applying these models to two
speech-to-text downstream tasks: ASR and ST. Our results show that the type of
downstream integration matters. We observe lower overall performance using
gender-specific pre-training before fine-tuning an end-to-end ASR system.
However, when self-supervised models are used as feature extractors, the
overall ASR and ST results follow more complex patterns, in which the balanced
pre-trained model is not necessarily the best option. Lastly, our crude
'fairness' metric, the relative performance difference measured between female
and male test sets, does not vary strongly between balanced and
gender-specific pre-trained wav2vec 2.0 models.
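As a concrete illustration, here is a minimal sketch of how the 'fairness' metric described above, the relative performance difference between female and male test sets, could be computed from ASR outputs. The normalization by the male-set WER and the use of the jiwer package are illustrative assumptions; the paper's exact definition may differ.

```python
# Minimal sketch of the relative female/male performance gap used as a
# 'fairness' metric in the abstract above. Assumes WER as the downstream
# metric and normalizes the gap by the male-set WER (an assumption; the
# paper's exact definition may differ).
import jiwer  # third-party WER package: pip install jiwer

def relative_gender_gap(refs_f, hyps_f, refs_m, hyps_m):
    """Relative WER difference between female and male test sets.

    A positive value means the system performs worse on female speech.
    """
    wer_f = jiwer.wer(refs_f, hyps_f)  # corpus-level WER, female test set
    wer_m = jiwer.wer(refs_m, hyps_m)  # corpus-level WER, male test set
    return (wer_f - wer_m) / wer_m

# Toy example with illustrative French utterances (not from the paper):
gap = relative_gender_gap(
    refs_f=["bonjour à tous"], hyps_f=["bonjour à vous"],
    refs_m=["merci beaucoup à tous"], hyps_m=["merci beaucoup a tous"],
)
print(f"relative female/male WER gap: {gap:+.2%}")  # +33.33% in this toy case
```

Keeping the metric relative (rather than an absolute WER difference) makes gaps comparable across systems whose overall error rates differ, which matters when contrasting fine-tuned and feature-extractor integrations.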
Related papers
- Generative Pre-training for Speech with Flow Matching [81.59952572752248]
We pre-trained a generative model, named SpeechFlow, on 60k hours of untranscribed speech with Flow Matching and masked conditions.
Experiment results show the pre-trained generative model can be fine-tuned with task-specific data to match or surpass existing expert models on speech enhancement, separation, and synthesis.
arXiv Detail & Related papers (2023-10-25T03:40:50Z)
- How to Estimate Model Transferability of Pre-Trained Speech Models? [84.11085139766108]
We propose a "score-based assessment" framework for estimating the transferability of pre-trained speech models (PSMs).
We leverage two representation theories, Bayesian likelihood estimation and optimal transport, to generate rank scores for the PSM candidates.
Our framework efficiently computes transferability scores without actual fine-tuning of candidate models or layers.
arXiv Detail & Related papers (2023-06-01T04:52:26Z)
- Analyzing Robustness of End-to-End Neural Models for Automatic Speech Recognition [11.489161072526677]
We investigate the robustness properties of pre-trained neural models for automatic speech recognition.
Specifically, we analyze wav2vec2, HuBERT, and DistilHuBERT on the LibriSpeech and TIMIT datasets.
arXiv Detail & Related papers (2022-08-17T20:00:54Z)
- Improving Low-Resource Speech Recognition with Pretrained Speech Models: Continued Pretraining vs. Semi-Supervised Training [6.523198497365586]
Self-supervised Transformer-based models, such as wav2vec 2.0 and HuBERT, have produced significant improvements over existing approaches to automatic speech recognition (ASR).
We show that continued pretraining (CoPT) results in word error rates (WERs) equal to or slightly better than those from semi-supervised training (SST).
In addition, we show that using the CoPT model for pseudo-labeling, and using these labels in SST, results in further WER improvements (a minimal pseudo-labeling sketch appears after this list).
arXiv Detail & Related papers (2022-07-01T21:02:51Z)
- A Complementary Joint Training Approach Using Unpaired Speech and Text for Low-Resource Automatic Speech Recognition [25.473191378558138]
We leverage unpaired data to train a general sequence-to-sequence model.
Inspired by the complementarity of speech/pseudo-label pairs and synthesized-audio/text pairs, we propose a complementary joint training (CJT) method.
arXiv Detail & Related papers (2022-04-05T07:02:53Z)
- ASR4REAL: An extended benchmark for speech models [19.348785785921446]
We introduce a set of benchmarks matching real-life conditions, aimed at spotting possible biases and weaknesses in models.
We find that even though recent models do not seem to exhibit a gender bias, they usually show large performance discrepancies across accents.
All tested models show a strong performance drop when tested on conversational speech.
arXiv Detail & Related papers (2021-10-16T14:34:25Z)
- Improving Gender Fairness of Pre-Trained Language Models without Catastrophic Forgetting [88.83117372793737]
Forgetting information in the original training data may damage the model's downstream performance by a large margin.
We propose GEnder Equality Prompt (GEEP) to improve gender fairness of pre-trained models with less forgetting.
arXiv Detail & Related papers (2021-10-11T15:52:16Z)
- An Exploration of Self-Supervised Pretrained Representations for End-to-End Speech Recognition [98.70304981174748]
We focus on general applications of pretrained speech representations to advanced end-to-end automatic speech recognition (E2E-ASR) models.
We select several pretrained speech representations and present the experimental results on various open-source and publicly available corpora for E2E-ASR.
arXiv Detail & Related papers (2021-10-09T15:06:09Z)
- Utilizing Self-supervised Representations for MOS Prediction [51.09985767946843]
Existing evaluations usually require clean references or parallel ground truth data.
Subjective tests, on the other hand, do not need any additional clean or parallel data and correlate better with human perception.
We develop an automatic evaluation approach that correlates well with human perception while not requiring ground truth data.
arXiv Detail & Related papers (2021-04-07T09:44:36Z)
- Semi-Supervised Spoken Language Understanding via Self-Supervised Speech and Language Model Pretraining [64.35907499990455]
We propose a framework to learn semantics directly from speech with semi-supervision from transcribed or untranscribed speech.
Our framework is built upon pretrained end-to-end (E2E) ASR and self-supervised language models, such as BERT.
In parallel, we identify two essential criteria for evaluating spoken language understanding (SLU) models: environmental noise-robustness and E2E semantics evaluation.
arXiv Detail & Related papers (2020-10-26T18:21:27Z)
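As referenced in the CoPT entry above, the pseudo-labeling step can be sketched as follows: a pretrained (or continued-pretrained) ASR model transcribes unlabeled audio, and the resulting transcripts become training targets for semi-supervised training. This is a minimal sketch assuming a Hugging Face wav2vec 2.0 CTC checkpoint and greedy decoding; the checkpoint name is illustrative and not taken from the paper.

```python
# Hedged sketch of pseudo-labeling with a pretrained wav2vec 2.0 CTC model:
# transcripts produced here would be paired with their audio and added to
# the supervised training set for semi-supervised training (SST).
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Illustrative public checkpoint; a CoPT model would be loaded the same way.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
model.eval()

def pseudo_label(waveform_16khz):
    """Greedy CTC transcription of one 16 kHz mono waveform (1-D float array)."""
    inputs = processor(waveform_16khz, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits  # (1, time, vocab)
    ids = torch.argmax(logits, dim=-1)              # greedy frame-wise argmax
    return processor.batch_decode(ids)[0]           # collapse repeats + blanks
```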
This list is automatically generated from the titles and abstracts of the papers on this site.