ASR4REAL: An extended benchmark for speech models
- URL: http://arxiv.org/abs/2110.08583v1
- Date: Sat, 16 Oct 2021 14:34:25 GMT
- Title: ASR4REAL: An extended benchmark for speech models
- Authors: Morgane Riviere, Jade Copet, Gabriel Synnaeve
- Abstract summary: We introduce a set of benchmarks matching real-life conditions, aimed at spotting possible biases and weaknesses in models.
We find that although recent models do not seem to exhibit a gender bias, they usually show significant performance discrepancies across accents.
All tested models show a strong performance drop on conversational speech.
- Score: 19.348785785921446
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Popular ASR benchmarks such as Librispeech and Switchboard are limited in the diversity of settings and speakers they represent. We introduce a set of benchmarks matching real-life conditions, aimed at spotting possible biases and weaknesses in models. We find that although recent models do not seem to exhibit a gender bias, they usually show significant performance discrepancies by accent, and even larger ones depending on the socio-economic status of the speakers. Finally, all tested models show a strong performance drop when tested on conversational speech; in this context, even a language model trained on a dataset as large as Common Crawl does not seem to have a significant positive effect, which reiterates the importance of developing conversational language models.
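The comparison the abstract describes amounts to computing word error rate (WER) separately per speaker group (accent, gender, socio-economic status) and inspecting the gaps between groups. Below is a minimal sketch of that breakdown; the `jiwer` library call is real, but the utterance data, group labels, and overall layout are illustrative assumptions, not the paper's actual evaluation pipeline.

```python
# Sketch (not from the paper): per-group WER to surface accent/gender gaps.
from collections import defaultdict
import jiwer

# Each item: (reference transcript, ASR hypothesis, demographic group label).
# The examples and labels are purely illustrative.
utterances = [
    ("the weather is nice today", "the weather is nice today", "us_english"),
    ("the weather is nice today", "the weather is night today", "indian_english"),
    # ... more utterances with group labels ...
]

refs_by_group = defaultdict(list)
hyps_by_group = defaultdict(list)
for ref, hyp, group in utterances:
    refs_by_group[group].append(ref)
    hyps_by_group[group].append(hyp)

# WER computed per group; large gaps between groups point to a bias.
for group in refs_by_group:
    wer = jiwer.wer(refs_by_group[group], hyps_by_group[group])
    print(f"{group}: WER = {wer:.2%}")
```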
Related papers
- ASR Benchmarking: Need for a More Representative Conversational Dataset [3.017953715883516]
We introduce a multilingual conversational dataset, derived from TalkBank, consisting of unstructured phone conversations between adults.
Our results show a significant performance drop across various state-of-the-art ASR models when tested in conversational settings.
arXiv Detail & Related papers (2024-09-18T15:03:04Z) - Spoken Stereoset: On Evaluating Social Bias Toward Speaker in Speech Large Language Models [50.40276881893513]
This study introduces Spoken Stereoset, a dataset specifically designed to evaluate social biases in Speech Large Language Models (SLLMs).
By examining how different models respond to speech from diverse demographic groups, we aim to identify these biases.
The findings indicate that while most models show minimal bias, some still exhibit slightly stereotypical or anti-stereotypical tendencies.
arXiv Detail & Related papers (2024-08-14T16:55:06Z) - Robustness of Speech Separation Models for Similar-pitch Speakers [14.941946672578863]
Single-channel speech separation is a crucial task for enhancing speech recognition systems in multi-speaker environments.
This paper investigates the robustness of state-of-the-art Neural Network models in scenarios where the pitch differences between speakers are minimal.
arXiv Detail & Related papers (2024-07-22T15:55:08Z) - SpeechAlign: Aligning Speech Generation to Human Preferences [51.684183257809075]
We introduce SpeechAlign, an iterative self-improvement strategy that aligns speech language models to human preferences.
We show that SpeechAlign can bridge the distribution gap and facilitate continuous self-improvement of the speech language model.
arXiv Detail & Related papers (2024-04-08T15:21:17Z) - SpokenWOZ: A Large-Scale Speech-Text Benchmark for Spoken Task-Oriented Dialogue Agents [72.42049370297849]
SpokenWOZ is a large-scale speech-text dataset for spoken TOD.
Cross-turn slot and reasoning slot detection are new challenges for SpokenWOZ.
arXiv Detail & Related papers (2023-05-22T13:47:51Z) - The Ability of Self-Supervised Speech Models for Audio Representations [53.19715501273934]
Self-supervised learning (SSL) speech models have achieved unprecedented success in speech representation learning.
We conduct extensive experiments on abundant speech and non-speech audio datasets to evaluate the representation ability of state-of-the-art SSL speech models.
Results show that SSL speech models can extract meaningful features from a wide range of non-speech audio, while they may also fail on certain types of datasets.
arXiv Detail & Related papers (2022-09-26T15:21:06Z) - A Study of Gender Impact in Self-supervised Models for Speech-to-Text Systems [25.468558523679363]
We train and compare gender-specific wav2vec 2.0 models against models containing different degrees of gender balance in pre-training data.
We observe lower overall performance using gender-specific pre-training before fine-tuning an end-to-end ASR system.
arXiv Detail & Related papers (2022-04-04T11:28:19Z) - A Systematic Investigation of Commonsense Understanding in Large Language Models [23.430757316504316]
Large language models have shown impressive performance on many natural language processing (NLP) tasks in a zero-shot setting.
We ask whether these models exhibit commonsense understanding by evaluating models against four commonsense benchmarks.
arXiv Detail & Related papers (2021-10-31T22:20:36Z) - Language Models are Few-shot Multilingual Learners [66.11011385895195]
We evaluate the multilingual skills of the GPT and T5 models in conducting multi-class classification on non-English languages.
We show that, given a few English examples as context, pre-trained language models can predict not only English test samples but also non-English ones.
arXiv Detail & Related papers (2021-09-16T03:08:22Z) - Unsupervised Cross-lingual Representation Learning for Speech Recognition [63.85924123692923]
XLSR learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.
We build on wav2vec 2.0 which is trained by solving a contrastive task over masked latent speech representations.
Experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.
arXiv Detail & Related papers (2020-06-24T18:25:05Z)
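As a note on the last entry, XLSR checkpoints built on wav2vec 2.0 are publicly released, and the latent speech representations the summary mentions can be extracted roughly as sketched below through Hugging Face Transformers. The checkpoint name and the 16 kHz input assumption follow the public release; treat this as an illustrative sketch rather than the authors' original fairseq pipeline.

```python
# Sketch (assumed setup, not the paper's code): extract cross-lingual
# speech representations from a pretrained XLSR / wav2vec 2.0 checkpoint.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-large-xlsr-53")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-large-xlsr-53")

# One second of dummy 16 kHz audio stands in for a real recording.
waveform = torch.zeros(16000)
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    hidden_states = model(**inputs).last_hidden_state  # (batch, frames, features)
print(hidden_states.shape)
```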