The Perceptimatic English Benchmark for Speech Perception Models
- URL: http://arxiv.org/abs/2005.03418v1
- Date: Thu, 7 May 2020 12:35:44 GMT
- Title: The Perceptimatic English Benchmark for Speech Perception Models
- Authors: Juliette Millet and Ewan Dunbar
- Abstract summary: The benchmark consists of ABX stimuli along with the responses of 91 American English-speaking listeners.
We show that DeepSpeech, a standard English speech recognizer, is more specialized in English phoneme discrimination than English listeners are.
- Score: 11.646802225841153
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present the Perceptimatic English Benchmark, an open experimental
benchmark for evaluating quantitative models of speech perception in English.
The benchmark consists of ABX stimuli along with the responses of 91 American
English-speaking listeners. The stimuli test discrimination of a large number
of English and French phonemic contrasts. They are extracted directly from
corpora of read speech, making them appropriate for evaluating statistical
acoustic models (such as those used in automatic speech recognition) trained on
typical speech data sets. We show that human phone discrimination behaviour is
correlated with the predictions of several types of models, and give
recommendations for researchers seeking easily calculated norms of acoustic
distance on experimental stimuli. We show
that DeepSpeech, a standard English speech recognizer, is more specialized in
English phoneme discrimination than English listeners are, and correlates
poorly with their behaviour, even though it yields a low error rate on the
decision task given to humans.
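To make the evaluation concrete: a model is scored on an ABX trial by asking whether its representation of stimulus X is closer to that of A (the matching category) than to that of B, typically using a dynamic-time-warping (DTW) distance over frame-level features. Below is a minimal sketch of this decision, assuming each stimulus is already encoded as a (frames, dims) NumPy array (e.g. MFCCs or network activations); the function names and the frame-wise cosine distance are illustrative choices, not the benchmark's prescribed implementation.

```python
import numpy as np

def dtw_distance(x, y):
    """DTW distance between two (frames, dims) feature arrays, using a
    frame-wise cosine distance and length-normalising the accumulated cost."""
    nx, ny = len(x), len(y)
    # Frame-wise cosine distance matrix (epsilon guards zero-norm frames).
    xn = x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-9)
    yn = y / (np.linalg.norm(y, axis=1, keepdims=True) + 1e-9)
    cost = 1.0 - xn @ yn.T
    # Accumulate cost with the standard three-way DTW recurrence.
    acc = np.full((nx + 1, ny + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, nx + 1):
        for j in range(1, ny + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j],
                                                 acc[i, j - 1],
                                                 acc[i - 1, j - 1])
    # Normalise so long stimuli are not penalised for their length.
    return acc[nx, ny] / (nx + ny)

def abx_correct(a, b, x):
    """True if the model discriminates correctly on an ABX trial where X
    belongs to the same category as A: d(X, A) should be below d(X, B)."""
    return dtw_distance(x, a) < dtw_distance(x, b)
```

Averaging `abx_correct` over many trials gives a discrimination accuracy that can then be compared against the listeners' responses in the benchmark.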
Related papers
- Spoken Stereoset: On Evaluating Social Bias Toward Speaker in Speech Large Language Models [50.40276881893513]
This study introduces Spoken Stereoset, a dataset specifically designed to evaluate social biases in Speech Large Language Models (SLLMs).
By examining how different models respond to speech from diverse demographic groups, we aim to identify these biases.
The findings indicate that while most models show minimal bias, some still exhibit slightly stereotypical or anti-stereotypical tendencies.
arXiv Detail & Related papers (2024-08-14T16:55:06Z)
- BabySLM: language-acquisition-friendly benchmark of self-supervised spoken language models [56.93604813379634]
Self-supervised techniques for learning speech representations have been shown to develop linguistic competence from exposure to speech without the need for human labels.
We propose a language-acquisition-friendly benchmark to probe spoken language models at the lexical and syntactic levels.
We highlight two exciting challenges that need to be addressed for further progress: bridging the gap between text and speech and between clean speech and in-the-wild speech.
arXiv Detail & Related papers (2023-06-02T12:54:38Z)
- DDSupport: Language Learning Support System that Displays Differences and Distances from Model Speech [16.82591185507251]
We propose a new language learning support system that calculates speech scores and detects mispronunciations by beginners.
The proposed system uses deep learning-based speech processing to display the learner's pronunciation score and the difference/distance between the learner's pronunciation and that of a group of model speakers.
arXiv Detail & Related papers (2022-12-08T05:49:15Z)
- Evaluation of Automated Speech Recognition Systems for Conversational Speech: A Linguistic Perspective [0.0]
We take a linguistic perspective, using the French language as a case study of the disambiguation of French homophones.
Our contribution aims to provide more insight into human speech-transcription accuracy under conditions that reproduce those of state-of-the-art ASR systems.
arXiv Detail & Related papers (2022-11-05T04:35:40Z)
- M-SpeechCLIP: Leveraging Large-Scale, Pre-Trained Models for Multilingual Speech to Image Retrieval [56.49878599920353]
This work investigates the use of large-scale, English-only pre-trained models (CLIP and HuBERT) for multilingual image-speech retrieval.
For non-English image-speech retrieval, we outperform the current state-of-the-art performance by a wide margin both when training separate models for each language, and with a single model which processes speech in all three languages.
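At evaluation time, the retrieval setup described above reduces to ranking candidate images by the similarity of their embeddings to each speech query's embedding in a shared space. A minimal recall@k sketch follows, assuming paired speech and image embeddings have already been extracted (row i of each matrix belongs to the same pair); the function name and the cosine scoring are our assumptions, not the paper's exact protocol.

```python
import numpy as np

def recall_at_k(speech_emb, image_emb, k=1):
    """Speech-to-image retrieval: rank all images by cosine similarity to
    each speech query; row i of both matrices is assumed to be a true pair."""
    s = speech_emb / (np.linalg.norm(speech_emb, axis=1, keepdims=True) + 1e-9)
    v = image_emb / (np.linalg.norm(image_emb, axis=1, keepdims=True) + 1e-9)
    sims = s @ v.T                            # (n_speech, n_images) similarities
    topk = np.argsort(-sims, axis=1)[:, :k]   # indices of the k best images
    hits = (topk == np.arange(len(s))[:, None]).any(axis=1)
    return hits.mean()                        # fraction of queries answered in top k
```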
arXiv Detail & Related papers (2022-11-02T14:54:45Z)
- A study on native American English speech recognition by Indian listeners with varying word familiarity level [62.14295630922855]
We collect three kinds of responses from each listener as they recognize an utterance.
From these transcriptions, word error rate (WER) is calculated and used as a metric to evaluate the similarity between the recognized and the original sentences.
An analysis by speaker nativity shows that utterances from speakers of certain native-language backgrounds are more difficult for Indian listeners to recognize than those from other backgrounds.
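As a reference for the metric used above, WER is the word-level Levenshtein (edit) distance between the listener's transcription and the original sentence, normalised by the reference length. A minimal sketch (the function name is ours):

```python
import numpy as np

def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length,
    computed with a word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
    d[:, 0] = np.arange(len(ref) + 1)   # deleting all reference words
    d[0, :] = np.arange(len(hyp) + 1)   # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# e.g. word_error_rate("the cat sat", "the cat sat down") -> 0.333...
```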
arXiv Detail & Related papers (2021-12-08T07:43:38Z)
- ASR4REAL: An extended benchmark for speech models [19.348785785921446]
We introduce a set of benchmarks matching real-life conditions, aimed at spotting possible biases and weaknesses in models.
We find that, although recent models do not appear to exhibit gender bias, they usually show substantial performance discrepancies across accents.
All tested models show a strong performance drop when tested on conversational speech.
arXiv Detail & Related papers (2021-10-16T14:34:25Z)
- Perception Point: Identifying Critical Learning Periods in Speech for Bilingual Networks [58.24134321728942]
We compare deep neural-network-based visual lip-reading models and identify their cognitive aspects.
We observe a strong correlation between theories of critical learning periods in cognitive psychology and our modeling.
arXiv Detail & Related papers (2021-10-13T05:30:50Z)
- Perceptimatic: A human speech perception benchmark for unsupervised subword modelling [11.646802225841153]
We present a data set and methods to compare speech processing models and human behaviour on a phone discrimination task.
We provide Perceptimatic, an open data set which consists of French and English speech stimuli, as well as the results of 91 English- and 93 French-speaking listeners.
The stimuli test a wide range of French and English contrasts, and are extracted directly from corpora of natural running read speech.
We show that, unlike unsupervised models and supervised multilingual models, a standard supervised monolingual HMM-GMM phone recognition system, while good at discriminating phones, yields a representational space very different from that of human native listeners.
arXiv Detail & Related papers (2020-10-12T18:40:08Z)
- Unsupervised Cross-lingual Representation Learning for Speech Recognition [63.85924123692923]
XLSR learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.
We build on wav2vec 2.0 which is trained by solving a contrastive task over masked latent speech representations.
Experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.
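For intuition about the contrastive task mentioned above: at each masked time step, the model's context vector must identify the true latent representation among distractors, an InfoNCE-style objective. The sketch below is a heavily simplified, dense version that uses every other time step as a distractor; wav2vec 2.0 itself samples a limited distractor set, contrasts against quantised latents, and adds a codebook-diversity term.

```python
import numpy as np

def info_nce_loss(context, targets, temperature=0.1):
    """Contrastive loss in the spirit of wav2vec 2.0: for each masked step t,
    context[t] must pick out targets[t] among all other target vectors.
    context, targets: (timesteps, dims) arrays; row t of targets is the
    positive example for row t of context."""
    c = context / (np.linalg.norm(context, axis=1, keepdims=True) + 1e-9)
    q = targets / (np.linalg.norm(targets, axis=1, keepdims=True) + 1e-9)
    logits = (c @ q.T) / temperature                 # cosine similarities as logits
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    # Cross-entropy with the diagonal (the true latent) as the correct class.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```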
arXiv Detail & Related papers (2020-06-24T18:25:05Z)