Predicting non-native speech perception using the Perceptual
Assimilation Model and state-of-the-art acoustic models
- URL: http://arxiv.org/abs/2205.15823v1
- Date: Tue, 31 May 2022 14:25:59 GMT
- Title: Predicting non-native speech perception using the Perceptual
Assimilation Model and state-of-the-art acoustic models
- Authors: Juliette Millet, Ioana Chitoran, Ewan Dunbar
- Abstract summary: We present a new, open dataset of French- and English-speaking participants' speech perception behaviour for 61 vowel sounds.
We show that phoneme assimilation is a better predictor than fine-grained phonetic modelling, both for the discrimination behaviour as a whole and for predicting differences in discriminability associated with native language background.
We also show that wav2vec 2.0, while not good at capturing the effects of native language on speech perception, is complementary to information about native phoneme assimilation.
- Score: 9.858745856649998
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Our native language influences the way we perceive speech sounds, affecting
our ability to discriminate non-native sounds. We compare two ideas about the
influence of the native language on speech perception: the Perceptual
Assimilation Model, which appeals to a mental classification of sounds into
native phoneme categories, versus the idea that rich, fine-grained phonetic
representations tuned to the statistics of the native language are sufficient.
We operationalize this idea using representations from two state-of-the-art
speech models, a Dirichlet process Gaussian mixture model and the more recent
wav2vec 2.0 model. We present a new, open dataset of French- and
English-speaking participants' speech perception behaviour for 61 vowel sounds
from six languages. We show that phoneme assimilation is a better predictor
than fine-grained phonetic modelling, both for the discrimination behaviour as
a whole, and for predicting differences in discriminability associated with
differences in native language background. We also show that wav2vec 2.0, while
not good at capturing the effects of native language on speech perception, is
complementary to information about native phoneme assimilation, and provides a
good model of low-level phonetic representations, supporting the idea that both
categorical and fine-grained perception are used during speech perception.
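To make the fine-grained alternative concrete, the following is a minimal sketch (not the authors' released code) of how frame-level representations from an acoustic model such as wav2vec 2.0 or a DPGMM could be turned into an ABX-style discriminability score: a vowel contrast is predicted to be easier to discriminate when a probe stimulus X lies closer, under a dynamic-time-warping alignment, to a token of the matching category (A) than to a token of the mismatching category (B). The cosine frame distance, the length normalisation, and all function names are illustrative assumptions.

```python
# Hedged sketch: ABX-style discriminability from frame-level model representations.
import numpy as np
from scipy.spatial.distance import cdist


def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """DTW alignment cost between two (frames x dims) representation matrices,
    using cosine frame distances and normalised by the summed sequence lengths."""
    cost = cdist(a, b, metric="cosine")                  # (len_a, len_b) frame costs
    acc = np.full((len(a) + 1, len(b) + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1]
            )
    return acc[len(a), len(b)] / (len(a) + len(b))


def abx_delta(target: np.ndarray, other: np.ndarray, probe: np.ndarray) -> float:
    """Positive when the probe X is closer to the matching-category token A (target)
    than to the mismatching token B (other); larger deltas predict easier discrimination."""
    return dtw_distance(probe, other) - dtw_distance(probe, target)


# Example with dummy 20-frame, 768-dimensional representations (e.g. wav2vec 2.0 frames).
rng = np.random.default_rng(0)
a, b, x = (rng.normal(size=(20, 768)) for _ in range(3))
print(abx_delta(a, b, x))
```

In the paper's comparison, such delta values (and, for the Perceptual Assimilation Model, scores derived from native phoneme-assimilation judgements) serve as competing predictors of listeners' discrimination behaviour.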
Related papers
- Multilingual self-supervised speech representations improve the speech
recognition of low-resource African languages with codeswitching [65.74653592668743]
Finetuning self-supervised multilingual representations reduces absolute word error rates by up to 20%.
In circumstances with limited training data, finetuning self-supervised representations is a better-performing and viable solution.
arXiv Detail & Related papers (2023-11-25T17:05:21Z)
- Do self-supervised speech and language models extract similar representations as human brain? [2.390915090736061]
Speech and language models trained through self-supervised learning (SSL) demonstrate strong alignment with brain activity during speech and language perception.
We evaluate the brain prediction performance of two representative SSL models, Wav2Vec2.0 and GPT-2.
arXiv Detail & Related papers (2023-10-07T01:39:56Z)
- Do self-supervised speech models develop human-like perception biases? [11.646802225841153]
We examine the representational spaces of three kinds of state-of-the-art self-supervised models: wav2vec 2.0, HuBERT and contrastive predictive coding (CPC).
We show that the CPC model exhibits a small native language effect, while wav2vec 2.0 and HuBERT seem to develop a universal speech perception space which is not language specific.
A comparison against the predictions of supervised phone recognisers suggests that all three self-supervised models capture relatively fine-grained perceptual phenomena, while supervised models are better at capturing coarser, phone-level effects of listeners' native language on perception.
arXiv Detail & Related papers (2022-05-31T14:21:40Z)
- Perception Point: Identifying Critical Learning Periods in Speech for Bilingual Networks [58.24134321728942]
We compare and identify cognitive aspects of deep neural network-based visual lip-reading models.
We observe a strong correlation between these theories in cognitive psychology and our unique modeling.
arXiv Detail & Related papers (2021-10-13T05:30:50Z)
- Wav-BERT: Cooperative Acoustic and Linguistic Representation Learning for Low-Resource Speech Recognition [159.9312272042253]
Wav-BERT is a cooperative acoustic and linguistic representation learning method.
We unify a pre-trained acoustic model (wav2vec 2.0) and a language model (BERT) into an end-to-end trainable framework.
arXiv Detail & Related papers (2021-09-19T16:39:22Z)
- A phonetic model of non-native spoken word processing [40.018538874161756]
We train a computational model of phonetic learning, which has no access to phonology, on either one or two languages.
We first show that the model exhibits predictable behaviors on phone-level and word-level discrimination tasks.
We then test the model on a spoken word processing task, showing that phonology may not be necessary to explain some of the word processing effects observed in non-native speakers.
arXiv Detail & Related papers (2021-01-27T11:46:21Z)
- Learning Explicit Prosody Models and Deep Speaker Embeddings for Atypical Voice Conversion [60.808838088376675]
We propose a VC system with explicit prosodic modelling and deep speaker embedding learning.
A prosody corrector takes in phoneme embeddings to infer typical phoneme duration and pitch values.
A conversion model takes phoneme embeddings and typical prosody features as inputs to generate the converted speech.
arXiv Detail & Related papers (2020-11-03T13:08:53Z)
- Perceptimatic: A human speech perception benchmark for unsupervised subword modelling [11.646802225841153]
We present a data set and methods to compare speech processing models and human behaviour on a phone discrimination task.
We provide Perceptimatic, an open data set which consists of French and English speech stimuli, as well as the results of 91 English- and 93 French-speaking listeners.
The stimuli test a wide range of French and English contrasts, and are extracted directly from corpora of natural running read speech.
We show that, unlike unsupervised models and supervised multilingual models, a standard supervised monolingual HMM-GMM phone recognition system, while good at discriminating phones, yields a representational space very different from that of human native listeners.
arXiv Detail & Related papers (2020-10-12T18:40:08Z)
- Unsupervised Cross-lingual Representation Learning for Speech Recognition [63.85924123692923]
XLSR learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.
We build on wav2vec 2.0, which is trained by solving a contrastive task over masked latent speech representations (see the sketch after this list).
Experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.
arXiv Detail & Related papers (2020-06-24T18:25:05Z)
- "Notic My Speech" -- Blending Speech Patterns With Multimedia [65.91370924641862]
We propose a view-temporal attention mechanism to model both the view dependence and the visemic importance in speech recognition and understanding.
Our proposed method outperformed the existing work by 4.99% in terms of the viseme error rate.
We show that there is a strong correlation between our model's understanding of multi-view speech and human perception.
arXiv Detail & Related papers (2020-06-12T06:51:55Z)
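The XLSR entry above builds on the wav2vec 2.0 pretraining objective; below is a hedged sketch (not the fairseq implementation) of that masked contrastive task: for each masked timestep, the model's context vector must identify the true quantized latent among distractors drawn from other masked positions. The tensor layout, distractor count, and temperature are illustrative assumptions.

```python
# Hedged sketch of a wav2vec 2.0-style masked contrastive (InfoNCE) objective.
import torch
import torch.nn.functional as F


def contrastive_loss(context: torch.Tensor,    # (T, D) contextual vectors at masked timesteps
                     quantized: torch.Tensor,  # (T, D) true quantized latents for those timesteps
                     num_distractors: int = 100,
                     temperature: float = 0.1) -> torch.Tensor:
    """For each masked step, pick the true quantized latent among sampled distractors."""
    T = context.shape[0]
    losses = []
    for t in range(T):
        # Sample distractor latents uniformly from the other masked timesteps.
        idx = torch.randperm(T - 1)[:num_distractors]
        idx = idx + (idx >= t).long()  # shift indices so the positive at t is skipped
        candidates = torch.cat([quantized[t:t + 1], quantized[idx]], dim=0)  # positive first
        # Cosine similarity between the context vector and each candidate latent.
        sims = F.cosine_similarity(context[t:t + 1], candidates, dim=-1) / temperature
        # The positive sits at index 0, so the target class is 0.
        losses.append(F.cross_entropy(sims.unsqueeze(0), torch.zeros(1, dtype=torch.long)))
    return torch.stack(losses).mean()
```

Cross-lingual variants such as XLSR apply this same objective to raw unlabelled speech pooled from multiple languages, which is how a single pretrained model comes to encode cross-lingual representations.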
This list is automatically generated from the titles and abstracts of the papers on this site.