Do self-supervised speech models develop human-like perception biases?
- URL: http://arxiv.org/abs/2205.15819v1
- Date: Tue, 31 May 2022 14:21:40 GMT
- Title: Do self-supervised speech models develop human-like perception biases?
- Authors: Juliette Millet, Ewan Dunbar
- Abstract summary: We examine the representational spaces of three kinds of state-of-the-art self-supervised models: wav2vec 2.0, HuBERT, and contrastive predictive coding (CPC).
We show that the CPC model shows a small native language effect, but that wav2vec 2.0 and HuBERT seem to develop a universal speech perception space which is not language specific.
A comparison against the predictions of supervised phone recognisers suggests that all three self-supervised models capture relatively fine-grained perceptual phenomena, while supervised models are better at capturing coarser, phone-level effects of listeners' native language on perception.
- Score: 11.646802225841153
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Self-supervised models for speech processing form representational spaces
without using any external labels. Increasingly, they appear to be a feasible
way of at least partially eliminating costly manual annotations, a problem of
particular concern for low-resource languages. But what kind of
representational spaces do these models construct? Human perception specializes
to the sounds of listeners' native languages. Does the same thing happen in
self-supervised models? We examine the representational spaces of three kinds
of state-of-the-art self-supervised models: wav2vec 2.0, HuBERT and contrastive
predictive coding (CPC), and compare them with the perceptual spaces of
French-speaking and English-speaking human listeners, both globally and taking
account of the behavioural differences between the two language groups. We show
that the CPC model shows a small native language effect, but that wav2vec 2.0
and HuBERT seem to develop a universal speech perception space which is not
language specific. A comparison against the predictions of supervised phone
recognisers suggests that all three self-supervised models capture relatively
fine-grained perceptual phenomena, while supervised models are better at
capturing coarser, phone-level effects of listeners' native language on perception.
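As a rough illustration of how such a comparison can be set up, the sketch below extracts wav2vec 2.0 representations for three stimuli and computes an ABX-style delta (how much closer a target X is to A than to B), which can then be correlated with human discrimination accuracy. This is a minimal sketch, not the authors' pipeline: the checkpoint name, file names, mean-pooling over frames, and the use of cosine similarity are all illustrative assumptions.

```python
# Minimal sketch (not the paper's exact method): score an ABX triplet with
# wav2vec 2.0 representations and compare the result to human behaviour.
import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

MODEL_NAME = "facebook/wav2vec2-base"  # assumption: any pretrained checkpoint
extractor = Wav2Vec2FeatureExtractor.from_pretrained(MODEL_NAME)
model = Wav2Vec2Model.from_pretrained(MODEL_NAME).eval()

def embed(path: str) -> torch.Tensor:
    """Mean-pooled hidden states for one audio file, resampled to 16 kHz mono."""
    wav, sr = torchaudio.load(path)
    wav = torchaudio.functional.resample(wav, sr, 16000).mean(dim=0)
    inputs = extractor(wav.numpy(), sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        hidden = model(inputs.input_values).last_hidden_state  # (1, frames, dim)
    return hidden.mean(dim=1).squeeze(0)

def abx_delta(a_path: str, b_path: str, x_path: str) -> float:
    """Positive when X lies closer to A than to B in the representation space."""
    a, b, x = embed(a_path), embed(b_path), embed(x_path)
    cos = torch.nn.functional.cosine_similarity
    return (cos(x, a, dim=0) - cos(x, b, dim=0)).item()

# Hypothetical stimuli: larger deltas should predict easier A-vs-B discrimination;
# correlating deltas with listeners' accuracies probes for perception biases.
print(abx_delta("stimulus_A.wav", "stimulus_B.wav", "stimulus_X.wav"))
```

Repeating this kind of measurement over language-specific contrasts, separately for each listener group, is what allows a native language effect to be tested.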
Related papers
- Probing self-attention in self-supervised speech models for cross-linguistic differences [0.0]
We study the self-attention mechanisms of one small self-supervised speech transformer model (TERA)
We find that even with a small model, the attention heads learned are diverse, ranging from almost entirely diagonal to almost entirely global, regardless of the training language.
We highlight some notable differences in attention patterns between Turkish and English and demonstrate that the models do learn important phonological information during pretraining.
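To make the diagonal-versus-global distinction concrete, one simple way to score a head is by how far its attention mass lies from the main diagonal. The snippet below is a hypothetical metric on stand-in data, not the measure used in the paper.

```python
# Hypothetical diagonality score for self-attention heads (illustrative only).
import torch

def mean_offset(attn: torch.Tensor) -> torch.Tensor:
    """attn: (heads, T, T) row-normalised attention weights.
    Returns each head's average |query - key| distance, weighted by attention:
    near 0 for diagonal (local) heads, large for global heads."""
    T = attn.shape[-1]
    idx = torch.arange(T, dtype=torch.float32)
    offset = (idx[:, None] - idx[None, :]).abs()      # (T, T) distance from diagonal
    return (attn * offset).sum(dim=-1).mean(dim=-1)   # (heads,)

# Stand-in data: 12 heads over 50 frames; real weights would come from the model.
attn = torch.softmax(torch.randn(12, 50, 50), dim=-1)
print(mean_offset(attn))
```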
arXiv Detail & Related papers (2024-09-04T22:47:33Z)
- Human-like Linguistic Biases in Neural Speech Models: Phonetic Categorization and Phonotactic Constraints in Wav2Vec2.0 [0.11510009152620666]
We study how Wav2Vec2 resolves phonotactic constraints.
We synthesize sounds on an acoustic continuum between /l/ and /r/ and embed them in controlled contexts.
Like humans, Wav2Vec2 models show a bias towards the phonotactically admissible category when processing such ambiguous sounds.
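A sketch of how such a continuum analysis can be scored is shown below: given representations of each continuum step in two contexts, categorise each step by its nearest /l/ or /r/ prototype and compare where the boundary falls. The prototypes, embeddings, and nearest-prototype rule are stand-ins, not the paper's procedure.

```python
# Hypothetical scoring of an /l/-/r/ continuum in two contexts (stand-in data).
import numpy as np

rng = np.random.default_rng(0)
dim, steps = 768, 11
proto_l = rng.standard_normal(dim)  # stand-in: mean embedding of clear /l/ tokens
proto_r = rng.standard_normal(dim)  # stand-in: mean embedding of clear /r/ tokens

def share_l(continuum: np.ndarray) -> float:
    """continuum: (steps, dim) embeddings ordered from /l/-like to /r/-like.
    Returns the fraction of steps categorised as /l/ by a nearest-prototype rule."""
    d_l = np.linalg.norm(continuum - proto_l, axis=1)
    d_r = np.linalg.norm(continuum - proto_r, axis=1)
    return float(np.mean(d_l < d_r))

# With real embeddings, a phonotactic bias would appear as a boundary shift
# between a context that permits /l/ and one that permits /r/.
ctx_l_legal = rng.standard_normal((steps, dim))
ctx_r_legal = rng.standard_normal((steps, dim))
print(share_l(ctx_l_legal), share_l(ctx_r_legal))
```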
arXiv Detail & Related papers (2024-07-03T11:04:31Z)
- SpeechAlign: Aligning Speech Generation to Human Preferences [51.684183257809075]
We introduce SpeechAlign, an iterative self-improvement strategy that aligns speech language models to human preferences.
We show that SpeechAlign can bridge the distribution gap and facilitate continuous self-improvement of the speech language model.
arXiv Detail & Related papers (2024-04-08T15:21:17Z)
- Multilingual self-supervised speech representations improve the speech recognition of low-resource African languages with codeswitching [65.74653592668743]
Finetuning self-supervised multilingual representations reduces absolute word error rates by up to 20%.
In circumstances with limited training data, finetuning self-supervised representations is a better-performing and viable solution.
arXiv Detail & Related papers (2023-11-25T17:05:21Z)
- Do self-supervised speech and language models extract similar representations as human brain? [2.390915090736061]
Speech and language models trained through self-supervised learning (SSL) demonstrate strong alignment with brain activity during speech and language perception.
We evaluate the brain prediction performance of two representative SSL models, Wav2Vec2.0 and GPT-2.
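For context, "brain prediction performance" is usually measured with an encoding model of the following generic shape: a linear (here ridge) regression from model features to neural responses, scored by held-out correlation. The arrays below are stand-in data, and the regressor and metric are assumptions rather than the cited paper's exact pipeline.

```python
# Generic encoding-model sketch (stand-in data; not the cited paper's pipeline).
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
features = rng.standard_normal((1000, 768))  # e.g. time-aligned Wav2Vec2.0 states
responses = rng.standard_normal((1000, 50))  # e.g. 50 electrodes or voxels

X_tr, X_te, y_tr, y_te = train_test_split(features, responses, test_size=0.2, random_state=0)
pred = Ridge(alpha=10.0).fit(X_tr, y_tr).predict(X_te)

# "Brain prediction performance": Pearson r per recording channel on held-out data.
r = [np.corrcoef(pred[:, i], y_te[:, i])[0, 1] for i in range(y_te.shape[1])]
print(f"mean held-out correlation: {np.mean(r):.3f}")
```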
arXiv Detail & Related papers (2023-10-07T01:39:56Z)
- ABINet++: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Spotting [121.11880210592497]
We argue that the limited capacity of language models comes from 1) implicit language modeling; 2) unidirectional feature representation; and 3) a language model with noisy input.
We propose an autonomous, bidirectional and iterative ABINet++ for scene text spotting.
arXiv Detail & Related papers (2022-11-19T03:50:33Z)
- Predicting non-native speech perception using the Perceptual Assimilation Model and state-of-the-art acoustic models [9.858745856649998]
We present a new, open dataset of French- and English-speaking participants' speech perception behaviour for 61 vowel sounds.
We show that phoneme assimilation is a better predictor of the discrimination behaviour as a whole than fine-grained phonetic modelling.
We also show that wav2vec 2.0, while not good at capturing the effects of native language on speech perception, is complementary to information about native phoneme assimilation.
arXiv Detail & Related papers (2022-05-31T14:25:59Z)
- Self-supervised models of audio effectively explain human cortical responses to speech [71.57870452667369]
We capitalize on the progress of self-supervised speech representation learning to create new state-of-the-art models of the human auditory system.
We show that self-supervised models effectively capture the hierarchy of information relevant to different stages of speech processing in human cortex.
arXiv Detail & Related papers (2022-05-27T22:04:02Z)
- A Brief Overview of Unsupervised Neural Speech Representation Learning [12.850357461259197]
We review the development of unsupervised representation learning for speech over the last decade.
We identify two primary model categories: self-supervised methods and probabilistic latent variable models.
arXiv Detail & Related papers (2022-03-01T11:15:35Z)
- Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition [80.446770909975]
Linguistic knowledge is of great benefit to scene text recognition.
How to effectively model linguistic rules in end-to-end deep networks remains a research challenge.
We propose an autonomous, bidirectional and iterative ABINet for scene text recognition.
arXiv Detail & Related papers (2021-03-11T06:47:45Z)
- "Notic My Speech" -- Blending Speech Patterns With Multimedia [65.91370924641862]
We propose a view-temporal attention mechanism to model both the view dependence and the visemic importance in speech recognition and understanding.
Our proposed method outperformed the existing work by 4.99% in terms of the viseme error rate.
We show that there is a strong correlation between our model's understanding of multi-view speech and human perception.
arXiv Detail & Related papers (2020-06-12T06:51:55Z)