Self-supervised models of audio effectively explain human cortical
responses to speech
- URL: http://arxiv.org/abs/2205.14252v1
- Date: Fri, 27 May 2022 22:04:02 GMT
- Title: Self-supervised models of audio effectively explain human cortical
responses to speech
- Authors: Aditya R. Vaidya, Shailee Jain, Alexander G. Huth
- Abstract summary: We capitalize on the progress of self-supervised speech representation learning to create new state-of-the-art models of the human auditory system.
These results show that self-supervised models effectively capture the hierarchy of information relevant to different stages of speech processing in human cortex.
- Score: 71.57870452667369
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Self-supervised language models are very effective at predicting high-level
cortical responses during language comprehension. However, the best current
models of lower-level auditory processing in the human brain rely on either
hand-constructed acoustic filters or representations from supervised audio
neural networks. In this work, we capitalize on the progress of self-supervised
speech representation learning (SSL) to create new state-of-the-art models of
the human auditory system. Compared against acoustic baselines, phonemic
features, and supervised models, representations from the middle layers of
self-supervised models (APC, wav2vec, wav2vec 2.0, and HuBERT) consistently
yield the best prediction performance for fMRI recordings within the auditory
cortex (AC). Brain areas involved in low-level auditory processing exhibit a
preference for earlier SSL model layers, whereas higher-level semantic areas
prefer later layers. We show that these trends are due to the models' ability
to encode information at multiple linguistic levels (acoustic, phonetic, and
lexical) along their representation depth. Overall, these results show that
self-supervised models effectively capture the hierarchy of information
relevant to different stages of speech processing in human cortex.
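A minimal sketch of the encoding-model recipe the abstract describes, under stated assumptions: hidden states are taken from every layer of a pretrained wav2vec 2.0 checkpoint (here the Hugging Face "facebook/wav2vec2-base" model, an assumption rather than the paper's exact setup), averaged into fMRI-sized time bins, and regressed onto voxel responses with a per-layer ridge model. The waveform and "fMRI" arrays are random stand-ins, and details such as hemodynamic delays and regularization search are omitted.

```python
# A minimal sketch, not the paper's released code: extract per-layer hidden
# states from a pretrained wav2vec 2.0 model, average them into fMRI-sized
# time bins, and fit one ridge regression per layer to predict voxel responses.
# Checkpoint name, TR count, and the random stimulus/fMRI arrays are
# assumptions for illustration; delays and hyperparameter search are omitted.
import numpy as np
import torch
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from transformers import Wav2Vec2Model

SR = 16_000          # wav2vec 2.0 expects 16 kHz audio
N_TRS = 30           # assumed number of fMRI volumes for a 60 s stimulus

model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
model.eval()

waveform = torch.randn(1, 60 * SR)          # stand-in for a 60 s speech stimulus
fmri = np.random.randn(N_TRS, 500)          # stand-in voxel responses (TRs x voxels)

with torch.no_grad():
    outputs = model(waveform, output_hidden_states=True)
hidden_states = outputs.hidden_states        # tuple: embeddings + one tensor per layer

def bin_to_trs(layer_feats: torch.Tensor, n_trs: int) -> np.ndarray:
    """Average frame-level features into n_trs equal time bins."""
    frames = layer_feats.squeeze(0).numpy()
    return np.stack([b.mean(axis=0) for b in np.array_split(frames, n_trs)])

for layer_idx, layer in enumerate(hidden_states):
    X = bin_to_trs(layer, N_TRS)
    X_tr, X_te, y_tr, y_te = train_test_split(X, fmri, test_size=0.3, random_state=0)
    r2 = Ridge(alpha=10.0).fit(X_tr, y_tr).score(X_te, y_te)
    print(f"layer {layer_idx:2d}: mean R^2 = {r2:.3f}")
```

Comparing such per-layer scores across brain regions is what yields the layer-preference pattern the abstract reports: early layers fit low-level auditory areas best, while later layers fit higher-level semantic areas.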
Related papers
- Do self-supervised speech and language models extract similar
representations as human brain? [2.390915090736061]
Speech and language models trained through self-supervised learning (SSL) demonstrate strong alignment with brain activity during speech and language perception.
We evaluate the brain prediction performance of two representative SSL models, Wav2Vec2.0 and GPT-2.
arXiv Detail & Related papers (2023-10-07T01:39:56Z)
- Probing self-supervised speech models for phonetic and phonemic
information: a case study in aspiration [17.94683764469626]
We evaluate the extent to which these models' learned representations align with basic representational distinctions made by humans.
We find that robust representations of both phonetic and phonemic distinctions emerge in early layers of these models' architectures.
Our findings show that speech-trained HuBERT derives a low-noise and low-dimensional subspace corresponding to abstract phonological distinctions.
arXiv Detail & Related papers (2023-06-09T20:07:22Z)
- Self-supervised Neural Factor Analysis for Disentangling Utterance-level
Speech Representations [30.293081541301746]
Self-supervised learning (SSL) speech models such as wav2vec and HuBERT have demonstrated state-of-the-art performance on automatic speech recognition.
We argue that the problem is caused by the lack of disentangled representations and an utterance-level learning objective.
Our models outperform the current best model, WavLM, on all utterance-level non-semantic tasks on the SUPERB benchmark with only 20% of labeled data.
arXiv Detail & Related papers (2023-05-14T08:26:24Z)
- Analysing the Impact of Audio Quality on the Use of Naturalistic
Long-Form Recordings for Infant-Directed Speech Research [62.997667081978825]
Modelling of early language acquisition aims to understand how infants bootstrap their language skills.
Recent developments have enabled the use of more naturalistic training data for computational models.
It is currently unclear how the sound quality could affect analyses and modelling experiments conducted on such data.
arXiv Detail & Related papers (2023-05-03T08:25:37Z)
- DDKtor: Automatic Diadochokinetic Speech Analysis [13.68342426889044]
This paper presents two deep neural network models that automatically segment consonants and vowels from unannotated, untranscribed speech.
Results on a dataset of young, healthy individuals show that our LSTM model outperforms current state-of-the-art systems.
The LSTM model also performs comparably to trained human annotators when evaluated on an unseen dataset of older individuals with Parkinson's Disease.
arXiv Detail & Related papers (2022-06-29T13:34:03Z)
- Self-Supervised Learning for speech recognition with Intermediate layer
supervision [52.93758711230248]
We propose Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL)
ILS-SSL forces the model to concentrate on content information as much as possible by adding an additional SSL loss on the intermediate layers.
Experiments on the LibriSpeech test-other set show that our method significantly outperforms HuBERT.
arXiv Detail & Related papers (2021-12-16T10:45:05Z)
- LDNet: Unified Listener Dependent Modeling in MOS Prediction for
Synthetic Speech [67.88748572167309]
We present LDNet, a unified framework for mean opinion score (MOS) prediction.
We propose two inference methods that provide more stable results and efficient computation.
arXiv Detail & Related papers (2021-10-18T08:52:31Z)
- Self-supervised Text-independent Speaker Verification using Prototypical
Momentum Contrastive Learning [58.14807331265752]
We show that better speaker embeddings can be learned by momentum contrastive learning.
We generalize the self-supervised framework to a semi-supervised scenario where only a small portion of the data is labeled.
arXiv Detail & Related papers (2020-12-13T23:23:39Z)
- Audio ALBERT: A Lite BERT for Self-supervised Learning of Audio
Representation [51.37980448183019]
We propose Audio ALBERT, a lite version of the self-supervised speech representation model.
We show that Audio ALBERT achieves performance competitive with those much larger models on the downstream tasks.
In probing experiments, we find that the latent representations encode richer phoneme and speaker information than the last layer (a layer-wise probing sketch follows this list).
arXiv Detail & Related papers (2020-05-18T10:42:44Z)
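Several of the entries above (the aspiration case study and the Audio ALBERT probing experiments) rely on layer-wise probing: fitting a small supervised classifier on each layer's representations to see where a linguistic distinction becomes linearly decodable. The sketch below illustrates that procedure only; the features and labels are random placeholders standing in for, e.g., HuBERT frame embeddings aligned to phone annotations, and the layer count and feature dimension are assumptions.

```python
# A minimal layer-wise probing sketch with placeholder data: the arrays below
# stand in for per-layer segment embeddings (e.g. from HuBERT) and binary
# phonetic labels (e.g. aspirated vs. unaspirated stops); the layer count and
# feature dimension are assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_layers, n_segments, dim = 13, 400, 768

features = rng.standard_normal((n_layers, n_segments, dim))  # placeholder embeddings
labels = rng.integers(0, 2, size=n_segments)                 # placeholder labels

# Accuracy well above chance at a given layer indicates the distinction is
# linearly decodable from that layer's representations.
for layer in range(n_layers):
    probe = LogisticRegression(max_iter=1000)
    acc = cross_val_score(probe, features[layer], labels, cv=5).mean()
    print(f"layer {layer:2d}: probe accuracy = {acc:.3f}")
```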
This list is automatically generated from the titles and abstracts of the papers on this site.