Do self-supervised speech and language models extract similar
representations as human brain?
- URL: http://arxiv.org/abs/2310.04645v2
- Date: Wed, 31 Jan 2024 09:54:43 GMT
- Title: Do self-supervised speech and language models extract similar
representations as human brain?
- Authors: Peili Chen, Linyang He, Li Fu, Lu Fan, Edward F. Chang, Yuanning Li
- Abstract summary: Speech and language models trained through self-supervised learning (SSL) demonstrate strong alignment with brain activity during speech and language perception.
We evaluate the brain prediction performance of two representative SSL models, Wav2Vec2.0 and GPT-2.
- Score: 2.390915090736061
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Speech and language models trained through self-supervised learning (SSL)
demonstrate strong alignment with brain activity during speech and language
perception. However, given their distinct training modalities, it remains
unclear whether they correlate with the same neural aspects. We directly
address this question by evaluating the brain prediction performance of two
representative SSL models, Wav2Vec2.0 and GPT-2, designed for speech and
language tasks. Our findings reveal that both models accurately predict speech
responses in the auditory cortex, with a significant correlation between their
brain predictions. Notably, shared speech contextual information between
Wav2Vec2.0 and GPT-2 accounts for the majority of explained variance in brain
activity, surpassing static semantic and lower-level acoustic-phonetic
information. These results underscore the convergence of speech contextual
representations in SSL models and their alignment with the neural network
underlying speech perception, offering valuable insights into both SSL models
and the neural basis of speech and language processing.
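To make the analysis pipeline concrete, below is a minimal sketch of the kind of cross-validated ridge-regression encoding analysis the abstract describes, with synthetic placeholder arrays standing in for time-aligned Wav2Vec2.0/GPT-2 hidden states and ECoG responses. The data shapes, regularization strength, and cross-validation scheme are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of an encoding analysis: predict neural responses from model
# features with ridge regression, then compare the two models' predictions.
# All data here are synthetic placeholders, not the paper's actual features/recordings.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n_time, n_elec = 2000, 64          # time points and electrodes (placeholder sizes)
d_w2v, d_gpt = 768, 768            # hidden sizes of wav2vec2-base and GPT-2 small

# Placeholder features and responses; in the real analysis these would be model
# hidden states aligned to the speech stimulus and recorded ECoG activity.
shared = rng.standard_normal((n_time, 100))            # shared contextual component
X_w2v = np.hstack([shared, rng.standard_normal((n_time, d_w2v - 100))])
X_gpt = np.hstack([shared, rng.standard_normal((n_time, d_gpt - 100))])
Y = shared @ rng.standard_normal((100, n_elec)) + 0.5 * rng.standard_normal((n_time, n_elec))

def encoding_accuracy(X, Y, alpha=100.0, n_splits=5):
    """Cross-validated Pearson r between predicted and actual response, per electrode."""
    preds = np.zeros_like(Y)
    for train, test in KFold(n_splits).split(X):
        model = Ridge(alpha=alpha).fit(X[train], Y[train])
        preds[test] = model.predict(X[test])
    r = np.array([np.corrcoef(preds[:, e], Y[:, e])[0, 1] for e in range(Y.shape[1])])
    return r, preds

r_w2v, pred_w2v = encoding_accuracy(X_w2v, Y)
r_gpt, pred_gpt = encoding_accuracy(X_gpt, Y)

# Per-electrode correlation between the two models' brain predictions,
# analogous to the shared-prediction comparison summarized above.
shared_pred = np.mean([np.corrcoef(pred_w2v[:, e], pred_gpt[:, e])[0, 1]
                       for e in range(n_elec)])
print(f"mean r (Wav2Vec2 features): {r_w2v.mean():.3f}")
print(f"mean r (GPT-2 features):    {r_gpt.mean():.3f}")
print(f"mean correlation between model predictions: {shared_pred:.3f}")
```

In the actual study, the feature matrices would be the models' hidden-layer activations aligned to the speech stimulus, and the per-electrode correlation between the two models' predictions is one way to quantify how much of the explained variance they share.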
Related papers
- Towards Unified Neural Decoding of Perceived, Spoken and Imagined Speech from EEG Signals [1.33134751838052]
This research investigated the effectiveness of deep learning models for non-invasive neural signal decoding.
It focused on distinguishing between different speech paradigms, including perceived, overt, whispered, and imagined speech.
arXiv Detail & Related papers (2024-11-14T07:20:08Z)
- Improving semantic understanding in speech language models via brain-tuning [19.732593005537606]
Speech language models align with human brain responses to natural language to an impressive degree.
Current models rely heavily on low-level speech features, indicating they lack brain-relevant semantics.
We address this limitation by inducing brain-relevant bias directly into the models via fine-tuning with fMRI recordings (see the sketch after this list).
arXiv Detail & Related papers (2024-10-11T20:06:21Z)
- What Do Self-Supervised Speech and Speaker Models Learn? New Findings From a Cross Model Layer-Wise Analysis [44.93152068353389]
Self-supervised learning (SSL) has attracted increased attention for learning meaningful speech representations.
Speaker SSL models adopt utterance-level training objectives primarily for speaker representation.
arXiv Detail & Related papers (2024-01-31T07:23:22Z)
- Speech language models lack important brain-relevant semantics [6.626540321463248]
Recent work has shown that text-based language models predict both text-evoked and speech-evoked brain activity to an impressive degree.
This poses the question of what types of information language models truly predict in the brain.
arXiv Detail & Related papers (2023-11-08T13:11:48Z)
- The Ability of Self-Supervised Speech Models for Audio Representations [53.19715501273934]
Self-supervised learning (SSL) speech models have achieved unprecedented success in speech representation learning.
We conduct extensive experiments on abundant speech and non-speech audio datasets to evaluate the representation ability of state-of-the-art SSL speech models.
Results show that SSL speech models can extract meaningful features from a wide range of non-speech audio, although they may fail on certain types of datasets.
arXiv Detail & Related papers (2022-09-26T15:21:06Z)
- Neural Language Models are not Born Equal to Fit Brain Data, but Training Helps [75.84770193489639]
We examine the impact of test loss, training corpus and model architecture on the prediction of functional Magnetic Resonance Imaging timecourses of participants listening to an audiobook.
We find that untrained versions of each model already explain a significant amount of signal in the brain by capturing the similarity of brain responses across identical words.
We suggest good practices for future studies aiming at explaining the human language system using neural language models.
arXiv Detail & Related papers (2022-07-07T15:37:17Z)
- Toward a realistic model of speech processing in the brain with self-supervised learning [67.7130239674153]
Self-supervised algorithms trained on the raw waveform constitute a promising candidate for modeling speech processing in the brain.
We show that Wav2Vec 2.0 learns brain-like representations with as little as 600 hours of unlabelled speech.
arXiv Detail & Related papers (2022-06-03T17:01:46Z)
- Self-supervised models of audio effectively explain human cortical responses to speech [71.57870452667369]
We capitalize on the progress of self-supervised speech representation learning to create new state-of-the-art models of the human auditory system.
These results show that self-supervised models effectively capture the hierarchy of information relevant to different stages of speech processing in the human cortex.
arXiv Detail & Related papers (2022-05-27T22:04:02Z)
- Perception Point: Identifying Critical Learning Periods in Speech for Bilingual Networks [58.24134321728942]
We compare and identify cognitive aspects of deep neural network-based visual lip-reading models.
We observe a strong correlation between these theories from cognitive psychology and our modeling approach.
arXiv Detail & Related papers (2021-10-13T05:30:50Z)
- SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze the input acoustic signal, understand its linguistic content, and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)
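The brain-tuning entry above mentions inducing brain-relevant bias via fine-tuning with fMRI recordings. The sketch below shows one plausible shape such a setup could take: a linear voxel-prediction readout attached to a stand-in speech encoder, trained with a brain-prediction loss. The encoder, pooling scheme, loss, and data are all placeholder assumptions, not the cited paper's implementation.

```python
# Hypothetical sketch of "brain-tuning": fine-tune a speech encoder so a linear
# readout of its hidden states also predicts (placeholder) fMRI responses.
# The stand-in encoder and all hyperparameters are assumptions for illustration.
import torch
import torch.nn as nn

class BrainTunedEncoder(nn.Module):
    def __init__(self, encoder, hidden_dim, n_voxels):
        super().__init__()
        self.encoder = encoder                            # e.g. a pretrained SSL speech model
        self.readout = nn.Linear(hidden_dim, n_voxels)    # maps pooled states to voxels

    def forward(self, speech_features):
        hidden = self.encoder(speech_features)            # (batch, time, hidden_dim)
        pooled = hidden.mean(dim=1)                       # crude temporal pooling
        return self.readout(pooled)                       # predicted fMRI responses

# Placeholder encoder and data, just to show the shape of the training loop.
hidden_dim, n_voxels = 256, 1000
encoder = nn.Sequential(nn.Linear(80, hidden_dim), nn.GELU())  # stand-in for a real SSL model
model = BrainTunedEncoder(encoder, hidden_dim, n_voxels)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

speech = torch.randn(8, 50, 80)       # (batch, frames, mel features) placeholder input
fmri = torch.randn(8, n_voxels)       # (batch, voxels) placeholder responses

for step in range(10):
    pred = model(speech)
    loss = nn.functional.mse_loss(pred, fmri)   # brain-prediction loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```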
This list is automatically generated from the titles and abstracts of the papers on this site.