Inductive biases, pretraining and fine-tuning jointly account for brain
responses to speech
- URL: http://arxiv.org/abs/2103.01032v1
- Date: Thu, 25 Feb 2021 19:11:55 GMT
- Title: Inductive biases, pretraining and fine-tuning jointly account for brain
responses to speech
- Authors: Juliette Millet, Jean-Remi King
- Abstract summary: We compare five types of deep neural networks to human brain responses elicited by spoken sentences.
The differences in brain-similarity across networks revealed three main results.
- Score: 6.87854783185243
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Our ability to comprehend speech remains, to date, unrivaled by deep learning
models. This feat could result from the brain's ability to fine-tune generic
sound representations for speech-specific processes. To test this hypothesis,
we compare i) five types of deep neural networks to ii) human brain responses
elicited by spoken sentences and recorded in 102 Dutch subjects using
functional Magnetic Resonance Imaging (fMRI). Each network was either trained
on acoustic scene classification, trained on a speech-to-text task (based on
Bengali, English, or Dutch), or left untrained. The similarity between each model and the
brain is assessed by correlating their respective activations after an optimal
linear projection. The differences in brain-similarity across networks revealed
three main results. First, speech representations in the brain can be accounted
for by random deep networks. Second, learning to classify acoustic scenes leads
deep nets to increase their brain similarity. Third, learning to process
phonetically-related speech inputs (i.e., Dutch vs English) leads deep nets to
reach higher levels of brain-similarity than learning to process
phonetically-distant speech inputs (i.e., Dutch vs Bengali). Together, these
results suggest that the human brain fine-tunes its heavily-trained auditory
hierarchy to learn to process speech.
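As a rough illustration of the analysis described above, the sketch below fits a cross-validated linear projection (ridge regression) from network activations to fMRI voxel responses and scores the fit with voxel-wise correlations on held-out data. The synthetic data, dimensions, and regularization grid are placeholder assumptions for illustration, not the paper's actual settings.

```python
# Minimal sketch of a brain-similarity (linear encoding) analysis:
# map network activations to fMRI responses with ridge regression,
# then correlate predictions with held-out measurements.
# Data here are synthetic placeholders; the paper's preprocessing,
# regularization grid, and voxel selection are assumptions.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n_samples, n_features, n_voxels = 500, 128, 50      # time points, network units, voxels
X = rng.standard_normal((n_samples, n_features))     # network activations
Y = X @ rng.standard_normal((n_features, n_voxels)) * 0.1 \
    + rng.standard_normal((n_samples, n_voxels))     # noisy "fMRI" responses

def brain_similarity(X, Y, n_splits=5):
    """Mean voxel-wise Pearson correlation between predicted and held-out responses."""
    fold_scores = []
    for train, test in KFold(n_splits=n_splits).split(X):
        model = RidgeCV(alphas=np.logspace(-2, 4, 7)).fit(X[train], Y[train])
        Y_hat = model.predict(X[test])
        r = [np.corrcoef(Y_hat[:, v], Y[test][:, v])[0, 1] for v in range(Y.shape[1])]
        fold_scores.append(np.mean(r))
    return float(np.mean(fold_scores))

print(f"brain-similarity score: {brain_similarity(X, Y):.3f}")
```

Comparing this score across networks (random, scene-trained, speech-trained) is the kind of contrast the abstract describes.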
Related papers
- Improving semantic understanding in speech language models via brain-tuning [19.732593005537606]
Speech language models align with human brain responses to natural language to an impressive degree.
Current models rely heavily on low-level speech features, indicating they lack brain-relevant semantics.
We address this limitation by inducing brain-relevant bias directly into the models via fine-tuning with fMRI recordings.
arXiv Detail & Related papers (2024-10-11T20:06:21Z)
- SIFToM: Robust Spoken Instruction Following through Theory of Mind [51.326266354164716]
We present a cognitively inspired model, Speech Instruction Following through Theory of Mind (SIFToM), to enable robots to pragmatically follow human instructions under diverse speech conditions.
Results show that the SIFToM model outperforms state-of-the-art speech and language models, approaching human-level accuracy on challenging speech instruction following tasks.
arXiv Detail & Related papers (2024-09-17T02:36:10Z)
- Towards Decoding Brain Activity During Passive Listening of Speech [0.0]
We attempt to decode heard speech from intracranial electroencephalographic (iEEG) data using deep learning methods.
This approach diverges from the conventional focus on speech production and instead investigates neural representations of perceived speech.
Despite the approach not having achieved a breakthrough yet, the research sheds light on the potential of decoding neural activity during speech perception.
arXiv Detail & Related papers (2024-02-26T20:04:01Z)
- Do self-supervised speech and language models extract similar representations as human brain? [2.390915090736061]
Speech and language models trained through self-supervised learning (SSL) demonstrate strong alignment with brain activity during speech and language perception.
We evaluate the brain prediction performance of two representative SSL models, Wav2Vec2.0 and GPT-2.
arXiv Detail & Related papers (2023-10-07T01:39:56Z)
- Decoding speech perception from non-invasive brain recordings [48.46819575538446]
We introduce a model trained with contrastive-learning to decode self-supervised representations of perceived speech from non-invasive recordings.
Our model can identify, from 3 seconds of MEG signals, the corresponding speech segment with up to 41% accuracy out of more than 1,000 distinct possibilities.
arXiv Detail & Related papers (2022-08-25T10:01:43Z)
- Neural Language Models are not Born Equal to Fit Brain Data, but Training Helps [75.84770193489639]
We examine the impact of test loss, training corpus and model architecture on the prediction of functional Magnetic Resonance Imaging timecourses of participants listening to an audiobook.
We find that untrained versions of each model already explain a significant amount of signal in the brain by capturing similarity in brain responses across identical words.
We suggest good practices for future studies aiming at explaining the human language system using neural language models.
arXiv Detail & Related papers (2022-07-07T15:37:17Z)
- Toward a realistic model of speech processing in the brain with self-supervised learning [67.7130239674153]
Self-supervised algorithms trained on the raw waveform constitute a promising candidate.
We show that Wav2Vec 2.0 learns brain-like representations with as little as 600 hours of unlabelled speech.
arXiv Detail & Related papers (2022-06-03T17:01:46Z)
- Model-based analysis of brain activity reveals the hierarchy of language in 305 subjects [82.81964713263483]
A popular approach to decompose the neural bases of language consists in correlating, across individuals, the brain responses to different stimuli.
Here, we show that a model-based approach can reach equivalent results within subjects exposed to natural stimuli.
arXiv Detail & Related papers (2021-10-12T15:30:21Z)
- "Notic My Speech" -- Blending Speech Patterns With Multimedia [65.91370924641862]
We propose a view-temporal attention mechanism to model both the view dependence and the visemic importance in speech recognition and understanding.
Our proposed method outperformed the existing work by 4.99% in terms of the viseme error rate.
We show that there is a strong correlation between our model's understanding of multi-view speech and human perception.
arXiv Detail & Related papers (2020-06-12T06:51:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.