Teaching Wav2Vec2 the Language of the Brain
- URL: http://arxiv.org/abs/2501.09459v1
- Date: Thu, 16 Jan 2025 10:37:07 GMT
- Title: Teaching Wav2Vec2 the Language of the Brain
- Authors: Tobias Fiedler, Leon Hermann, Florian Müller, Sarel Cohen, Peter Chin, Tobias Friedrich, Eilon Vaadia
- Abstract summary: We show that patterns learned by Wav2Vec2 are transferable to brain data.
We compare three training regimes, each for 45 different BFE architectures: full fine-tuning of Wav2Vec2 with pre-trained weights, training "from scratch" without pre-trained weights, and freezing a pre-trained Wav2Vec2 while training only the BFE.
- Score: 13.094509587996082
- Abstract: The decoding of continuously spoken speech from neuronal activity has the potential to become an important clinical solution for paralyzed patients. Deep Learning Brain Computer Interfaces (BCIs) have recently successfully mapped neuronal activity to text in subjects who attempted to formulate speech. However, only small BCI datasets are available. In contrast, labeled data and pre-trained models for the closely related task of speech recognition from audio are widely available. One such model is Wav2Vec2, which has been trained in a self-supervised fashion to create meaningful representations of speech audio data. In this study, we show that patterns learned by Wav2Vec2 are transferable to brain data. Specifically, we replace its audio feature extractor with an untrained Brain Feature Extractor (BFE) model. We then compare three training regimes, each for 45 different BFE architectures: full fine-tuning with pre-trained Wav2Vec2 weights, training "from scratch" without pre-trained weights, and freezing a pre-trained Wav2Vec2 while training only the BFE. Across these experiments, the best run comes from full fine-tuning with pre-trained weights, achieving a Character Error Rate (CER) of 18.54%, outperforming the best from-scratch run by 20.46 percentage points and the best frozen-Wav2Vec2 run by 15.92 percentage points. These results indicate that knowledge transfer from audio speech recognition to brain decoding is possible and significantly improves brain decoding performance for the same architectures. Related source code is available at https://github.com/tfiedlerdev/Wav2Vec2ForBrain.
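To make the setup concrete, below is a minimal, hypothetical sketch of the model surgery and the three training regimes, assuming the HuggingFace transformers implementation of Wav2Vec2. The `BrainFeatureExtractor` shown here is an illustrative stand-in (its channel count and layer sizes are assumptions, not one of the paper's 45 architectures); the authors' actual code is in the linked repository.

```python
# Hypothetical sketch: swap Wav2Vec2's audio CNN for a Brain Feature
# Extractor (BFE) and set up the paper's three training regimes.
# The BFE below is an illustrative stand-in, NOT one of the 45
# architectures evaluated in the paper.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Config, Wav2Vec2ForCTC


class BrainFeatureExtractor(nn.Module):
    """Maps multi-channel neural recordings (batch, channels, time) to the
    (batch, 512, frames) shape that Wav2Vec2's audio feature encoder emits."""

    def __init__(self, n_channels: int = 256, feat_dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_channels, feat_dim, kernel_size=5, stride=2, padding=2),
            nn.GELU(),
            nn.Conv1d(feat_dim, feat_dim, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


# Regime 1 (best in the paper): start from pre-trained weights, fine-tune everything.
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
model.wav2vec2.feature_extractor = BrainFeatureExtractor()

# Regime 2 ("from scratch"): same architecture, randomly initialized instead.
scratch = Wav2Vec2ForCTC(Wav2Vec2Config())
scratch.wav2vec2.feature_extractor = BrainFeatureExtractor()

# Regime 3 (frozen Wav2Vec2): keep pre-trained weights fixed, train only the BFE.
frozen = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
frozen.wav2vec2.feature_extractor = BrainFeatureExtractor()
for name, param in frozen.named_parameters():
    param.requires_grad = "feature_extractor" in name


def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: character-level edit distance / len(ref)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1] / max(len(ref), 1)
```

Note that a swap like this only changes the input stack: downstream utilities that infer frame counts from raw-audio lengths (e.g. attention-mask computation) would also need adapting, which is one reason the linked repository remains the authoritative reference.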
Related papers
- Brain-to-Text Benchmark '24: Lessons Learned [30.41641771704316]
Speech brain-computer interfaces aim to decipher what a person is trying to say from neural activity alone.
The Brain-to-Text Benchmark '24 fosters the advancement of decoding algorithms that convert neural activity to text.
The benchmark will remain open indefinitely to support further work towards increasing the accuracy of brain-to-text algorithms.
arXiv Detail & Related papers (2024-12-23T02:44:35Z)
- Improving Speech Decoding from ECoG with Self-Supervised Pretraining [0.0]
We reengineer a self-supervised, fully convolutional model that learns latent representations of audio using a noise-contrastive loss.
We train this model on unlabelled electrocorticographic (ECoG) recordings.
We then use it to transform ECoG from labeled speech sessions into wav2vec's representation space, before finally training a supervised encoder-decoder to map these representations to text.
arXiv Detail & Related papers (2024-05-28T22:48:53Z)
- AV2Wav: Diffusion-Based Re-synthesis from Continuous Self-supervised Features for Audio-Visual Speech Enhancement [18.193191170754744]
We introduce AV2Wav, a re-synthesis-based audio-visual speech enhancement approach.
We use continuous rather than discrete representations to retain prosody and speaker information.
Our approach outperforms a masking-based baseline in terms of both automatic metrics and a human listening test.
arXiv Detail & Related papers (2023-09-14T21:07:53Z)
- Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning.
We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z)
- Jointly Learning Visual and Auditory Speech Representations from Raw Data [108.68531445641769]
RAVEn is a self-supervised multi-modal approach to jointly learn visual and auditory speech representations.
Our design is asymmetric with respect to the two modalities' pretext tasks, driven by the inherent differences between video and audio.
RAVEn surpasses all self-supervised methods on visual speech recognition.
arXiv Detail & Related papers (2022-12-12T21:04:06Z)
- Decoding speech perception from non-invasive brain recordings [48.46819575538446]
We introduce a model trained with contrastive-learning to decode self-supervised representations of perceived speech from non-invasive recordings.
Our model can identify, from 3 seconds of MEG signals, the corresponding speech segment with up to 41% accuracy out of more than 1,000 distinct possibilities.
arXiv Detail & Related papers (2022-08-25T10:01:43Z)
- Deep Feature Learning for Medical Acoustics [78.56998585396421]
The purpose of this paper is to compare different learnables in medical acoustics tasks.
A framework has been implemented to classify human respiratory sounds and heartbeats in two categories, i.e. healthy or affected by pathologies.
arXiv Detail & Related papers (2022-08-05T10:39:37Z)
- Toward a realistic model of speech processing in the brain with self-supervised learning [67.7130239674153]
Self-supervised algorithms trained on the raw waveform constitute a promising candidate.
We show that Wav2Vec 2.0 learns brain-like representations with as little as 600 hours of unlabelled speech.
arXiv Detail & Related papers (2022-06-03T17:01:46Z)
- Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data [145.95460945321253]
We introduce two pre-training tasks for the encoder-decoder network using acoustic units, i.e., pseudo codes.
The proposed Speech2C can relatively reduce the word error rate (WER) by 19.2% over the method without decoder pre-training.
arXiv Detail & Related papers (2022-03-31T15:33:56Z)
- Don't stop the training: continuously-updating self-supervised algorithms best account for auditory responses in the cortex [1.7725414095035827]
We analyze the brain responses of two ferret auditory cortices recorded with functional UltraSound imaging (fUS).
We compare these brain responses to the activations of Wav2vec 2.0, a self-supervised neural network pretrained with 960 hours of speech.
arXiv Detail & Related papers (2022-02-15T10:12:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.