Neural Decoding of Overt Speech from ECoG Using Vision Transformers and Contrastive Representation Learning
- URL: http://arxiv.org/abs/2512.04618v1
- Date: Thu, 04 Dec 2025 09:47:15 GMT
- Title: Neural Decoding of Overt Speech from ECoG Using Vision Transformers and Contrastive Representation Learning
- Authors: Mohamed Baha Ben Ticha, Xingchen Ran, Guillaume Saldanha, Gaël Le Godais, Philémon Roussel, Marc Aubert, Amina Fontanell, Thomas Costecalde, Lucas Struber, Serpil Karakas, Shaomin Zhang, Philippe Kahane, Guillaume Charvet, Stéphan Chabardès, Blaise Yvert
- Abstract summary: Speech Brain Computer Interfaces offer promising solutions to people with severe paralysis unable to communicate. Recent studies have demonstrated convincing reconstruction of intelligible speech from surface electrocorticographic (ECoG) or intracortical recordings. We present an offline speech decoding pipeline based on an encoder-decoder deep neural architecture, integrating Vision Transformers and contrastive learning.
- Score: 1.58476321728042
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speech Brain Computer Interfaces (BCIs) offer promising solutions for people with severe paralysis who are unable to communicate. A number of recent studies have demonstrated convincing reconstruction of intelligible speech from surface electrocorticographic (ECoG) or intracortical recordings by predicting a series of phonemes or words and using downstream language models to obtain meaningful sentences. A current challenge is to reconstruct speech in a streaming mode by directly regressing cortical signals into acoustic speech. While this has been achieved recently using intracortical data, further work is needed to obtain comparable results with surface ECoG recordings. In particular, optimizing neural decoders becomes critical in this case. Here we present an offline speech decoding pipeline based on an encoder-decoder deep neural architecture, integrating Vision Transformers and contrastive learning to enhance the direct regression of speech from ECoG signals. The approach is evaluated on two datasets, one obtained with clinical subdural electrodes in an epileptic patient, and another obtained with the fully implantable WIMAGINE epidural system in a participant of a motor BCI trial. To our knowledge, this represents the first attempt to decode speech from a fully implantable and wireless epidural recording system, offering perspectives for long-term use.
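The abstract names the architectural ingredients but not the implementation. As a rough illustration only, the following is a minimal sketch of the kind of pipeline described: a Vision-Transformer-style encoder over ECoG time-frequency patches, a decoder regressing mel-spectrogram frames, and an InfoNCE-style contrastive term aligning neural and speech embeddings. All shapes, hyperparameters, and names below (channel counts, patch sizes, loss weights) are illustrative assumptions, not the authors' code.

```python
# Hypothetical sketch of a ViT encoder + regression decoder + contrastive loss
# for ECoG-to-speech decoding. Shapes and hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ECoGViTEncoder(nn.Module):
    def __init__(self, n_channels=64, n_freq_bands=8, patch_t=10, d_model=256, n_layers=4):
        super().__init__()
        # Each "patch" is a short time slice of the multi-channel feature map.
        self.patch = nn.Linear(n_channels * n_freq_bands * patch_t, d_model)
        self.patch_t = patch_t
        self.pos = nn.Parameter(torch.randn(1, 512, d_model) * 0.02)  # learned positions
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, x):              # x: (B, T, n_channels, n_freq_bands)
        B, T, C, Fb = x.shape
        T_p = T // self.patch_t        # number of temporal patches
        x = x[:, :T_p * self.patch_t].reshape(B, T_p, self.patch_t * C * Fb)
        h = self.patch(x) + self.pos[:, :T_p]
        return self.encoder(h)         # (B, T_p, d_model)

class SpeechDecoder(nn.Module):
    def __init__(self, d_model=256, n_mels=80, upsample=10):
        super().__init__()
        self.upsample = upsample
        self.out = nn.Linear(d_model, n_mels * upsample)  # one patch -> several mel frames

    def forward(self, h):              # h: (B, T_p, d_model)
        B, T_p, _ = h.shape
        return self.out(h).reshape(B, T_p * self.upsample, -1)

def info_nce(z_neural, z_speech, temperature=0.1):
    # Symmetric contrastive loss between time-pooled neural and speech embeddings.
    z_n = F.normalize(z_neural.mean(dim=1), dim=-1)   # (B, d)
    z_s = F.normalize(z_speech.mean(dim=1), dim=-1)   # (B, d)
    logits = z_n @ z_s.t() / temperature              # (B, B) similarity matrix
    target = torch.arange(len(logits), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, target) + F.cross_entropy(logits.t(), target))

enc, dec = ECoGViTEncoder(), SpeechDecoder()
speech_proj = nn.Linear(80, 256)                      # embeds target mel for the contrastive term
ecog = torch.randn(4, 200, 64, 8)                     # (batch, time, channels, bands)
mel_target = torch.randn(4, 200, 80)                  # aligned mel-spectrogram frames
h = enc(ecog)
mel_pred = dec(h)
loss = F.mse_loss(mel_pred, mel_target) + 0.1 * info_nce(h, speech_proj(mel_target))
loss.backward()
```

In this sketch the contrastive term acts as an auxiliary objective alongside frame-level regression; how the paper actually combines and weights these objectives is not specified in the abstract.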
Related papers
- MEGState: Phoneme Decoding from Magnetoencephalography Signals [15.480040965084214]
We introduce MEGState, a novel architecture for phoneme decoding from MEG signals. MEGState captures fine-grained cortical responses evoked by auditory stimuli. These findings highlight the potential of MEG-based phoneme decoding as a scalable pathway toward non-invasive brain-computer interfaces for speech.
arXiv Detail & Related papers (2025-12-19T13:02:31Z)
- Reconstructing Unseen Sentences from Speech-related Biosignals for Open-vocabulary Neural Communication [45.424817836500175]
This study investigates the potential of speech synthesis for previously unseen sentences across various speech modes. We leverage phoneme-level information extracted from high-density electroencephalography (EEG) signals, both independently and in conjunction with electromyography (EMG) signals. Our findings underscore the feasibility of biosignal-based sentence-level speech synthesis for reconstructing unseen sentences.
arXiv Detail & Related papers (2025-10-31T07:31:13Z)
- sEEG-based Encoding for Sentence Retrieval: A Contrastive Learning Approach to Brain-Language Alignment [8.466223794246261]
We present SSENSE, a contrastive learning framework that projects single-subject stereo-electroencephalography (sEEG) signals into the sentence embedding space of a frozen CLIP model. We evaluate our method on time-aligned sEEG and spoken transcripts from a naturalistic movie-watching dataset.
arXiv Detail & Related papers (2025-04-20T03:01:42Z)
- Decoding Covert Speech from EEG Using a Functional Areas Spatio-Temporal Transformer [9.914613096064848]
Decoding speech from electroencephalogram (EEG) signals is challenging due to a limited understanding of neural pronunciation mapping. In this study, we developed a large-scale multi-utterance speech EEG dataset from 57 right-handed native English-speaking subjects. Our results reveal distinct speech neural features through visualization of FAST-generated activation maps across frontal and temporal brain regions.
arXiv Detail & Related papers (2025-04-02T10:38:08Z)
- Enhancing Listened Speech Decoding from EEG via Parallel Phoneme Sequence Prediction [36.38186261968484]
We propose a novel approach to enhance listened speech decoding from electroencephalography (EEG) signals. We use an auxiliary phoneme predictor that simultaneously decodes textual phoneme sequences.
arXiv Detail & Related papers (2025-01-08T21:11:35Z)
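The auxiliary phoneme predictor in the entry above is an instance of a standard multi-task pattern: a shared neural encoder feeds both an acoustic regression head and a phoneme-sequence head. Below is a minimal, purely illustrative sketch of that pattern (not the paper's implementation), assuming a recurrent EEG encoder and a CTC-trained phoneme head; all sizes and names are hypothetical.

```python
# Hypothetical multi-task sketch: speech regression + auxiliary CTC phoneme loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskEEGDecoder(nn.Module):
    def __init__(self, n_eeg=128, d=192, n_mels=80, n_phonemes=40):
        super().__init__()
        self.encoder = nn.GRU(n_eeg, d, num_layers=2, batch_first=True, bidirectional=True)
        self.speech_head = nn.Linear(2 * d, n_mels)       # regresses acoustic frames
        self.phoneme_head = nn.Linear(2 * d, n_phonemes)  # index 0 reserved for CTC blank

    def forward(self, eeg):                               # eeg: (B, T, n_eeg)
        h, _ = self.encoder(eeg)
        return self.speech_head(h), self.phoneme_head(h).log_softmax(-1)

model = MultiTaskEEGDecoder()
eeg = torch.randn(2, 150, 128)
mel_target = torch.randn(2, 150, 80)
phonemes = torch.randint(1, 40, (2, 30))                  # target phoneme indices
mel_pred, log_probs = model(eeg)
ctc = nn.CTCLoss(blank=0)
aux = ctc(log_probs.transpose(0, 1),                      # CTC expects (T, B, C)
          phonemes,
          input_lengths=torch.full((2,), 150, dtype=torch.long),
          target_lengths=torch.full((2,), 30, dtype=torch.long))
loss = F.mse_loss(mel_pred, mel_target) + 0.5 * aux       # joint objective
loss.backward()
```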
- CoLM-DSR: Leveraging Neural Codec Language Modeling for Multi-Modal Dysarthric Speech Reconstruction [61.067153685104394]
Dysarthric speech reconstruction (DSR) aims to transform dysarthric speech into normal speech.
Existing approaches still suffer from low speaker similarity and poor prosody naturalness.
We propose a multi-modal DSR model that leverages neural codec language modeling to improve reconstruction results.
arXiv Detail & Related papers (2024-06-12T15:42:21Z)
- Surrogate Gradient Spiking Neural Networks as Encoders for Large Vocabulary Continuous Speech Recognition [91.39701446828144]
We show that spiking neural networks can be trained like standard recurrent neural networks using the surrogate gradient method.
They have shown promising results on speech command recognition tasks.
In contrast to their recurrent non-spiking counterparts, they show robustness to exploding gradient problems without the need to use gates.
arXiv Detail & Related papers (2022-12-01T12:36:26Z)
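The surrogate gradient method mentioned in the entry above has a compact generic formulation: spike with a hard threshold in the forward pass, but backpropagate through a smooth surrogate derivative so ordinary gradient descent applies. The sketch below is a textbook fast-sigmoid variant of that trick, not the paper's specific encoder; the leaky integrate-and-fire layer and its parameters are illustrative assumptions.

```python
# Generic surrogate-gradient spiking sketch (not the paper's architecture).
import torch

class SpikeFunction(torch.autograd.Function):
    @staticmethod
    def forward(ctx, membrane_potential):
        ctx.save_for_backward(membrane_potential)
        return (membrane_potential > 0).float()           # hard Heaviside spike

    @staticmethod
    def backward(ctx, grad_output):
        (v,) = ctx.saved_tensors
        # Fast-sigmoid surrogate derivative replaces the Heaviside's zero gradient.
        surrogate = 1.0 / (1.0 + 10.0 * v.abs()) ** 2
        return grad_output * surrogate

spike = SpikeFunction.apply

def lif_layer(x, weight, decay=0.9, threshold=1.0):
    # Leaky integrate-and-fire over a sequence, x: (T, B, in), weight: (out, in).
    mem = torch.zeros(x.shape[1], weight.shape[0])
    out = []
    for t in range(x.shape[0]):
        mem = decay * mem + x[t] @ weight.t()             # leaky integration
        s = spike(mem - threshold)                        # surrogate-grad spikes
        mem = mem - s * threshold                         # soft reset after spiking
        out.append(s)
    return torch.stack(out)

x = torch.randn(50, 4, 32)                                # 50 time steps, batch 4
w = torch.randn(64, 32, requires_grad=True)
spikes = lif_layer(x, w)
spikes.sum().backward()                                   # gradients flow via the surrogate
```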
- Decoding speech perception from non-invasive brain recordings [48.46819575538446]
We introduce a model trained with contrastive learning to decode self-supervised representations of perceived speech from non-invasive recordings.
Our model can identify, from 3 seconds of MEG signals, the corresponding speech segment with up to 41% accuracy out of more than 1,000 distinct possibilities.
arXiv Detail & Related papers (2022-08-25T10:01:43Z)
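The 41%-out-of-1,000 figure in the entry above is a retrieval metric: the embedding decoded from a MEG window must rank the true speech segment first among all candidates. A minimal sketch of such an evaluation follows; the embedding dimension and candidate count are assumptions.

```python
# Illustrative segment-retrieval evaluation via cosine similarity.
import torch
import torch.nn.functional as F

def retrieval_accuracy(meg_emb, speech_emb):
    # meg_emb, speech_emb: (N, d); row i of each corresponds to the same segment.
    sims = F.normalize(meg_emb, dim=-1) @ F.normalize(speech_emb, dim=-1).t()  # (N, N)
    predicted = sims.argmax(dim=-1)                       # best-matching candidate per window
    return (predicted == torch.arange(len(sims))).float().mean().item()

acc = retrieval_accuracy(torch.randn(1000, 256), torch.randn(1000, 256))
print(f"top-1 accuracy over 1000 candidates: {acc:.3%}")  # ~0.1% for random embeddings
```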
- Synthesizing Speech from Intracranial Depth Electrodes using an Encoder-Decoder Framework [1.623136488969658]
Speech neuroprostheses have the potential to enable communication for people with dysarthria or anarthria.
Recent advances have demonstrated high-quality text decoding and speech synthesis from electrocorticographic grids placed on the cortical surface.
arXiv Detail & Related papers (2021-11-02T09:43:21Z)
- DeepA: A Deep Neural Analyzer For Speech And Singing Vocoding [71.73405116189531]
We propose a neural vocoder that extracts F0 and timbre/aperiodicity encodings from the input speech, emulating those defined in conventional vocoders.
As the deep neural analyzer is learnable, it is expected to be more accurate for signal reconstruction and manipulation, and generalizable from speech to singing.
arXiv Detail & Related papers (2021-10-13T01:39:57Z)
- End-to-End Video-To-Speech Synthesis using Generative Adversarial Networks [54.43697805589634]
We propose a new end-to-end video-to-speech model based on Generative Adversarial Networks (GANs).
Our model consists of an encoder-decoder architecture that receives raw video as input and generates speech.
We show that this model is able to reconstruct speech with remarkable realism for constrained datasets such as GRID.
arXiv Detail & Related papers (2021-04-27T17:12:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.