Decoding Covert Speech from EEG Using a Functional Areas Spatio-Temporal Transformer
- URL: http://arxiv.org/abs/2504.03762v1
- Date: Wed, 02 Apr 2025 10:38:08 GMT
- Title: Decoding Covert Speech from EEG Using a Functional Areas Spatio-Temporal Transformer
- Authors: Muyun Jiang, Yi Ding, Wei Zhang, Kok Ann Colin Teo, LaiGuan Fong, Shuailei Zhang, Zhiwei Guo, Chenyu Liu, Raghavan Bhuvanakantham, Wei Khang Jeremy Sim, Chuan Huat Vince Foo, Rong Hui Jonathan Chua, Parasuraman Padmanabhan, Victoria Leong, Jia Lu, Balazs Gulyas, Cuntai Guan,
- Abstract summary: Decoding speech from electroencephalogram (EEG) is challenging due to a limited understanding of neural pronunciation mapping.<n>In this study, we developed a large-scale multi-utterance speech EEG from 57 right-handed native English-speaking subjects.<n>Our results reveal distinct speech neural features by the visualization of FAST-generated activation maps across frontal and temporal brain regions.
- Score: 9.914613096064848
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Covert speech involves imagining speaking without audible sound or any movements. Decoding covert speech from electroencephalogram (EEG) is challenging due to a limited understanding of neural pronunciation mapping and the low signal-to-noise ratio of the signal. In this study, we developed a large-scale multi-utterance speech EEG dataset from 57 right-handed native English-speaking subjects, each performing covert and overt speech tasks by repeating the same word in five utterances within a ten-second duration. Given the spatio-temporal nature of the neural activation process during speech pronunciation, we developed a Functional Areas Spatio-temporal Transformer (FAST), an effective framework for converting EEG signals into tokens and utilizing transformer architecture for sequence encoding. Our results reveal distinct and interpretable speech neural features by the visualization of FAST-generated activation maps across frontal and temporal brain regions with each word being covertly spoken, providing new insights into the discriminative features of the neural representation of covert speech. This is the first report of such a study, which provides interpretable evidence for speech decoding from EEG. The code for this work has been made public at https://github.com/Jiang-Muyun/FAST
Related papers
- sEEG-based Encoding for Sentence Retrieval: A Contrastive Learning Approach to Brain-Language Alignment [8.466223794246261]
We present SSENSE, a contrastive learning framework that projects single-subject stereo-electroencephalography (sEEG) signals into the sentence embedding space of a frozen CLIP model.
We evaluate our method on time-aligned sEEG and spoken transcripts from a naturalistic movie-watching dataset.
arXiv Detail & Related papers (2025-04-20T03:01:42Z) - Enhancing Listened Speech Decoding from EEG via Parallel Phoneme Sequence Prediction [36.38186261968484]
We propose a novel approach to enhance listened speech decoding from electroencephalography (EEG) signals.<n>We use an auxiliary phoneme predictor that simultaneously decodes textual phoneme sequences.
arXiv Detail & Related papers (2025-01-08T21:11:35Z) - BrainECHO: Semantic Brain Signal Decoding through Vector-Quantized Spectrogram Reconstruction for Whisper-Enhanced Text Generation [29.78480739360263]
We propose a new multi-stage strategy for semantic brain signal decoding via vEctor-quantized speCtrogram reconstruction.
BrainECHO successively conducts: 1) autoencoding of the audio spectrogram; 2) Brain-audio latent space alignment; and 3) Semantic text generation via Whisper finetuning.
BrainECHO outperforms state-of-the-art methods under the same data split settings on two widely accepted resources.
arXiv Detail & Related papers (2024-10-19T04:29:03Z) - Generative Pre-trained Speech Language Model with Efficient Hierarchical Transformer [39.31849739010572]
We introduce textbfGenerative textbfPre-trained textbfSpeech textbfTransformer (GPST)
GPST is a hierarchical transformer designed for efficient speech language modeling.
arXiv Detail & Related papers (2024-06-03T04:16:30Z) - SpeechX: Neural Codec Language Model as a Versatile Speech Transformer [57.82364057872905]
SpeechX is a versatile speech generation model capable of zero-shot TTS and various speech transformation tasks.
Experimental results show SpeechX's efficacy in various tasks, including zero-shot TTS, noise suppression, target speaker extraction, speech removal, and speech editing with or without background noise.
arXiv Detail & Related papers (2023-08-14T01:01:19Z) - LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders [53.30016986953206]
We propose LA-VocE, a new two-stage approach that predicts mel-spectrograms from noisy audio-visual speech via a transformer-based architecture.
We train and evaluate our framework on thousands of speakers and 11+ different languages, and study our model's ability to adapt to different levels of background noise and speech interference.
arXiv Detail & Related papers (2022-11-20T15:27:55Z) - Decoding speech perception from non-invasive brain recordings [48.46819575538446]
We introduce a model trained with contrastive-learning to decode self-supervised representations of perceived speech from non-invasive recordings.
Our model can identify, from 3 seconds of MEG signals, the corresponding speech segment with up to 41% accuracy out of more than 1,000 distinct possibilities.
arXiv Detail & Related papers (2022-08-25T10:01:43Z) - Toward a realistic model of speech processing in the brain with
self-supervised learning [67.7130239674153]
Self-supervised algorithms trained on the raw waveform constitute a promising candidate.
We show that Wav2Vec 2.0 learns brain-like representations with as little as 600 hours of unlabelled speech.
arXiv Detail & Related papers (2022-06-03T17:01:46Z) - TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech shows a significant improvement in inference latency, enabling speedup up to 21.4x than autoregressive technique.
arXiv Detail & Related papers (2022-05-25T06:34:14Z) - Textless Speech Emotion Conversion using Decomposed and Discrete
Representations [49.55101900501656]
We decompose speech into discrete and disentangled learned representations, consisting of content units, F0, speaker, and emotion.
First, we modify the speech content by translating the content units to a target emotion, and then predict the prosodic features based on these units.
Finally, the speech waveform is generated by feeding the predicted representations into a neural vocoder.
arXiv Detail & Related papers (2021-11-14T18:16:42Z) - Silent Speech and Emotion Recognition from Vocal Tract Shape Dynamics in
Real-Time MRI [9.614694312155798]
We propose a novel deep neural network-based learning framework that understands acoustic information in the variable-length sequence of vocal tract shaping during speech production.
The proposed framework comprises of convolutions, recurrent network, and connectionist temporal classification loss, trained entirely end-to-end.
To the best of our knowledge, this is the first study that demonstrates the recognition of entire spoken sentence based on an individual's arttory motions captured by rtMRI video.
arXiv Detail & Related papers (2021-06-16T11:20:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.