Toward Fully-End-to-End Listened Speech Decoding from EEG Signals
- URL: http://arxiv.org/abs/2406.08644v1
- Date: Wed, 12 Jun 2024 21:08:12 GMT
- Title: Toward Fully-End-to-End Listened Speech Decoding from EEG Signals
- Authors: Jihwan Lee, Aditya Kommineni, Tiantian Feng, Kleanthis Avramidis, Xuan Shi, Sudarsana Kadiri, Shrikanth Narayanan
- Abstract summary: We propose FESDE, a novel framework for Fully-End-to-end Speech Decoding from EEG signals.
The proposed method consists of an EEG module and a speech module along with a connector.
A fine-grained phoneme analysis is conducted to unveil model characteristics of speech decoding.
- Score: 29.548052495254257
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speech decoding from EEG signals is a challenging task, where brain activity is modeled to estimate salient characteristics of acoustic stimuli. We propose FESDE, a novel framework for Fully-End-to-end Speech Decoding from EEG signals. Our approach aims to directly reconstruct listened speech waveforms given EEG signals, where no intermediate acoustic feature processing step is required. The proposed method consists of an EEG module and a speech module along with a connector. The EEG module learns to better represent EEG signals, while the speech module generates speech waveforms from model representations. The connector learns to bridge the distributions of the latent spaces of EEG and speech. The proposed framework is both simple and efficient, by allowing single-step inference, and outperforms prior works on objective metrics. A fine-grained phoneme analysis is conducted to unveil model characteristics of speech decoding. The source code is available here: github.com/lee-jhwn/fesde.
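To make the three-module design concrete, here is a minimal PyTorch sketch of how an EEG module, a connector, and a speech module could be wired for single-step waveform generation. Layer choices, dimensions, and names are illustrative assumptions, not the released fesde implementation.

```python
import torch
import torch.nn as nn

class FESDESketch(nn.Module):
    """Illustrative three-part pipeline: EEG module -> connector -> speech module."""
    def __init__(self, n_eeg_channels=64, latent_dim=256, hop=256):
        super().__init__()
        # EEG module: learns a latent representation of the EEG time series
        self.eeg_module = nn.Sequential(
            nn.Conv1d(n_eeg_channels, latent_dim, kernel_size=7, padding=3),
            nn.GELU(),
            nn.Conv1d(latent_dim, latent_dim, kernel_size=7, padding=3),
        )
        # Connector: bridges the EEG latent space to the speech latent space
        self.connector = nn.Conv1d(latent_dim, latent_dim, kernel_size=1)
        # Speech module: upsamples latents directly to a waveform
        # (a HiFi-GAN-style generator stack would go here)
        self.speech_module = nn.Sequential(
            nn.ConvTranspose1d(latent_dim, 64, kernel_size=hop * 2,
                               stride=hop, padding=hop // 2),
            nn.GELU(),
            nn.Conv1d(64, 1, kernel_size=7, padding=3),
            nn.Tanh(),
        )

    def forward(self, eeg):                  # eeg: (batch, channels, time)
        z_eeg = self.eeg_module(eeg)         # EEG latents
        z_speech = self.connector(z_eeg)     # mapped into the speech latent space
        return self.speech_module(z_speech)  # (batch, 1, samples) waveform

# Single-step inference: EEG in, waveform out, no intermediate acoustic features.
wav = FESDESketch()(torch.randn(2, 64, 100))
```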
Related papers
- VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing [81.32613443072441]
For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired.
We propose a method called Quantized Contrastive Token-Acoustic Pre-training (VQ-CTAP), which uses the cross-modal sequence transcoder to bring text and speech into a joint space.
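The "quantized" part of such a scheme can be pictured as a codebook lookup on frame-level features; a minimal sketch, where the codebook size and feature dimension are assumptions:

```python
import torch

def vector_quantize(frames, codebook):
    """Snap each frame-level feature to its nearest codebook entry.
    frames: (T, D) continuous features; codebook: (K, D) learned codes."""
    dists = torch.cdist(frames, codebook)  # (T, K) pairwise distances
    ids = dists.argmin(dim=1)              # one discrete token id per frame
    return codebook[ids], ids              # quantized features + token ids

codebook = torch.randn(512, 256)           # K=512 codes, D=256 (illustrative)
quant, ids = vector_quantize(torch.randn(100, 256), codebook)
```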
arXiv Detail & Related papers (2024-08-11T12:24:23Z)
- DiscreteSLU: A Large Language Model with Self-Supervised Discrete Speech Units for Spoken Language Understanding [51.32965203977845]
We propose the use of discrete speech units (DSU) instead of continuous-valued speech encoder outputs.
The proposed model shows robust performance on speech inputs from seen/unseen domains and instruction-following capability in spoken question answering.
Our findings suggest that the ASR task and datasets are not crucial in instruction-tuning for spoken question answering tasks.
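Discrete speech units are commonly obtained by clustering self-supervised encoder features; a hedged sketch using k-means, where the feature source and cluster count are assumptions rather than the paper's recipe:

```python
import numpy as np
from sklearn.cluster import KMeans

# feats: (n_frames, dim) features from a self-supervised speech encoder
# (e.g. an intermediate HuBERT layer); random data stands in here.
feats = np.random.randn(5000, 768).astype(np.float32)

kmeans = KMeans(n_clusters=100, n_init=10).fit(feats)  # 100 units is a common choice
dsu = kmeans.predict(feats)                            # one discrete unit id per frame
# `dsu` can now be fed to an LLM like any other token sequence.
```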
arXiv Detail & Related papers (2024-06-13T17:28:13Z)
- One model to rule them all? Towards End-to-End Joint Speaker Diarization and Speech Recognition [50.055765860343286]
This paper presents a novel framework for joint speaker diarization and automatic speech recognition.
The framework, named SLIDAR, can process arbitrary length inputs and can handle any number of speakers.
Experiments performed on monaural recordings from the AMI corpus confirm the effectiveness of the method in both close-talk and far-field speech scenarios.
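Handling arbitrary-length recordings typically comes down to sliding-window processing with overlap; a minimal chunking sketch, with window and hop sizes as illustrative assumptions:

```python
def sliding_windows(wav, sr=16000, win_s=30.0, hop_s=15.0):
    """Yield overlapping fixed-length chunks so a fixed-context model
    can cover an arbitrarily long recording."""
    win, hop = int(win_s * sr), int(hop_s * sr)
    for start in range(0, max(len(wav) - win, 0) + 1, hop):
        yield start / sr, wav[start:start + win]  # (offset in seconds, chunk)

# Each chunk is decoded for "who spoke when" and "what was said",
# and per-chunk outputs are stitched back together via the time offsets.
```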
arXiv Detail & Related papers (2023-10-02T23:03:30Z)
- Learning Speech Representation From Contrastive Token-Acoustic Pretraining [57.08426714676043]
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space.
The proposed CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR.
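A joint space over paired frames is typically learned with a symmetric InfoNCE objective; a minimal sketch, with temperature and dimensions as assumptions:

```python
import torch
import torch.nn.functional as F

def frame_contrastive_loss(phoneme_emb, speech_emb, temperature=0.07):
    """Symmetric InfoNCE: matching phoneme/speech frames are positives,
    all other frames in the batch serve as negatives."""
    p = F.normalize(phoneme_emb, dim=-1)   # (N, D)
    s = F.normalize(speech_emb, dim=-1)    # (N, D)
    logits = p @ s.t() / temperature       # (N, N) similarity matrix
    targets = torch.arange(p.size(0))      # positives lie on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

loss = frame_contrastive_loss(torch.randn(32, 256), torch.randn(32, 256))
```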
arXiv Detail & Related papers (2023-09-01T12:35:43Z)
- Diff-E: Diffusion-based Learning for Decoding Imagined Speech EEG [17.96977778655143]
We propose a novel method for decoding EEG signals for imagined speech using DDPMs and a conditional autoencoder named Diff-E.
Results indicate that Diff-E significantly improves the accuracy of decoding EEG signals for imagined speech compared to traditional machine learning techniques and baseline models.
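The DDPM side of such a model trains a network to predict the noise added to a latent at a random diffusion step; a minimal sketch of that training step, with the noise schedule and shapes as assumptions:

```python
import torch

def ddpm_training_step(model, z0, n_steps=1000):
    """One denoising-diffusion training step on latents z0: (batch, dim)."""
    betas = torch.linspace(1e-4, 0.02, n_steps)    # linear noise schedule
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative signal level
    t = torch.randint(0, n_steps, (z0.size(0),))   # random step per sample
    a = alpha_bar[t].unsqueeze(-1)
    noise = torch.randn_like(z0)
    z_t = a.sqrt() * z0 + (1 - a).sqrt() * noise   # forward (noising) process
    pred = model(z_t, t)                           # network predicts the noise
    return ((pred - noise) ** 2).mean()            # simple DDPM objective

# Dummy model stands in for the conditional denoiser.
loss = ddpm_training_step(lambda z, t: torch.zeros_like(z), torch.randn(8, 256))
```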
arXiv Detail & Related papers (2023-07-26T07:12:39Z)
- On decoder-only architecture for speech-to-text and large language model integration [59.49886892602309]
Speech-LLaMA is a novel approach that effectively incorporates acoustic information into text-based large language models.
We conduct experiments on multilingual speech-to-text translation tasks and demonstrate a significant improvement over strong baselines.
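The usual recipe for feeding acoustics to a decoder-only text LLM is to project speech-encoder frames into the LLM's embedding space and prepend them to the text tokens; a hedged sketch with illustrative dimensions:

```python
import torch
import torch.nn as nn

llm_dim = 4096                           # embedding width of the text LLM (assumed)
projector = nn.Linear(1024, llm_dim)     # maps speech-encoder frames into LLM space

speech_feats = torch.randn(1, 150, 1024) # (batch, frames, dim) from an audio encoder
text_emb = torch.randn(1, 20, llm_dim)   # embedded prompt/transcript tokens

# Prepend projected audio frames; the decoder-only LLM attends over both.
inputs = torch.cat([projector(speech_feats), text_emb], dim=1)  # (1, 170, 4096)
```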
arXiv Detail & Related papers (2023-07-08T06:47:58Z)
- BASEN: Time-Domain Brain-Assisted Speech Enhancement Network with Convolutional Cross Attention in Multi-talker Conditions [36.15815562576836]
Time-domain single-channel speech enhancement (SE) remains challenging: without prior information about the target speaker, extracting that speaker under multi-talker conditions is difficult.
We propose a novel time-domain brain-assisted SE network (BASEN) that incorporates electroencephalography (EEG) signals recorded from the listener to extract the target speaker from monaural speech mixtures.
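Conceptually, the EEG stream tells the network which talker the listener is attending to. One way to picture the fusion is standard multi-head attention where audio features query EEG features; this is a sketch of the idea, not BASEN's convolutional cross-attention variant:

```python
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)

audio_feats = torch.randn(2, 400, 256)  # (batch, time, dim) mixture features
eeg_feats = torch.randn(2, 400, 256)    # time-aligned listener-EEG features

# Audio queries attend over EEG keys/values: the listener's neural
# attention biases the enhancement toward the attended talker.
fused, _ = attn(query=audio_feats, key=eeg_feats, value=eeg_feats)
```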
arXiv Detail & Related papers (2023-05-17T06:40:31Z)
- SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training [106.34112664893622]
We propose a unified-modal speech-unit-text pre-training model, SpeechUT, to connect the representations of a speech encoder and a text decoder with a shared unit encoder.
Our proposed SpeechUT is fine-tuned and evaluated on automatic speech recognition (ASR) and speech translation (ST) tasks.
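The bridging idea is that speech (via speech-to-unit) and text (via text-to-unit) both map to the same hidden-unit sequence, which a shared unit encoder then consumes; a structural sketch with illustrative module sizes:

```python
import torch
import torch.nn as nn

unit_vocab, dim = 500, 768               # illustrative hidden-unit inventory
unit_embed = nn.Embedding(unit_vocab, dim)
shared_unit_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
    num_layers=4)

units = torch.randint(0, unit_vocab, (2, 120))  # unit ids from speech OR text
# Both modalities meet in the same unit representation, which the
# text decoder then consumes, bridging speech and text pre-training.
unit_states = shared_unit_encoder(unit_embed(units))
```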
arXiv Detail & Related papers (2022-10-07T17:57:45Z)
- Unsupervised TTS Acoustic Modeling for TTS with Conditional Disentangled Sequential VAE [36.50265124324876]
We propose a novel unsupervised text-to-speech acoustic model training scheme, named UTTS, which does not require text-audio pairs.
The framework offers a flexible choice of a speaker's duration model, timbre feature (identity) and content for TTS inference.
Experiments demonstrate that UTTS can synthesize speech of high naturalness and intelligibility measured by human and objective evaluations.
arXiv Detail & Related papers (2022-06-06T11:51:22Z)
- Synthesized Speech Detection Using Convolutional Transformer-Based Spectrogram Analysis [16.93803259128475]
Synthesized speech can be used for nefarious purposes, including creating a purported speech signal and attributing it to someone who did not speak the content of the signal.
In this paper, we analyze speech signals in the form of spectrograms with a Compact Convolutional Transformer for synthesized speech detection.
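The front end here is just a spectrogram; a minimal sketch of preparing log-spectrogram inputs for a patch-based transformer classifier, where the STFT parameters are assumptions:

```python
import torch
import torchaudio

wav = torch.randn(1, 16000 * 4)  # 4 s of audio at 16 kHz (stand-in signal)
spec = torchaudio.transforms.Spectrogram(n_fft=512, hop_length=256)(wav)
log_spec = torch.log(spec + 1e-6)  # (1, 257, frames) log-magnitude "image"
# A Compact Convolutional Transformer then tokenizes this spectrogram with a
# small conv stem and classifies it as bona fide vs. synthesized speech.
```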
arXiv Detail & Related papers (2022-05-03T22:05:35Z)
- JETS: Jointly Training FastSpeech2 and HiFi-GAN for End to End Text to Speech [7.476901945542385]
We present an end-to-end text-to-speech (E2E-TTS) model that has a simplified training pipeline and outperforms a cascade of separately learned models.
Our proposed model jointly trains FastSpeech2 and HiFi-GAN with an alignment module.
Experiments on the LJSpeech corpus show that the proposed model outperforms publicly available, state-of-the-art implementations from ESPnet2-TTS.
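Joint training means the acoustic model and vocoder share one optimization step; schematically, the generator objective sums FastSpeech2-style variance terms with HiFi-GAN-style adversarial terms. A runnable sketch follows; the 2.0 and 45.0 weights echo common HiFi-GAN defaults but are illustrative here:

```python
import torch

def jets_generator_loss(losses):
    """Schematic joint generator objective: FastSpeech2 variance terms,
    HiFi-GAN adversarial/feature/mel terms, and an alignment loss are
    summed and backpropagated together (weights illustrative)."""
    return (losses["duration"] + losses["pitch"] + losses["energy"]  # FastSpeech2
            + losses["adv"] + 2.0 * losses["feat_match"]             # HiFi-GAN
            + 45.0 * losses["mel"]                                   # reconstruction
            + losses["align"])                                       # alignment module

# Dummy scalar losses stand in; one backward pass trains both components.
dummy = {k: torch.rand(1, requires_grad=True).sum()
         for k in ["duration", "pitch", "energy", "adv", "feat_match", "mel", "align"]}
jets_generator_loss(dummy).backward()
```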
arXiv Detail & Related papers (2022-03-31T07:25:11Z)