Decoding inner speech with an end-to-end brain-to-text neural interface
- URL: http://arxiv.org/abs/2511.21740v1
- Date: Fri, 21 Nov 2025 21:25:54 GMT
- Title: Decoding inner speech with an end-to-end brain-to-text neural interface
- Authors: Yizi Zhang, Linyang He, Chaofei Fan, Tingkai Liu, Han Yu, Trung Le, Jingyuan Li, Scott Linderman, Lea Duncker, Francis R Willett, Nima Mesgarani, Liam Paninski
- Abstract summary: Speech brain-computer interfaces (BCIs) aim to restore communication for people with paralysis by translating neural activity into text. Here, we introduce an end-to-end Brain-to-Text framework that translates neural activity into coherent sentences using a single differentiable neural network.
- Score: 33.17572163528015
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speech brain-computer interfaces (BCIs) aim to restore communication for people with paralysis by translating neural activity into text. Most systems use cascaded frameworks that decode phonemes before assembling sentences with an n-gram language model (LM), preventing joint optimization of all stages simultaneously. Here, we introduce an end-to-end Brain-to-Text (BIT) framework that translates neural activity into coherent sentences using a single differentiable neural network. Central to our approach is a cross-task, cross-species pretrained neural encoder, whose representations transfer to both attempted and imagined speech. In a cascaded setting with an n-gram LM, the pretrained encoder establishes a new state-of-the-art (SOTA) on the Brain-to-Text '24 and '25 benchmarks. Integrated end-to-end with audio large language models (LLMs) and trained with contrastive learning for cross-modal alignment, BIT reduces the word error rate (WER) of the prior end-to-end method from 24.69% to 10.22%. Notably, we find that small-scale audio LLMs markedly improve end-to-end decoding. Beyond record-setting performance, BIT aligns attempted and imagined speech embeddings to enable cross-task generalization. Altogether, our approach advances the integration of large, diverse neural datasets, paving the way for an end-to-end decoding framework that supports seamless, differentiable optimization.
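The cross-modal alignment described in the abstract relies on contrastive learning between neural and text embeddings. As an illustration only (not the authors' implementation, and with random arrays standing in for encoder outputs), a symmetric InfoNCE-style objective over a batch of paired embeddings can be sketched as:

```python
import numpy as np

def contrastive_alignment_loss(neural_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    neural_emb, text_emb: (batch, dim) arrays from the two encoders.
    Matching pairs share a row index; all other rows act as negatives.
    """
    n = neural_emb / np.linalg.norm(neural_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = n @ t.T / temperature  # (batch, batch) cosine-similarity matrix

    def xent(l):
        # cross-entropy with the diagonal (matched pairs) as the targets
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # average the neural->text and text->neural directions
    return (xent(logits) + xent(logits.T)) / 2

# Toy usage with random embeddings standing in for encoder outputs
rng = np.random.default_rng(0)
loss = contrastive_alignment_loss(rng.standard_normal((8, 256)),
                                  rng.standard_normal((8, 256)))
```

Minimizing such a loss pulls each neural embedding toward its paired text embedding while pushing it away from the other sentences in the batch, which is the general mechanism behind cross-modal alignment objectives of this kind.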
Related papers
- Brain-language fusion enables interactive neural readout and in-silico experimentation [0.8805057433368938]
CorText is a framework that integrates neural activity directly into the latent space of a large language model. It generates accurate image captions and answers detailed questions better than controls, while having access only to neural data. These advances mark a shift from passive decoding toward generative, flexible interfaces between brain activity and language.
arXiv Detail & Related papers (2025-09-28T15:35:25Z)
- sEEG-based Encoding for Sentence Retrieval: A Contrastive Learning Approach to Brain-Language Alignment [8.466223794246261]
We present SSENSE, a contrastive learning framework that projects single-subject stereo-electroencephalography (sEEG) signals into the sentence embedding space of a frozen CLIP model. We evaluate our method on time-aligned sEEG and spoken transcripts from a naturalistic movie-watching dataset.
arXiv Detail & Related papers (2025-04-20T03:01:42Z)
- Explanations of Large Language Models Explain Language Representations in the Brain [5.7916055414970895]
We propose a novel approach using explainable AI (XAI) to strengthen the link between language processing in models and neural activity in the brain. Applying attribution methods, we quantify the influence of preceding words on model predictions. We find that stronger attributions correspond to closer brain alignment, suggesting attribution as a means of assessing the biological plausibility of explanation methods.
arXiv Detail & Related papers (2025-02-20T16:05:45Z)
- Brain-to-Text Benchmark '24: Lessons Learned [30.41641771704316]
Speech brain-computer interfaces aim to decipher what a person is trying to say from neural activity alone. The Brain-to-Text Benchmark '24 fosters the advancement of decoding algorithms that convert neural activity to text. The benchmark will remain open indefinitely to support further work toward increasing the accuracy of brain-to-text algorithms.
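Accuracy on this benchmark, as in the abstract's 24.69% to 10.22% improvement, is measured as word error rate (WER). A minimal word-level Levenshtein implementation of WER, for illustration only and not the benchmark's official scorer:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference length,
    computed by word-level Levenshtein distance via dynamic programming."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

# "brain" -> "brian" is one substitution out of four reference words
print(word_error_rate("decode the brain signal", "decode the brian signal"))  # → 0.25
```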
arXiv Detail & Related papers (2024-12-23T02:44:35Z)
- BrainECHO: Semantic Brain Signal Decoding through Vector-Quantized Spectrogram Reconstruction for Whisper-Enhanced Text Generation [48.20672677492805]
Current EEG/MEG-to-text decoding systems suffer from three key limitations. BrainECHO is a multi-stage framework that employs decoupled representation learning. BrainECHO demonstrates robustness across sentence, session, and subject-independent conditions.
arXiv Detail & Related papers (2024-10-19T04:29:03Z)
- Language Reconstruction with Brain Predictive Coding from fMRI Data [28.217967547268216]
The theory of predictive coding suggests that the human brain naturally engages in continuously predicting future word representations.
PredFT achieves current state-of-the-art decoding performance with a maximum BLEU-1 score of 27.8%.
arXiv Detail & Related papers (2024-05-19T16:06:02Z)
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
- Sequential Best-Arm Identification with Application to Brain-Computer Interface [34.87975833920409]
A brain-computer interface (BCI) is a technology that enables direct communication between the brain and an external device or computer system.
An electroencephalogram (EEG) and event-related potential (ERP)-based speller system is a type of BCI that allows users to spell words without using a physical keyboard.
We propose a sequential top-two Thompson sampling (STTS) algorithm under the fixed-confidence setting and the fixed-budget setting.
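One sampling round of the top-two Thompson sampling scheme named above can be sketched as follows. This is an illustration under an assumed reward model (Bernoulli arms with Beta posteriors, as is standard for ERP-speller stimulus selection), not the paper's exact algorithm:

```python
import random

def top_two_thompson_step(successes, failures, beta=0.5):
    """One arm-selection step of top-two Thompson sampling (TTTS).

    successes[i], failures[i]: observed Bernoulli outcomes for arm i,
    giving a Beta(successes[i]+1, failures[i]+1) posterior over its mean.
    With probability beta, play the Thompson-sampled leader; otherwise
    resample until a different arm (the challenger) wins.
    Assumes at least two arms.
    """
    k = len(successes)
    sample = [random.betavariate(successes[i] + 1, failures[i] + 1) for i in range(k)]
    leader = max(range(k), key=lambda i: sample[i])
    if random.random() < beta:
        return leader
    while True:  # resample until a challenger different from the leader tops
        sample = [random.betavariate(successes[i] + 1, failures[i] + 1) for i in range(k)]
        challenger = max(range(k), key=lambda i: sample[i])
        if challenger != leader:
            return challenger

# Toy usage: arm 0 looks strong, arm 1 weak; TTTS still probes challengers
arm = top_two_thompson_step([9, 2], [1, 8])
```

Splitting pulls between the leader and a challenger is what distinguishes best-arm identification from reward maximization: the sampler keeps gathering evidence to separate the top two arms rather than exploiting the current favorite.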
arXiv Detail & Related papers (2023-05-17T18:49:44Z)
- Decoding speech perception from non-invasive brain recordings [48.46819575538446]
We introduce a model trained with contrastive learning to decode self-supervised representations of perceived speech from non-invasive recordings.
Our model can identify, from 3 seconds of MEG signals, the corresponding speech segment with up to 41% accuracy out of more than 1,000 distinct possibilities.
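The segment-identification evaluation above amounts to nearest-neighbour retrieval in a shared embedding space: the decoded brain embedding is matched against a pool of candidate speech-segment embeddings. A minimal sketch, with random vectors standing in for learned MEG and speech embeddings:

```python
import numpy as np

def identify_segment(query_emb, candidate_embs):
    """Return the index of the candidate segment whose embedding has the
    highest cosine similarity with the decoded query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    c = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    return int(np.argmax(c @ q))

# Toy pool of 1,000 candidate segments; the query is a noisy view of one of them
rng = np.random.default_rng(0)
cands = rng.standard_normal((1000, 128))
query = cands[42] + 0.1 * rng.standard_normal(128)  # mildly corrupted segment 42
print(identify_segment(query, cands))  # → 42
```

Reported accuracies like "41% out of more than 1,000 possibilities" are top-1 hit rates of exactly this kind of retrieval, against a chance level of roughly 0.1%.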
arXiv Detail & Related papers (2022-08-25T10:01:43Z)
- Toward a realistic model of speech processing in the brain with self-supervised learning [67.7130239674153]
Self-supervised algorithms trained on the raw waveform constitute a promising candidate.
We show that Wav2Vec 2.0 learns brain-like representations with as little as 600 hours of unlabelled speech.
arXiv Detail & Related papers (2022-06-03T17:01:46Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- Open Vocabulary Electroencephalography-To-Text Decoding and Zero-shot Sentiment Classification [78.120927891455]
State-of-the-art brain-to-text systems have achieved great success in decoding language directly from brain signals using neural networks.
In this paper, we extend the problem to open-vocabulary Electroencephalography (EEG)-to-Text sequence-to-sequence decoding and zero-shot sentence sentiment classification on natural reading tasks.
Our model achieves a 40.1% BLEU-1 score on EEG-To-Text decoding and a 55.6% F1 score on zero-shot EEG-based ternary sentiment classification, which significantly outperforms supervised baselines.
arXiv Detail & Related papers (2021-12-05T21:57:22Z)
- Integrating end-to-end neural and clustering-based diarization: Getting the best of both worlds [71.36164750147827]
Clustering-based approaches assign speaker labels to speech regions by clustering speaker embeddings such as x-vectors.
End-to-end neural diarization (EEND) directly predicts diarization labels using a neural network.
We propose a simple but effective hybrid diarization framework that works with overlapped speech and for long recordings containing an arbitrary number of speakers.
arXiv Detail & Related papers (2020-10-26T06:33:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.