Articulatory Feature Prediction from Surface EMG during Speech Production
- URL: http://arxiv.org/abs/2505.13814v2
- Date: Thu, 29 May 2025 03:59:36 GMT
- Title: Articulatory Feature Prediction from Surface EMG during Speech Production
- Authors: Jihwan Lee, Kevin Huang, Kleanthis Avramidis, Simon Pistrosch, Monica Gonzalez-Machorro, Yoonjeong Lee, Björn Schuller, Louis Goldstein, Shrikanth Narayanan
- Abstract summary: We present a model for predicting articulatory features from surface electromyography (EMG) signals during speech production. The proposed model integrates convolutional layers and a Transformer block, followed by separate predictors for articulatory features. We demonstrate that these predicted articulatory features can be decoded into intelligible speech waveforms.
- Score: 25.10685431811405
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a model for predicting articulatory features from surface electromyography (EMG) signals during speech production. The proposed model integrates convolutional layers and a Transformer block, followed by separate predictors for articulatory features. Our approach achieves a high prediction correlation of approximately 0.9 for most articulatory features. Furthermore, we demonstrate that these predicted articulatory features can be decoded into intelligible speech waveforms. To our knowledge, this is the first method to decode speech waveforms from surface EMG via articulatory features, offering a novel approach to EMG-based speech synthesis. Additionally, we analyze the relationship between EMG electrode placement and articulatory feature predictability, providing knowledge-driven insights for optimizing EMG electrode configurations. The source code and decoded speech samples are publicly available.
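The abstract gives only the coarse architecture: convolutional layers, a Transformer block, and separate predictors per articulatory feature. A minimal PyTorch sketch of that pipeline, with hypothetical channel counts, layer sizes, and an assumed set of nine articulatory features (none of these values are from the paper), might look like this:

```python
# Minimal sketch of the described pipeline: strided 1-D convolutions over the
# EMG channels, a Transformer encoder block, and one regression head per
# articulatory feature. All hyperparameters here are illustrative.
import torch
import torch.nn as nn

class EMGToArticulatory(nn.Module):
    def __init__(self, n_emg_channels=8, d_model=256, n_features=9):
        super().__init__()
        # Strided convolutions downsample the raw EMG in time.
        self.conv = nn.Sequential(
            nn.Conv1d(n_emg_channels, d_model, kernel_size=5, stride=2, padding=2),
            nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=5, stride=2, padding=2),
            nn.ReLU(),
        )
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=2)
        # Separate predictor head per articulatory feature trajectory.
        self.heads = nn.ModuleList(
            [nn.Linear(d_model, 1) for _ in range(n_features)])

    def forward(self, emg):                  # emg: (batch, channels, time)
        h = self.conv(emg).transpose(1, 2)   # -> (batch, time', d_model)
        h = self.transformer(h)
        return torch.cat([head(h) for head in self.heads], dim=-1)

x = torch.randn(2, 8, 1000)          # 2 utterances, 8 EMG channels
print(EMGToArticulatory()(x).shape)  # torch.Size([2, 250, 9])
```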
Related papers
- MiSTR: Multi-Modal iEEG-to-Speech Synthesis with Transformer-Based Prosody Prediction and Neural Phase Reconstruction [7.233654849867492]
We introduce MiSTR, a deep-learning framework that integrates temporal, spectral, and neurophysiological representations of iEEG signals. Evaluated on a public iEEG dataset, MiSTR achieves state-of-the-art speech intelligibility.
arXiv Detail & Related papers (2025-08-05T07:12:52Z)
- Tracking Articulatory Dynamics in Speech with a Fixed-Weight BiLSTM-CNN Architecture [0.0]
This paper presents a novel approach for predicting the tongue and lip articulatory features involved in producing a given speech acoustics. The proposed network is trained with two datasets of simultaneously recorded speech and Electromagnetic Articulography (EMA) data.
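As a rough illustration of a BiLSTM-CNN stack for acoustic-to-articulatory inversion (all sizes are hypothetical, not taken from the paper), consider:

```python
# Illustrative BiLSTM-CNN for acoustic-to-articulatory inversion:
# a convolution summarizes local acoustic context, a bidirectional LSTM
# models temporal dependencies, and a linear layer regresses EMA trajectories.
import torch
import torch.nn as nn

class BiLSTMCNNInversion(nn.Module):
    def __init__(self, n_acoustic=40, hidden=128, n_ema=12):
        super().__init__()
        self.conv = nn.Conv1d(n_acoustic, hidden, kernel_size=5, padding=2)
        self.bilstm = nn.LSTM(hidden, hidden, batch_first=True,
                              bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_ema)  # e.g. x/y of 6 sensors

    def forward(self, acoustics):             # (batch, n_acoustic, time)
        h = torch.relu(self.conv(acoustics)).transpose(1, 2)
        return self.out(self.bilstm(h)[0])    # (batch, time, n_ema)

print(BiLSTMCNNInversion()(torch.randn(2, 40, 300)).shape)  # (2, 300, 12)
```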
arXiv Detail & Related papers (2025-04-25T05:57:22Z)
- NeuroSpex: Neuro-Guided Speaker Extraction with Cross-Modal Attention [47.8479647938849]
We present a neuro-guided speaker extraction model, NeuroSpex, which uses the EEG response of the listener as the sole auxiliary reference cue.
We propose a novel EEG signal encoder that captures the attention information. Additionally, we propose a cross-attention (CA) mechanism to enhance the speech feature representations.
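A hedged sketch of the cross-attention idea follows: speech features from the mixture attend to the listener's EEG embedding, so the extractor can focus on the attended speaker. Dimensions and module structure are illustrative, not taken from the paper.

```python
# Cross-attention between speech features (queries) and EEG features
# (keys/values), with a residual connection and layer norm.
import torch
import torch.nn as nn

class EEGSpeechCrossAttention(nn.Module):
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, speech_feats, eeg_feats):
        # Queries from the mixture's speech features; keys/values from EEG.
        attended, _ = self.attn(speech_feats, eeg_feats, eeg_feats)
        return self.norm(speech_feats + attended)  # residual connection

speech = torch.randn(2, 200, 256)  # (batch, speech frames, d_model)
eeg = torch.randn(2, 100, 256)     # (batch, EEG frames, d_model)
print(EEGSpeechCrossAttention()(speech, eeg).shape)  # (2, 200, 256)
```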
arXiv Detail & Related papers (2024-09-04T07:33:01Z)
- Beam Prediction based on Large Language Models [51.45077318268427]
We formulate the millimeter wave (mmWave) beam prediction problem as a time series forecasting task. We transform historical observations into text-based representations using a trainable tokenizer. Our method harnesses the power of LLMs to predict future optimal beams.
arXiv Detail & Related papers (2024-08-16T12:40:01Z)
- Toward Fully-End-to-End Listened Speech Decoding from EEG Signals [29.548052495254257]
We propose FESDE, a novel framework for Fully-End-to-end Speech Decoding from EEG signals.
The proposed method consists of an EEG module and a speech module along with a connector.
A fine-grained phoneme analysis is conducted to unveil model characteristics of speech decoding.
arXiv Detail & Related papers (2024-06-12T21:08:12Z)
- Topology of surface electromyogram signals: hand gesture decoding on Riemannian manifolds [0.0]
We present data and methods for decoding hand gestures using surface electromyogram (EMG) signals. EMG-based upper limb interfaces are valuable for amputee rehabilitation, artificial supernumerary limb augmentation, gestural control of computers, and virtual and augmented reality applications.
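The standard Riemannian pipeline for such decoding represents each EMG window by its spatial covariance matrix (a point on the SPD manifold), maps it to a Euclidean tangent space, and classifies there. A self-contained sketch with synthetic data standing in for real recordings (the paper's exact method may differ):

```python
# Riemannian-style EMG gesture decoding: covariance -> tangent space -> classifier.
import numpy as np
from scipy.linalg import logm
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_trials, n_channels, n_samples = 60, 8, 500
X = rng.standard_normal((n_trials, n_channels, n_samples))
y = rng.integers(0, 3, n_trials)  # three hypothetical gestures

def covariances(trials, eps=1e-6):
    # Regularized sample covariance per trial -> SPD matrices.
    return np.stack([t @ t.T / t.shape[1] + eps * np.eye(t.shape[0])
                     for t in trials])

def tangent_vectors(covs):
    # Log-Euclidean tangent space: the matrix logarithm flattens the
    # manifold; the upper triangle becomes the feature vector.
    iu = np.triu_indices(covs.shape[1])
    return np.stack([logm(c)[iu].real for c in covs])

feats = tangent_vectors(covariances(X))
clf = LogisticRegression(max_iter=1000).fit(feats, y)
print("train accuracy:", clf.score(feats, y))
```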
arXiv Detail & Related papers (2023-11-14T21:20:54Z)
- Conditional Diffusion Probabilistic Model for Speech Enhancement [101.4893074984667]
We propose a novel speech enhancement algorithm that incorporates characteristics of the observed noisy speech signal into the diffusion and reverse processes.
In our experiments, we demonstrate strong performance of the proposed approach compared to representative generative models.
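One way to fold the noisy observation into the diffusion process is to let the forward-process mean drift from the clean signal toward the noisy one as the step index grows. The following is only a schematic of that idea; the interpolation weights and schedule are illustrative, not the paper's exact formulation.

```python
# Schematic conditional forward step: the mean interpolates between clean
# speech x0 and the noisy observation y, then Gaussian noise is added.
import torch

def conditional_forward(x0, y, alpha_bar_t, m_t):
    """x0: clean speech, y: noisy speech, both (batch, samples).
    alpha_bar_t: cumulative noise-schedule term in (0, 1].
    m_t: weight in [0, 1] moving the mean from x0 (m_t=0) to y (m_t=1)."""
    mean = (1 - m_t) * x0 + m_t * y  # condition on the noisy signal
    noise = torch.randn_like(x0)
    return alpha_bar_t.sqrt() * mean + (1 - alpha_bar_t).sqrt() * noise

x0 = torch.randn(2, 16000)            # 1 s of clean speech at 16 kHz
y = x0 + 0.3 * torch.randn_like(x0)   # simulated noisy observation
x_mid = conditional_forward(x0, y, torch.tensor(0.5), m_t=0.5)
print(x_mid.shape)  # torch.Size([2, 16000])
```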
arXiv Detail & Related papers (2022-02-10T18:58:01Z)
- Discretization and Re-synthesis: an alternative method to solve the Cocktail Party Problem [65.25725367771075]
This study demonstrates, for the first time, that the synthesis-based approach can also perform well on this problem.
Specifically, we propose a novel speech separation/enhancement model based on the recognition of discrete symbols.
After the discrete symbol sequence is predicted, each target speech signal can be re-synthesized by feeding the symbols to the synthesis model.
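A toy illustration of the discretize-then-resynthesize idea (a real system would use learned codebooks and a neural vocoder; this codebook is random):

```python
# Continuous speech features are snapped to the nearest codebook entry
# (frame -> discrete symbol), and features are reconstructed by lookup.
import torch

codebook = torch.randn(512, 80)  # 512 symbols, 80-dim features
feats = torch.randn(100, 80)     # 100 frames of speech features
# Nearest-neighbour quantization: each frame becomes a symbol id.
symbols = torch.cdist(feats, codebook).argmin(dim=1)  # shape (100,)
recon = codebook[symbols]        # re-synthesis from the symbol sequence
print(symbols[:10], recon.shape)
```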
arXiv Detail & Related papers (2021-12-17T08:35:40Z)
- Factorized Neural Transducer for Efficient Language Model Adaptation [51.81097243306204]
We propose a novel model, the factorized neural Transducer, which factorizes the blank and vocabulary prediction.
It is expected that this factorization can transfer the improvement of the standalone language model to the Transducer for speech recognition.
We demonstrate that the proposed factorized neural Transducer yields 15% to 20% WER improvements when out-of-domain text data is used for language model adaptation.
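A sketch of the factorization idea: the blank logit and the vocabulary logits come from separate predictors, so the vocabulary branch behaves like a standalone language model that can be adapted on text alone. The encoder and joint network are omitted, and all sizes are illustrative.

```python
# Factorized prediction network: one branch scores blank, the other
# (LM-like) branch scores vocabulary tokens; logits are concatenated.
import torch
import torch.nn as nn

class FactorizedPredictor(nn.Module):
    def __init__(self, vocab_size=1000, d_model=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.blank_rnn = nn.LSTM(d_model, d_model, batch_first=True)
        self.vocab_rnn = nn.LSTM(d_model, d_model, batch_first=True)  # ~LM
        self.blank_out = nn.Linear(d_model, 1)           # single blank logit
        self.vocab_out = nn.Linear(d_model, vocab_size)  # token logits

    def forward(self, tokens):
        e = self.embed(tokens)
        blank_logit = self.blank_out(self.blank_rnn(e)[0])
        vocab_logits = self.vocab_out(self.vocab_rnn(e)[0])
        # Index 0 is blank; the rest are vocabulary tokens.
        return torch.cat([blank_logit, vocab_logits], dim=-1)

logits = FactorizedPredictor()(torch.randint(0, 1000, (2, 10)))
print(logits.shape)  # torch.Size([2, 10, 1001])
```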
arXiv Detail & Related papers (2021-09-27T15:04:00Z)
- An Improved Model for Voicing Silent Speech [42.75251355374594]
We present an improved model for voicing silent speech, where audio is synthesized from facial electromyography (EMG) signals.
Our model uses convolutional layers to extract features from the signals and Transformer layers to propagate information across longer distances.
arXiv Detail & Related papers (2021-06-03T15:33:23Z)
- End-to-end Audio-visual Speech Recognition with Conformers [65.30276363777514]
We present a hybrid CTC/Attention model based on a ResNet-18 and a Convolution-augmented Transformer (Conformer).
In particular, the audio and visual encoders learn to extract features directly from raw pixels and audio waveforms.
We show that our proposed models raise the state-of-the-art performance by a large margin in audio-only, visual-only, and audio-visual experiments.
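Hybrid CTC/attention training commonly interpolates the two objectives as L = λ·L_CTC + (1−λ)·L_attention; a minimal sketch with an illustrative weight λ (the paper's exact value is not given here):

```python
# Interpolated CTC + attention (cross-entropy) loss for encoder/decoder ASR.
import torch
import torch.nn as nn

def hybrid_loss(enc_logits, dec_logits, targets, enc_lens, tgt_lens, lam=0.3):
    """enc_logits: (time, batch, vocab) encoder outputs for CTC;
    dec_logits: (batch, tgt_len, vocab) decoder outputs for attention;
    targets: (batch, tgt_len) label ids (0 reserved for blank)."""
    ctc = nn.CTCLoss(blank=0)(enc_logits.log_softmax(-1),
                              targets, enc_lens, tgt_lens)
    att = nn.CrossEntropyLoss()(dec_logits.transpose(1, 2), targets)
    return lam * ctc + (1 - lam) * att

vocab, T, B, U = 30, 50, 2, 10
enc = torch.randn(T, B, vocab)
dec = torch.randn(B, U, vocab)
tgt = torch.randint(1, vocab, (B, U))  # labels exclude the blank id 0
loss = hybrid_loss(enc, dec, tgt,
                   torch.full((B,), T), torch.full((B,), U))
print(loss.item())
```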
arXiv Detail & Related papers (2021-02-12T18:00:08Z)
- Correlation based Multi-phasal models for improved imagined speech EEG recognition [22.196642357767338]
This work aims to profit from the parallel information contained in multi-phasal EEG data recorded while speaking, imagining and performing articulatory movements corresponding to specific speech units.
A bi-phase common representation learning module using neural networks is designed to model the correlation between an analysis phase and a support phase.
The proposed approach further handles the non-availability of multi-phasal data during decoding.
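A simplified stand-in for such a correlation objective: encode each phase, then maximize the per-dimension Pearson correlation between the two embeddings (a CCA-style loss; the paper's exact formulation may differ).

```python
# Negative mean Pearson correlation between paired phase embeddings.
import torch

def correlation_loss(z_analysis, z_support, eps=1e-8):
    """z_*: (batch, dim) embeddings of the analysis and support phases."""
    a = z_analysis - z_analysis.mean(0)
    b = z_support - z_support.mean(0)
    corr = (a * b).sum(0) / (a.norm(dim=0) * b.norm(dim=0) + eps)
    return -corr.mean()  # minimizing this maximizes mean correlation

z1, z2 = torch.randn(16, 64), torch.randn(16, 64)
print(correlation_loss(z1, z2).item())
```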
arXiv Detail & Related papers (2020-11-04T09:39:53Z)