emg2speech: synthesizing speech from electromyography using self-supervised speech models
- URL: http://arxiv.org/abs/2510.23969v1
- Date: Tue, 28 Oct 2025 00:50:15 GMT
- Title: emg2speech: synthesizing speech from electromyography using self-supervised speech models
- Authors: Harshavardhana T. Gowda, Lee M. Miller
- Abstract summary: We present a neuromuscular speech interface that translates electromyographic (EMG) signals collected from orofacial muscles during speech articulation directly into audio. We show that self-supervised speech (SS) representations exhibit a strong linear relationship with the electrical power of muscle action potentials.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a neuromuscular speech interface that translates electromyographic (EMG) signals collected from orofacial muscles during speech articulation directly into audio. We show that self-supervised speech (SS) representations exhibit a strong linear relationship with the electrical power of muscle action potentials: SS features can be linearly mapped to EMG power with a correlation of $r = 0.85$. Moreover, EMG power vectors corresponding to different articulatory gestures form structured and separable clusters in feature space. This relationship: $\text{SS features}$ $\xrightarrow{\texttt{linear mapping}}$ $\text{EMG power}$ $\xrightarrow{\texttt{gesture-specific clustering}}$ $\text{articulatory movements}$, highlights that SS models implicitly encode articulatory mechanisms. Leveraging this property, we directly map EMG signals to SS feature space and synthesize speech, enabling end-to-end EMG-to-speech generation without explicit articulatory models and vocoder training.
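The linear relationship described in the abstract (SS features mapped linearly to EMG power, then scored by correlation) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the authors' code: the helper names `emg_power`, `fit_linear_map`, and `pearson_r`, the array shapes, and the synthetic data, which is random and says nothing about the reported $r = 0.85$. EMG power is taken here as the mean squared amplitude over non-overlapping windows, which is one common definition and may differ from the paper's.

```python
import numpy as np


def emg_power(emg, win):
    """Mean squared amplitude per channel over non-overlapping windows.

    emg: array of shape (samples, channels); returns (frames, channels).
    Assumed definition of "power"; the paper may use a different one.
    """
    n_frames = emg.shape[0] // win
    trimmed = emg[: n_frames * win]
    return (trimmed.reshape(n_frames, win, -1) ** 2).mean(axis=1)


def fit_linear_map(features, targets):
    """Ordinary least squares: find W, b with features @ W + b ~= targets."""
    design = np.hstack([features, np.ones((features.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(design, targets, rcond=None)
    return coef[:-1], coef[-1]


def pearson_r(pred, true):
    """Column-wise Pearson correlation between two (frames, channels) arrays."""
    p = pred - pred.mean(axis=0)
    t = true - true.mean(axis=0)
    return (p * t).sum(axis=0) / np.sqrt((p ** 2).sum(axis=0) * (t ** 2).sum(axis=0))


# Synthetic stand-ins (hypothetical sizes): 400 frames of 16-dim "SS features"
# and 4 "EMG power" channels built as a noisy linear function of them.
rng = np.random.default_rng(0)
features = rng.standard_normal((400, 16))
true_w = rng.standard_normal((16, 4))
power = features @ true_w + 0.5 * rng.standard_normal((400, 4))

# Fit on the first 300 frames, evaluate correlation on the held-out 100.
W, b = fit_linear_map(features[:300], power[:300])
r = pearson_r(features[300:] @ W + b, power[300:])
print("held-out correlation per channel:", np.round(r, 3))
```

With real recordings, `features` would come from a pretrained self-supervised speech model and `power` from `emg_power` applied to time-aligned EMG; the alignment and feature extraction steps are outside the scope of this sketch.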
Related papers
- Frontend Token Enhancement for Token-Based Speech Recognition [50.35062963870211]
Discretized representations of speech signals are efficient alternatives to continuous features for speech recognition applications.
In this work, we introduce a system that estimates clean speech tokens from noisy speech and evaluate it on an ASR backend using semantic tokens.
We consider four types of enhancement models based on their input/token domains: wave-to-wave, token-to-output, continuous SSL features-to-token, and wave-to-token.
arXiv Detail & Related papers (2026-02-04T05:02:15Z)
- HM-Talker: Hybrid Motion Modeling for High-Fidelity Talking Head Synthesis [90.74616208952791]
HM-Talker is a novel framework for generating high-fidelity, temporally coherent talking heads.
Explicit cues use Action Units (AUs), anatomically defined facial muscle movements, alongside implicit features to minimize phoneme-viseme misalignment.
arXiv Detail & Related papers (2025-08-14T12:01:52Z)
- Articulatory Feature Prediction from Surface EMG during Speech Production [25.10685431811405]
We present a model for predicting articulatory features from surface electromyography (EMG) signals during speech production.
The proposed model integrates convolutional layers and a Transformer block, followed by separate predictors for articulatory features.
We demonstrate that these predicted articulatory features can be decoded into intelligible speech waveforms.
arXiv Detail & Related papers (2025-05-20T01:50:05Z)
- Geometry of orofacial neuromuscular signals: speech articulation decoding using surface electromyography [0.0]
We present data and methods for decoding speech articulations using surface electromyogram (EMG) signals.
EMG-based speech neuroprostheses offer a promising approach for restoring audible speech in individuals who have lost the ability to speak intelligibly.
arXiv Detail & Related papers (2024-11-04T20:31:22Z)
- DC-Spin: A Speaker-invariant Speech Tokenizer for Spoken Language Models [45.791472119671916]
Spoken language models (SLMs) process text and speech, enabling simultaneous speech understanding and generation.
DC-Spin aims to improve speech tokenization by bridging audio signals and SLM tokens.
We propose a chunk-wise approach to enable streamable DC-Spin without retraining and degradation.
arXiv Detail & Related papers (2024-10-31T17:43:13Z)
- CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens [49.569695524535454]
We propose to represent speech with supervised semantic tokens, which are derived from a multilingual speech recognition model by inserting vector quantization into the encoder.
Based on the tokens, we further propose a scalable zero-shot TTS synthesizer, CosyVoice, which consists of an LLM for text-to-token generation and a conditional flow matching model for token-to-speech synthesis.
arXiv Detail & Related papers (2024-07-07T15:16:19Z)
- DiscreteSLU: A Large Language Model with Self-Supervised Discrete Speech Units for Spoken Language Understanding [51.32965203977845]
We propose the use of discrete speech units (DSU) instead of continuous-valued speech encoder outputs.
The proposed model shows robust performance on speech inputs from seen/unseen domains and instruction-following capability in spoken question answering.
Our findings suggest that the ASR task and datasets are not crucial in instruction-tuning for spoken question answering tasks.
arXiv Detail & Related papers (2024-06-13T17:28:13Z)
- Toward Fully-End-to-End Listened Speech Decoding from EEG Signals [29.548052495254257]
We propose FESDE, a novel framework for Fully-End-to-end Speech Decoding from EEG signals.
The proposed method consists of an EEG module and a speech module along with a connector.
A fine-grained phoneme analysis is conducted to unveil model characteristics of speech decoding.
arXiv Detail & Related papers (2024-06-12T21:08:12Z)
- EMAGE: Towards Unified Holistic Co-Speech Gesture Generation via Expressive Masked Audio Gesture Modeling [57.08286593059137]
We propose EMAGE, a framework to generate full-body human gestures from audio and masked gestures.
We first introduce BEAT2 (BEAT-SMPLX-FLAME), a new mesh-level holistic co-speech dataset.
Experiments demonstrate that EMAGE generates holistic gestures with state-of-the-art performance.
arXiv Detail & Related papers (2023-12-31T02:25:41Z)
- Topology of surface electromyogram signals: hand gesture decoding on Riemannian manifolds [0.0]
We present data and methods for decoding hand gestures using surface electromyogram (EMG) signals.
EMG-based upper limb interfaces are valuable for amputee rehabilitation, artificial supernumerary limb augmentation, gestural control of computers, and virtual and augmented reality applications.
arXiv Detail & Related papers (2023-11-14T21:20:54Z)
- SpeechGen: Unlocking the Generative Power of Speech Language Models with Prompts [108.04306136086807]
We present research that explores the application of prompt tuning to stimulate speech LMs for various generation tasks, within a unified framework called SpeechGen.
The proposed unified framework holds great promise for efficiency and effectiveness, particularly with the imminent arrival of advanced speech LMs.
arXiv Detail & Related papers (2023-06-03T22:35:27Z)
- SVTS: Scalable Video-to-Speech Synthesis [105.29009019733803]
We introduce a scalable video-to-speech framework consisting of two components: a video-to-spectrogram predictor and a pre-trained neural vocoder.
We are the first to show intelligible results on the challenging LRS3 dataset.
arXiv Detail & Related papers (2022-05-04T13:34:07Z)
- An Improved Model for Voicing Silent Speech [42.75251355374594]
We present an improved model for voicing silent speech, where audio is synthesized from facial electromyography (EMG) signals.
Our model uses convolutional layers to extract features from the signals and Transformer layers to propagate information across longer distances.
arXiv Detail & Related papers (2021-06-03T15:33:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.