Can LLMs Understand Unvoiced Speech? Exploring EMG-to-Text Conversion with LLMs
- URL: http://arxiv.org/abs/2506.00304v1
- Date: Fri, 30 May 2025 23:22:44 GMT
- Title: Can LLMs Understand Unvoiced Speech? Exploring EMG-to-Text Conversion with LLMs
- Authors: Payal Mohapatra, Akash Pandey, Xiaoyuan Zhang, Qi Zhu
- Abstract summary: Unvoiced electromyography (EMG) is an effective communication tool for individuals unable to produce vocal speech. Given the rise of large language models (LLMs) in speech recognition, we explore their potential to understand unvoiced speech. We propose a novel EMG adaptor module that maps EMG features into an LLM's input space, achieving an average word error rate (WER) of 0.49 on a closed-vocabulary unvoiced EMG-to-text task.
- Score: 4.201963244739168
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Unvoiced electromyography (EMG) is an effective communication tool for individuals unable to produce vocal speech. However, most prior methods rely on paired voiced and unvoiced EMG signals, along with speech data, for EMG-to-text conversion, which is not practical for such individuals. Given the rise of large language models (LLMs) in speech recognition, we explore their potential to understand unvoiced speech. To this end, we address the challenge of learning from unvoiced EMG alone and propose a novel EMG adaptor module that maps EMG features into an LLM's input space, achieving an average word error rate (WER) of 0.49 on a closed-vocabulary unvoiced EMG-to-text task. Even with a conservative data availability of just six minutes, our approach improves performance over specialized models by nearly 20%. While LLMs have been shown to be extendable to new language modalities -- such as audio -- understanding articulatory biosignals like unvoiced EMG remains more challenging. This work takes a crucial first step toward enabling LLMs to comprehend unvoiced speech using surface EMG.
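The adaptor is described above only at a high level: it maps EMG feature frames into the LLM's input-embedding space so the language model can decode text from them. Below is a minimal sketch of that pattern; the module name, dimensions, and layer choices are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch of an EMG-to-LLM adaptor (not the authors' code).
# Assumes pre-extracted EMG feature frames of shape (batch, frames, emg_dim)
# and a decoder-only LLM whose input embeddings have size llm_dim.
import torch
import torch.nn as nn

class EMGAdaptor(nn.Module):
    def __init__(self, emg_dim: int = 112, llm_dim: int = 4096, hidden: int = 512):
        super().__init__()
        # Small temporal encoder that downsamples and summarizes EMG dynamics.
        self.encoder = nn.Sequential(
            nn.Conv1d(emg_dim, hidden, kernel_size=5, stride=2, padding=2),
            nn.GELU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, stride=2, padding=2),
            nn.GELU(),
        )
        # Projection into the LLM's token-embedding space.
        self.proj = nn.Linear(hidden, llm_dim)

    def forward(self, emg_feats: torch.Tensor) -> torch.Tensor:
        # (batch, frames, emg_dim) -> (batch, frames // 4, llm_dim)
        x = self.encoder(emg_feats.transpose(1, 2)).transpose(1, 2)
        return self.proj(x)
```

The adapted EMG sequence would then be prepended to the embedded text prompt so the LLM can decode the transcript; training only the adaptor on EMG/text pairs is one plausible way to make the six-minute low-data regime workable, though the paper's exact fine-tuning recipe may differ.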
Related papers
- Speech Recognition With LLMs Adapted to Disordered Speech Using Reinforcement Learning [13.113505050543298]
We introduce a large language model capable of processing speech inputs.
We show that tuning it further with reinforcement learning on human preferences enables it to adapt better to disordered speech than traditional fine-tuning.
arXiv Detail & Related papers (2024-12-25T00:16:22Z)
- Self-Powered LLM Modality Expansion for Large Speech-Text Models [62.27700381806554]
Large language models (LLMs) exhibit remarkable performance across diverse tasks.
This study aims to refine the use of speech datasets for large speech-text model (LSM) training by addressing the limitations of vanilla instruction tuning.
We introduce a self-powered LSM that leverages augmented automatic speech recognition data generated by the model itself for more effective instruction tuning.
arXiv Detail & Related papers (2024-10-04T04:34:24Z)
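The mechanism compressed into the entry above is essentially self-labeling: the speech-text model transcribes unlabeled audio itself, and the resulting pairs are recycled as instruction-tuning data. A minimal sketch, with a hypothetical `model.generate` interface:

```python
# Illustrative sketch of "self-powered" data generation (names hypothetical).
def self_powered_round(model, unlabeled_audio, instruction="Transcribe the speech."):
    generated = []
    for audio in unlabeled_audio:
        # The model pseudo-labels its own training data.
        hypothesis = model.generate(audio=audio, prompt=instruction)
        generated.append({"audio": audio, "prompt": instruction, "target": hypothesis})
    return generated  # mixed into the next round of instruction tuning
```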
- DeSTA2: Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data [84.01401439030265]
Recent end-to-end speech language models (SLMs) have expanded upon the capabilities of large language models (LLMs).
We present a simple yet effective automatic process for creating speech-text pair data.
Our model demonstrates general capabilities for speech-related tasks without the need for speech instruction-tuning data.
arXiv Detail & Related papers (2024-09-30T07:01:21Z)
- Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions [68.98811048970963]
We present a pioneering effort to investigate the capability of large language models (LLMs) in transcribing speech in multi-talker environments.
We use WavLM and Whisper encoders to extract multi-faceted speech representations that are sensitive to speaker characteristics and semantic context.
Experiments reveal the promising performance of our proposed system, MT-LLM, in cocktail-party scenarios.
arXiv Detail & Related papers (2024-09-13T07:28:28Z)
- SpeechPrompt: Prompting Speech Language Models for Speech Processing Tasks [94.10497337235083]
We are the first to explore the potential of prompting speech LMs in the domain of speech processing.
We reformulate speech processing tasks into speech-to-unit generation tasks.
We show that the prompting method can achieve competitive performance compared to the strong fine-tuning method.
arXiv Detail & Related papers (2024-08-23T13:00:10Z)
- Prompting Large Language Models with Speech Recognition Abilities [31.77576008965215]
We extend the capabilities of large language models by directly attaching a small audio encoder, allowing them to perform speech recognition.
Experiments on Multilingual LibriSpeech show that incorporating a conformer encoder into the open-sourced LLaMA-7B allows it to outperform monolingual baselines by 18%.
arXiv Detail & Related papers (2023-07-21T08:39:15Z)
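This encoder-prefix pattern, which Speech-LLaMA below also follows, reduces to concatenating encoder outputs with token embeddings. A rough sketch, where `audio_encoder` and the shapes are placeholders rather than the paper's exact configuration (`get_input_embeddings` follows the Hugging Face transformers convention):

```python
import torch

# Sketch: audio embeddings are prepended to the prompt's token embeddings,
# and the LLM decodes the transcript conditioned on both.
def build_inputs(audio_encoder, llm, audio, prompt_ids):
    audio_embeds = audio_encoder(audio)                   # (1, T_audio, d_llm)
    text_embeds = llm.get_input_embeddings()(prompt_ids)  # (1, T_text, d_llm)
    return torch.cat([audio_embeds, text_embeds], dim=1)  # decode from here
```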
- On decoder-only architecture for speech-to-text and large language model integration [59.49886892602309]
Speech-LLaMA is a novel approach that effectively incorporates acoustic information into text-based large language models.
We conduct experiments on multilingual speech-to-text translation tasks and demonstrate a significant improvement over strong baselines.
arXiv Detail & Related papers (2023-07-08T06:47:58Z)
- Spoken Question Answering and Speech Continuation Using Spectrogram-Powered LLM [19.36630667212398]
We present Spectron, a novel approach to adapting pre-trained large language models (LLMs) to perform spoken question answering (QA) and speech continuation.
Key to our approach is a training objective that jointly supervises speech recognition, text continuation, and speech synthesis.
Our method surpasses existing spoken language models in speaker preservation and semantic coherence.
arXiv Detail & Related papers (2023-05-24T15:39:43Z)
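The joint objective named above can be read as a weighted sum of three per-task losses. A minimal sketch, assuming the individual losses are already computed; the equal weights are illustrative, not Spectron's published values:

```python
# Sketch of a Spectron-style multi-task objective (weights illustrative).
def joint_loss(asr_ce, continuation_ce, spectrogram_recon, w=(1.0, 1.0, 1.0)):
    # asr_ce: cross-entropy on transcript tokens
    # continuation_ce: cross-entropy on continuation text tokens
    # spectrogram_recon: reconstruction error on predicted spectrogram frames
    return w[0] * asr_ce + w[1] * continuation_ce + w[2] * spectrogram_recon
```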
- MAM: Masked Acoustic Modeling for End-to-End Speech-to-Text Translation [27.19320167337675]
We propose a technique to learn a robust speech encoder in a self-supervised fashion only on the speech side.
This technique, termed Masked Acoustic Modeling (MAM), not only provides an alternative solution for improving end-to-end speech-to-text translation (E2E-ST), but can also perform pre-training on any acoustic signals.
Without using any transcriptions, our technique achieves an average improvement of +1.1 BLEU, and +2.3 BLEU with MAM pre-training.
arXiv Detail & Related papers (2020-10-22T05:02:06Z)
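A minimal sketch of the masking objective, assuming frame-level acoustic features and any sequence encoder; the random frame mask and zero-fill corruption are illustrative simplifications of MAM's span masking:

```python
import torch

# Sketch of masked acoustic modeling: corrupt random frames and train the
# encoder to reconstruct them, using no transcriptions at all.
def mam_step(encoder, feats, mask_rate=0.15):
    # feats: (batch, T, d) acoustic features, e.g. filterbanks
    mask = torch.rand(feats.shape[:2], device=feats.device) < mask_rate  # (batch, T)
    corrupted = feats.masked_fill(mask.unsqueeze(-1), 0.0)   # zero out masked frames
    recon = encoder(corrupted)                               # (batch, T, d)
    return ((recon - feats) ** 2)[mask].mean()               # loss on masked frames
```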
- Digital Voicing of Silent Speech [48.15708685020142]
We consider the task of digitally voicing silent speech, where silently mouthed words are converted to audible speech based on electromyography (EMG) sensor measurements.
We introduce a method of training on silent EMG by transferring audio targets from vocalized to silent signals.
Our method greatly improves the intelligibility of audio generated from silent EMG compared to a baseline that only trains with vocalized data.
arXiv Detail & Related papers (2020-10-06T18:23:35Z)
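The target-transfer step named above can be sketched as align-then-relabel: audio features recorded during vocalized EMG are aligned to silent EMG of the same utterance and reused as its training targets. Here `dtw_path` stands in for any dynamic-time-warping routine, and the per-frame pairing is a simplification of the paper's procedure:

```python
# Sketch of transferring audio targets from vocalized to silent EMG.
def transfer_targets(silent_emg, voiced_emg, voiced_audio_feats, dtw_path):
    # dtw_path returns monotonic (i, j) pairs aligning silent frame i
    # to vocalized frame j; a standard DTW path covers every index.
    path = dtw_path(silent_emg, voiced_emg)
    target = {}
    for i, j in path:
        target.setdefault(i, voiced_audio_feats[j])  # first match per silent frame
    # (silent EMG frame, aligned audio target) training pairs
    return [(silent_emg[i], target[i]) for i in range(len(silent_emg))]
```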