Digital Voicing of Silent Speech
- URL: http://arxiv.org/abs/2010.02960v1
- Date: Tue, 6 Oct 2020 18:23:35 GMT
- Title: Digital Voicing of Silent Speech
- Authors: David Gaddy and Dan Klein
- Abstract summary: We consider the task of digitally voicing silent speech, where silently mouthed words are converted to audible speech based on electromyography (EMG) sensor measurements.
We introduce a method of training on silent EMG by transferring audio targets from vocalized to silent signals.
Our method greatly improves intelligibility of audio generated from silent EMG compared to a baseline that only trains with vocalized data.
- Score: 48.15708685020142
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we consider the task of digitally voicing silent speech, where
silently mouthed words are converted to audible speech based on
electromyography (EMG) sensor measurements that capture muscle impulses. While
prior work has focused on training speech synthesis models from EMG collected
during vocalized speech, we are the first to train from EMG collected during
silently articulated speech. We introduce a method of training on silent EMG by
transferring audio targets from vocalized to silent signals. Our method greatly
improves intelligibility of audio generated from silent EMG compared to a
baseline that only trains with vocalized data, decreasing transcription word
error rate from 64% to 4% in one data condition and 88% to 68% in another. To
spur further development on this task, we share our new dataset of silent and
vocalized facial EMG measurements.
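The abstract describes transferring audio targets from vocalized recordings onto silent ones, but not how the two signals are aligned. The sketch below is a minimal illustration, assuming alignment by plain dynamic time warping (DTW) over EMG feature frames of paired silent and vocalized recordings of the same utterance; the function names (dtw_align, transfer_targets), the Euclidean frame cost, and the toy feature shapes are illustrative assumptions, not the authors' released implementation.

```python
import numpy as np

def dtw_align(silent_feats, voiced_feats):
    """Monotonic alignment between two feature sequences via dynamic time
    warping with a Euclidean per-frame cost (illustrative, unoptimized)."""
    n, m = len(silent_feats), len(voiced_feats)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(silent_feats[i - 1] - voiced_feats[j - 1])
            cost[i, j] = d + min(cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
    # Backtrack from (n, m) to recover the warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def transfer_targets(silent_emg_feats, voiced_emg_feats, voiced_audio_feats):
    """Warp the vocalized recording's audio features onto the silent
    recording's time axis, giving one audio target per silent EMG frame."""
    path = dtw_align(silent_emg_feats, voiced_emg_feats)
    targets = np.zeros((len(silent_emg_feats), voiced_audio_feats.shape[1]))
    counts = np.zeros(len(silent_emg_feats))
    for i, j in path:
        targets[i] += voiced_audio_feats[j]
        counts[i] += 1
    return targets / np.maximum(counts, 1)[:, None]

# Toy usage with random stand-ins for framed EMG and audio features.
rng = np.random.default_rng(0)
silent_emg = rng.normal(size=(120, 8))     # 120 frames x 8 EMG channels (silent)
voiced_emg = rng.normal(size=(150, 8))     # same utterance, vocalized
voiced_audio = rng.normal(size=(150, 26))  # audio features for the vocalized take
print(transfer_targets(silent_emg, voiced_emg, voiced_audio).shape)  # (120, 26)
```

Once the vocalized audio features have been re-timed this way, each silent EMG frame has a frame-level audio target, so the silent recordings can be trained against audio targets in the same way as the vocalized ones.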
Related papers
- LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders [53.30016986953206]
We propose LA-VocE, a new two-stage approach that predicts mel-spectrograms from noisy audio-visual speech via a transformer-based architecture.
We train and evaluate our framework on thousands of speakers and 11+ different languages, and study our model's ability to adapt to different levels of background noise and speech interference.
arXiv Detail & Related papers (2022-11-20T15:27:55Z) - Self-Supervised Speech Representations Preserve Speech Characteristics
while Anonymizing Voices [15.136348385992047]
We train several voice conversion models using self-supervised speech representations.
Converted voices retain a low word error rate, within 1% of that of the original voice.
Experiments on dysarthric speech data show that speech features relevant to articulation, prosody, phonation and phonology can be extracted from anonymized voices.
arXiv Detail & Related papers (2022-04-04T17:48:01Z) - Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement
by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z) - Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs
for Robust Speech Recognition [52.71604809100364]
We propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech.
Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network.
In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech and use them as additional prediction targets (a toy sketch of this switched-target idea appears after this list).
arXiv Detail & Related papers (2021-10-11T00:08:48Z) - Injecting Text in Self-Supervised Speech Pretraining [33.676479965610774]
We propose to jointly learn representations during pretraining from two different modalities: speech and text.
tts4pretrain complements the power of contrastive learning in self-supervision.
We demonstrate Word Error Rate (WER) reductions of 10% relative on the well-benchmarked Librispeech task.
arXiv Detail & Related papers (2021-08-27T11:36:40Z) - An Improved Model for Voicing Silent Speech [42.75251355374594]
We present an improved model for voicing silent speech, where audio is synthesized from facial electromyography (EMG) signals.
Our model uses convolutional layers to extract features from the signals and Transformer layers to propagate information across longer distances (a minimal architectural sketch appears after this list).
arXiv Detail & Related papers (2021-06-03T15:33:23Z) - MAM: Masked Acoustic Modeling for End-to-End Speech-to-Text Translation [27.19320167337675]
We propose a technique to learn a robust speech encoder in a self-supervised fashion only on the speech side.
This technique, termed Masked Acoustic Modeling (MAM), not only provides an alternative solution to improving E2E-ST, but can also perform pre-training on any acoustic signals.
In the setting without using any transcriptions, our technique achieves an average improvement of +1.1 BLEU, and +2.3 BLEU with MAM pre-training.
arXiv Detail & Related papers (2020-10-22T05:02:06Z) - CITISEN: A Deep Learning-Based Speech Signal-Processing Mobile
Application [63.2243126704342]
This study presents a deep learning-based speech signal-processing mobile application known as CITISEN.
CITISEN provides three functions: speech enhancement (SE), model adaptation (MA), and background noise conversion (BNC).
Compared with the noisy speech signals, the enhanced speech signals achieved improvements of about 6% and 33%.
arXiv Detail & Related papers (2020-08-21T02:04:12Z) - End-to-end Whispered Speech Recognition with Frequency-weighted
Approaches and Pseudo Whisper Pre-training [130.56878980058966]
We present several approaches for end-to-end (E2E) recognition of whispered speech.
We achieve an overall relative reduction of 19.8% in PER and 44.4% in CER on a relatively small whispered TIMIT corpus.
As long as we have a good E2E model pre-trained on normal or pseudo-whispered speech, a relatively small set of whispered speech may suffice to obtain a reasonably good E2E whispered speech recognizer.
arXiv Detail & Related papers (2020-05-05T07:08:53Z)
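The Wav2vec-Switch entry above describes swapping the quantized targets of the original and noise-augmented copies of an utterance. The snippet below is a toy NumPy illustration of that switched-target idea under a simplified InfoNCE-style contrastive loss; it is not the actual wav2vec 2.0 objective, and the array names, shapes, and loss form are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def contrastive_loss(context, targets, temperature=0.1):
    """InfoNCE-style loss: each context frame must identify its own target
    frame among all other target frames (which act as negatives)."""
    sims = l2_normalize(context) @ l2_normalize(targets).T / temperature
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# Toy stand-ins for "contextualized" encoder outputs and "quantized" targets
# of an original utterance and a noise-augmented copy of it.
T, D = 50, 16
ctx_orig, ctx_noisy = rng.normal(size=(T, D)), rng.normal(size=(T, D))
q_orig, q_noisy = rng.normal(size=(T, D)), rng.normal(size=(T, D))

# Standard objective: each view predicts its own quantized targets ...
loss_same = contrastive_loss(ctx_orig, q_orig) + contrastive_loss(ctx_noisy, q_noisy)
# ... switched objective: the original context predicts the noisy targets and
# vice versa, encouraging representations that are invariant to the noise.
loss_switched = contrastive_loss(ctx_orig, q_noisy) + contrastive_loss(ctx_noisy, q_orig)
print(loss_same + loss_switched)
```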
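The "Improved Model for Voicing Silent Speech" entry above names its two building blocks (convolutional feature extraction followed by Transformer layers) but not their configuration. Below is a minimal PyTorch sketch of that kind of stack; the class name, layer sizes, strides, and the mel-style output dimension are illustrative assumptions, not the paper's released architecture.

```python
import torch
import torch.nn as nn

class EMGToSpeechFeatures(nn.Module):
    """Hypothetical sketch: convolutions over multichannel EMG extract local
    features, a Transformer encoder propagates information across longer
    distances, and a linear head emits per-frame speech features."""

    def __init__(self, emg_channels=8, d_model=128, n_mels=80):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(emg_channels, d_model, kernel_size=5, stride=2, padding=2),
            nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=5, stride=2, padding=2),
            nn.ReLU(),
        )
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, dim_feedforward=256, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=4)
        self.out = nn.Linear(d_model, n_mels)

    def forward(self, emg):                  # emg: (batch, time, channels)
        x = self.conv(emg.transpose(1, 2))   # conv expects (batch, channels, time)
        x = self.transformer(x.transpose(1, 2))
        return self.out(x)                   # (batch, time/4, n_mels)

# Toy usage: 2 utterances, 800 EMG frames, 8 channels.
model = EMGToSpeechFeatures()
dummy_emg = torch.randn(2, 800, 8)
print(model(dummy_emg).shape)  # torch.Size([2, 200, 80])
```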