Silent versus modal multi-speaker speech recognition from ultrasound and video
- URL: http://arxiv.org/abs/2103.00333v1
- Date: Sat, 27 Feb 2021 21:34:48 GMT
- Title: Silent versus modal multi-speaker speech recognition from ultrasound and video
- Authors: Manuel Sam Ribeiro, Aciel Eshky, Korin Richmond, Steve Renals
- Abstract summary: We investigate multi-speaker speech recognition from ultrasound images of the tongue and video images of the lips.
We train our systems on imaging data from modal speech, and evaluate on matched test sets of two speaking modes: silent and modal speech.
We observe that silent speech recognition from imaging data underperforms compared to modal speech recognition, likely due to a speaking-mode mismatch between training and testing.
- Score: 43.919073642794324
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We investigate multi-speaker speech recognition from ultrasound images of the
tongue and video images of the lips. We train our systems on imaging data from
modal speech, and evaluate on matched test sets of two speaking modes: silent
and modal speech. We observe that silent speech recognition from imaging data
underperforms compared to modal speech recognition, likely due to a
speaking-mode mismatch between training and testing. We improve silent speech
recognition performance using techniques that address the domain mismatch, such
as fMLLR and unsupervised model adaptation. We also analyse the properties of
silent and modal speech in terms of utterance duration and the size of the
articulatory space. To estimate the articulatory space, we compute the convex
hull of tongue splines, extracted from ultrasound tongue images. Overall, we
observe that the duration of silent speech is longer than that of modal speech,
and that silent speech covers a smaller articulatory space than modal speech.
Although these two properties are statistically significant across speaking
modes, they do not directly correlate with word error rates from speech
recognition.
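The abstract does not spell out how the articulatory space is measured beyond "the convex hull of tongue splines". As a rough sketch of that idea (not the authors' code), the snippet below pools 2D tongue-spline points and returns the area of their convex hull; the array shapes and variable names are illustrative assumptions.

```python
# Sketch: estimate articulatory space as the area of the convex hull of
# pooled tongue-spline points. Assumes each spline is an (N, 2) array of
# (x, y) coordinates extracted from ultrasound frames; shapes and names
# here are illustrative, not taken from the paper.
import numpy as np
from scipy.spatial import ConvexHull

def articulatory_space_area(splines):
    """Pool all spline points for one speaker/mode and return the area
    of their 2D convex hull."""
    points = np.vstack(splines)   # (total_points, 2)
    hull = ConvexHull(points)
    return hull.volume            # for 2D input, .volume is the area (.area is the perimeter)

# Toy usage with random stand-in "splines"
rng = np.random.default_rng(0)
fake_splines = [rng.normal(size=(40, 2)) for _ in range(100)]
print(articulatory_space_area(fake_splines))
```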
Related papers
- Speech2rtMRI: Speech-Guided Diffusion Model for Real-time MRI Video of the Vocal Tract during Speech [29.510756530126837]
We introduce a data-driven method to visually represent articulator motion in MRI videos of the human vocal tract during speech.
We leverage large pre-trained speech models, which are embedded with prior knowledge, to generalize the visual domain to unseen data.
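The summary does not name the pre-trained speech model used by Speech2rtMRI. Purely to illustrate the general idea of leveraging a pre-trained speech encoder, here is a hedged sketch that extracts frame-level embeddings with wav2vec 2.0 via torchaudio; the choice of model and its use as conditioning are assumptions, not details from the paper.

```python
# Sketch only: frame-level embeddings from a pre-trained speech model
# (wav2vec 2.0 via torchaudio) that could condition a video-generation
# model. The specific pre-trained model used by Speech2rtMRI is not
# stated in the summary above; this choice is an assumption.
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_BASE
model = bundle.get_model().eval()

waveform = torch.randn(1, bundle.sample_rate)   # 1 s of dummy audio at the model's rate
with torch.inference_mode():
    features, _ = model.extract_features(waveform)
# features is a list of per-layer tensors of shape (1, frames, 768);
# the last layer could serve as the conditioning sequence.
print(features[-1].shape)
```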
arXiv Detail & Related papers (2024-09-23T20:19:24Z)
- Exploring Speech Recognition, Translation, and Understanding with Discrete Speech Units: A Comparative Study [68.88536866933038]
Speech signals, typically sampled at rates in the tens of thousands per second, contain redundancies.
Recent investigations proposed the use of discrete speech units derived from self-supervised learning representations.
Applying various methods, such as de-duplication and subword modeling, can further compress the speech sequence length.
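The de-duplication step mentioned above is simple to illustrate: collapse runs of identical discrete unit IDs so the sequence shortens. A minimal sketch, with made-up unit values:

```python
# Collapse runs of identical discrete speech units (e.g. k-means cluster
# IDs of self-supervised features) to shorten the sequence. Unit values
# below are invented for illustration.
from itertools import groupby

def deduplicate(units):
    """Remove consecutive repeated unit IDs."""
    return [unit for unit, _ in groupby(units)]

units = [12, 12, 12, 7, 7, 98, 98, 98, 98, 7]
print(deduplicate(units))   # [12, 7, 98, 7]
```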
arXiv Detail & Related papers (2023-09-27T17:21:13Z)
- Improving the Gap in Visual Speech Recognition Between Normal and Silent Speech Based on Metric Learning [11.50011780498048]
This paper presents a novel metric learning approach to address the performance gap between normal and silent speech in visual speech recognition (VSR).
We propose to leverage the shared literal content between normal and silent speech and present a metric learning approach based on visemes.
Our evaluation demonstrates that our method improves the accuracy of silent VSR, even when limited training data is available.
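The exact viseme-based loss is not given in this summary. As a generic stand-in, a standard triplet margin loss over lip-frame embeddings (anchor from silent speech, positive the same viseme from normal speech, negative a different viseme) conveys the idea; the embedding dimension and margin below are illustrative assumptions, not the paper's settings.

```python
# Generic triplet margin loss over viseme-labelled lip-frame embeddings.
# Not the exact objective of the paper above; dimensions and margin are
# assumptions for illustration.
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    d_pos = np.linalg.norm(anchor - positive)   # same viseme, different speaking mode
    d_neg = np.linalg.norm(anchor - negative)   # different viseme
    return max(0.0, d_pos - d_neg + margin)

rng = np.random.default_rng(1)
a, p, n = rng.normal(size=(3, 128))   # dummy 128-dim embeddings
print(triplet_loss(a, p, n))
```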
arXiv Detail & Related papers (2023-05-23T16:20:46Z)
- Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z)
- Self-Supervised Speech Representations Preserve Speech Characteristics while Anonymizing Voices [15.136348385992047]
We train several voice conversion models using self-supervised speech representations.
Converted voices retain a low word error rate, within 1% of that of the original voice.
Experiments on dysarthric speech data show that speech features relevant to articulation, prosody, phonation and phonology can be extracted from anonymized voices.
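For reference, the word error rate quoted above is the word-level edit distance (substitutions, deletions, insertions) between hypothesis and reference, normalised by the reference length. A minimal implementation of that standard metric:

```python
# Word error rate via dynamic-programming edit distance.
def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sit on mat"))  # ~0.33
```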
arXiv Detail & Related papers (2022-04-04T17:48:01Z)
- Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z)
- Speaker Extraction with Co-Speech Gestures Cue [79.91394239104908]
We explore the use of co-speech gesture sequences, e.g. hand and body movements, as the speaker cue for speaker extraction.
We propose two networks that use the co-speech gesture cue to perform attentive listening on the target speaker.
The experimental results show that the co-speech gesture cue is informative for associating the target speaker, and the quality of the extracted speech shows significant improvements over the unprocessed mixture speech.
arXiv Detail & Related papers (2022-03-31T06:48:52Z)
- "Notic My Speech" -- Blending Speech Patterns With Multimedia [65.91370924641862]
We propose a view-temporal attention mechanism to model both the view dependence and the visemic importance in speech recognition and understanding.
Our proposed method outperformed the existing work by 4.99% in terms of the viseme error rate.
We show that there is a strong correlation between our model's understanding of multi-view speech and human perception.
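The view-temporal attention mechanism itself is not detailed in this summary. As a neutral reference point, plain scaled dot-product attention in NumPy shows the kind of weighting such a block computes over views and time steps; it is not the paper's exact formulation.

```python
# Generic scaled dot-product attention, as a rough reference for what an
# attention block computes; the paper's view-temporal variant is not
# specified in the summary above.
import numpy as np

def attention(queries, keys, values):
    """queries: (T, d), keys/values: (S, d) -> (T, d)."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)             # (T, S)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over S
    return weights @ values

rng = np.random.default_rng(2)
q, k, v = rng.normal(size=(3, 10, 64))   # dummy query/key/value sequences
print(attention(q, k, v).shape)          # (10, 64)
```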
arXiv Detail & Related papers (2020-06-12T06:51:55Z)
- MultiQT: Multimodal Learning for Real-Time Question Tracking in Speech [4.384576489684272]
We propose a novel approach to real-time sequence labeling in speech.
Our model treats speech and its own textual representation as two separate modalities or views.
We show significant gains of jointly learning from the two modalities when compared to text or audio only.
arXiv Detail & Related papers (2020-05-02T12:16:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.