Improving the Gap in Visual Speech Recognition Between Normal and Silent Speech Based on Metric Learning
- URL: http://arxiv.org/abs/2305.14203v2
- Date: Mon, 16 Oct 2023 05:06:22 GMT
- Title: Improving the Gap in Visual Speech Recognition Between Normal and Silent Speech Based on Metric Learning
- Authors: Sara Kashiwagi, Keitaro Tanaka, Qi Feng, Shigeo Morishima
- Abstract summary: This paper presents a novel metric learning approach to address the performance gap between normal and silent speech in visual speech recognition (VSR).
We propose to leverage the shared literal content between normal and silent speech and present a metric learning approach based on visemes.
Our evaluation demonstrates that our method improves the accuracy of silent VSR, even when limited training data is available.
- Score: 11.50011780498048
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents a novel metric learning approach to address the
performance gap between normal and silent speech in visual speech recognition
(VSR). The difference in lip movements between the two poses a challenge for
existing VSR models, which exhibit degraded accuracy when applied to silent
speech. To solve this issue and tackle the scarcity of training data for silent
speech, we propose to leverage the shared literal content between normal and
silent speech and present a metric learning approach based on visemes.
Specifically, we aim to map the input of two speech types close to each other
in a latent space if they have similar viseme representations. By minimizing
the Kullback-Leibler divergence of the predicted viseme probability
distributions between and within the two speech types, our model effectively
learns and predicts viseme identities. Our evaluation demonstrates that our
method improves the accuracy of silent VSR, even when limited training data is
available.
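As a concrete illustration of the objective described in the abstract, the following is a minimal PyTorch sketch of a viseme-level loss that combines supervised viseme prediction with a cross-speech-type KL term. The pairing of normal and silent clips, the symmetric form of the KL divergence, and all function and variable names are illustrative assumptions rather than the authors' implementation.
```python
# Minimal sketch of a viseme-level metric learning loss (assumed details:
# symmetric KL, paired clips, names); not the authors' released code.
import torch
import torch.nn.functional as F

def viseme_kl_loss(logits_normal: torch.Tensor,
                   logits_silent: torch.Tensor,
                   viseme_targets: torch.Tensor,
                   kl_weight: float = 1.0) -> torch.Tensor:
    """Pulls viseme distributions of paired normal/silent inputs together.

    logits_normal, logits_silent: (batch, num_visemes) predictions for clips
        that share the same literal (viseme) content.
    viseme_targets: (batch,) ground-truth viseme ids.
    """
    log_p_normal = F.log_softmax(logits_normal, dim=-1)
    log_p_silent = F.log_softmax(logits_silent, dim=-1)

    # "Within" each speech type: supervised viseme prediction.
    ce = (F.nll_loss(log_p_normal, viseme_targets)
          + F.nll_loss(log_p_silent, viseme_targets))

    # "Between" the two speech types: KL divergence applied in both directions
    # so neither distribution is treated as the fixed reference.
    kl = (F.kl_div(log_p_silent, log_p_normal.exp(), reduction="batchmean")
          + F.kl_div(log_p_normal, log_p_silent.exp(), reduction="batchmean"))

    return ce + kl_weight * kl
```
In a training loop, logits_normal and logits_silent would come from the same VSR backbone applied to a normal-speech clip and a silent-speech clip of the same utterance.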
Related papers
- Contrastive and Consistency Learning for Neural Noisy-Channel Model in Spoken Language Understanding [1.07288078404291]
We propose a natural language understanding approach based on Automatic Speech Recognition (ASR).
We improve a noisy-channel model to handle transcription inconsistencies caused by ASR errors.
Experiments on four benchmark datasets show that Contrastive and Consistency Learning (CCL) outperforms existing methods.
arXiv Detail & Related papers (2024-05-23T23:10:23Z)
- Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation [55.15299351110525]
This paper explores sentence-level multilingual Visual Speech Recognition (VSR) that can recognize different languages with a single trained model.
We propose a novel training strategy, processing with visual speech units.
We set new state-of-the-art multilingual VSR performance, achieving results comparable to previous language-specific VSR models.
arXiv Detail & Related papers (2024-01-18T08:46:02Z)
- Learning Speech Representation From Contrastive Token-Acoustic Pretraining [57.08426714676043]
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space.
The proposed CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR.
arXiv Detail & Related papers (2023-09-01T12:35:43Z)
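The "two encoders into a joint multimodal space" idea summarized in the CTAP entry above is typically trained with a symmetric contrastive objective. The sketch below is a generic CLIP-style version with in-batch negatives; it is an assumption for illustration, not the paper's exact frame-level objective.
```python
# Generic CLIP-style contrastive loss over paired phoneme/speech embeddings;
# an illustrative stand-in, not CTAP's exact frame-level objective.
import torch
import torch.nn.functional as F

def paired_contrastive_loss(phoneme_emb: torch.Tensor,
                            speech_emb: torch.Tensor,
                            temperature: float = 0.07) -> torch.Tensor:
    """phoneme_emb, speech_emb: (batch, dim); row i of each is a matched pair."""
    phoneme_emb = F.normalize(phoneme_emb, dim=-1)
    speech_emb = F.normalize(speech_emb, dim=-1)
    logits = phoneme_emb @ speech_emb.t() / temperature  # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy: match phonemes to speech and speech to phonemes.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```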
- Zero-shot text-to-speech synthesis conditioned using self-supervised speech representation model [13.572330725278066]
A novel point of the proposed method is the direct use of the SSL model to obtain embedding vectors from speech representations trained with a large amount of data.
The disentangled embeddings enable better reproduction performance for unseen speakers and rhythm transfer conditioned on different speech utterances.
arXiv Detail & Related papers (2023-04-24T10:15:58Z)
- Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z)
- Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for Robust Speech Recognition [52.71604809100364]
We propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech.
Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network.
In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech as additional prediction targets.
arXiv Detail & Related papers (2021-10-11T00:08:48Z)
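The switched-target mechanism described in the Wav2vec-Switch entry above can be sketched as a contrastive loss in which each view of the speech also predicts the other view's quantized targets. Note that wav2vec 2.0 samples negatives from other masked time steps of the same utterance; the in-batch negatives and all names below are simplifying assumptions.
```python
# Simplified "switched targets" sketch: each view's context vectors are asked
# to identify the quantized targets of the other view. Real wav2vec 2.0 draws
# negatives from other masked time steps; in-batch negatives are used here.
import torch
import torch.nn.functional as F

def contrastive(context: torch.Tensor, quantized: torch.Tensor,
                temperature: float = 0.1) -> torch.Tensor:
    """context, quantized: (num_masked_steps, dim); row i is the positive pair."""
    sims = F.cosine_similarity(context.unsqueeze(1), quantized.unsqueeze(0), dim=-1)
    targets = torch.arange(sims.size(0), device=sims.device)
    return F.cross_entropy(sims / temperature, targets)

def switched_loss(c_orig, q_orig, c_noisy, q_noisy):
    # Standard terms: each view predicts its own quantized targets.
    same = contrastive(c_orig, q_orig) + contrastive(c_noisy, q_noisy)
    # Switched terms: each view predicts the other view's quantized targets,
    # which encourages noise-invariant contextual representations.
    switched = contrastive(c_orig, q_noisy) + contrastive(c_noisy, q_orig)
    return same + switched
```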
- VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training.
Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations.
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
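To make the "vector quantization for content encoding" step in the VQMIVC entry above concrete, the sketch below shows a standard VQ-VAE style quantizer with a straight-through estimator and commitment loss. VQMIVC's actual quantizer and its mutual-information regularizer are not reproduced here; this is purely illustrative.
```python
# Standard VQ-VAE style quantizer with straight-through gradients; shown only
# to illustrate VQ-based content encoding, not VQMIVC's actual module.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes: int = 512, dim: int = 64, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.beta = beta

    def forward(self, z: torch.Tensor):
        """z: (batch, time, dim) continuous content features from the encoder."""
        flat = z.reshape(-1, z.size(-1))
        # Nearest codebook entry for every frame.
        codes = torch.cdist(flat, self.codebook.weight).argmin(dim=-1)
        quantized = self.codebook(codes).view_as(z)
        # Codebook loss + commitment loss.
        vq_loss = (F.mse_loss(quantized, z.detach())
                   + self.beta * F.mse_loss(z, quantized.detach()))
        # Straight-through estimator: copy gradients from quantized to z.
        quantized = z + (quantized - z).detach()
        return quantized, vq_loss
```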
- Silent versus modal multi-speaker speech recognition from ultrasound and video [43.919073642794324]
We investigate multi-speaker speech recognition from ultrasound images of the tongue and video images of the lips.
We train our systems on imaging data from modal speech, and evaluate on matched test sets of two speaking modes: silent and modal speech.
We observe that silent speech recognition from imaging data underperforms compared to modal speech recognition, likely due to a speaking-mode mismatch between training and testing.
arXiv Detail & Related papers (2021-02-27T21:34:48Z)
- SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.