Leveraging Visemes for Better Visual Speech Representation and Lip Reading
- URL: http://arxiv.org/abs/2307.10157v1
- Date: Wed, 19 Jul 2023 17:38:26 GMT
- Title: Leveraging Visemes for Better Visual Speech Representation and Lip Reading
- Authors: Javad Peymanfard, Vahid Saeedi, Mohammad Reza Mohammadi, Hossein Zeinali, Nasser Mozayani
- Abstract summary: We propose a novel approach that leverages visemes, which are groups of phonetically similar lip shapes, to extract more discriminative and robust video features for lip reading.
The proposed method reduces the lip-reading word error rate (WER) by 9.1% relative to the best previous method.
- Score: 2.7836084563851284
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Lip reading is a challenging task that has many potential applications in
speech recognition, human-computer interaction, and security systems. However,
existing lip reading systems often suffer from low accuracy due to the
limitations of video features. In this paper, we propose a novel approach that
leverages visemes, which are groups of phonetically similar lip shapes, to
extract more discriminative and robust video features for lip reading. We
evaluate our approach on various tasks, including word-level and sentence-level
lip reading, and audiovisual speech recognition using the Arman-AV dataset, a
large-scale Persian corpus. Our experimental results show that our viseme-based
approach consistently outperforms the state-of-the-art methods in all these
tasks. The proposed method reduces the lip-reading word error rate (WER) by
9.1% relative to the best previous method.
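To make the core idea concrete, here is a minimal sketch, not the authors' implementation: phoneme labels are collapsed into viseme classes so the video model trains on lip-shape targets. The phoneme-to-viseme table, the function names, and the small WER helper are illustrative assumptions; real viseme inventories are language-specific (the paper works with Persian).

```python
# Illustrative sketch only (not the paper's implementation): collapse phoneme
# labels into viseme classes so a lip-reading model can be trained against
# lip-shape targets. The mapping below is a toy example; real viseme
# inventories are language-specific.
PHONEME_TO_VISEME = {
    "p": "V_bilabial", "b": "V_bilabial", "m": "V_bilabial",  # same lip closure
    "f": "V_labiodental", "v": "V_labiodental",               # lip-teeth contact
    "o": "V_rounded", "u": "V_rounded",                       # rounded lips
    "a": "V_open",                                            # open jaw
}

def phonemes_to_visemes(phonemes):
    """Map a phoneme sequence to viseme classes, merging consecutive repeats
    that are visually indistinguishable on video."""
    visemes = [PHONEME_TO_VISEME.get(p, "V_other") for p in phonemes]
    return [v for i, v in enumerate(visemes) if i == 0 or v != visemes[i - 1]]

def relative_wer_reduction(baseline_wer, new_wer):
    """Relative (not absolute) WER reduction, the metric quoted above."""
    return (baseline_wer - new_wer) / baseline_wer

print(phonemes_to_visemes(["b", "a", "m", "u"]))      # ['V_bilabial', 'V_open', 'V_bilabial', 'V_rounded']
print(round(relative_wer_reduction(30.0, 27.27), 3))  # 0.091, i.e. 9.1% relative
```

Note that the reported 9.1% is a relative reduction: it would take, say, a 30.0% baseline WER down to roughly 27.3%, not down to 20.9%.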
Related papers
- Analysis of Visual Features for Continuous Lipreading in Spanish [0.0]
Lipreading is a complex task whose objective is to interpret speech when audio is not available.
We propose an analysis of different speech visual features with the intention of identifying which of them is the best approach to capture the nature of lip movements for natural Spanish.
arXiv Detail & Related papers (2023-11-21T09:28:00Z)
- Seeing What You Said: Talking Face Generation Guided by a Lip Reading Expert [89.07178484337865]
Talking face generation, also known as speech-to-lip generation, synthesizes lip movements that are consistent with a given speech input.
Previous studies revealed the importance of lip-speech synchronization and visual quality.
We propose using a lip-reading expert to improve the intelligibility of the generated lip regions.
arXiv Detail & Related papers (2023-03-29T07:51:07Z)
- Is Lip Region-of-Interest Sufficient for Lipreading? [24.294559985408192]
We propose to adopt the entire face for lipreading with self-supervised learning.
AV-HuBERT, an audio-visual multi-modal self-supervised learning framework, was adopted in our experiments.
arXiv Detail & Related papers (2022-05-28T01:34:24Z)
- Sub-word Level Lip Reading With Visual Attention [88.89348882036512]
We focus on the unique challenges encountered in lip reading and propose tailored solutions.
We obtain state-of-the-art results on the challenging LRS2 and LRS3 benchmarks when training on public datasets.
Our best model achieves 22.6% word error rate on the LRS2 dataset, a performance unprecedented for lip reading models.
arXiv Detail & Related papers (2021-10-14T17:59:57Z)
- LiRA: Learning Visual Speech Representations from Audio through Self-supervision [53.18768477520411]
We propose Learning visual speech Representations from Audio via self-supervision (LiRA).
Specifically, we train a ResNet+Conformer model to predict acoustic features from unlabelled visual speech; a rough sketch of this kind of objective appears after this list.
We show that our approach significantly outperforms other self-supervised methods on the Lip Reading in the Wild dataset.
arXiv Detail & Related papers (2021-06-16T23:20:06Z)
- Learn an Effective Lip Reading Model without Pains [96.21025771586159]
Lip reading, also known as visual speech recognition, aims to recognize the speech content from videos by analyzing the lip dynamics.
Most existing methods obtained high performance by constructing a complex neural network together with customized training strategies.
We find that making proper use of these training strategies can consistently bring substantial improvements without changing much of the model.
arXiv Detail & Related papers (2020-11-15T15:29:19Z)
- Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis [37.37319356008348]
We explore the task of lip to speech synthesis, i.e., learning to generate natural speech given only the lip movements of a speaker.
We focus on learning accurate lip sequences to speech mappings for individual speakers in unconstrained, large vocabulary settings.
We propose a novel approach with key design choices to achieve accurate, natural lip to speech synthesis.
arXiv Detail & Related papers (2020-05-17T10:29:19Z)
- Mutual Information Maximization for Effective Lip Reading [99.11600901751673]
We propose to introduce mutual information constraints at both the local feature level and the global sequence level.
By combining these two advantages together, the proposed method is expected to be both discriminative and robust for effective lip reading.
arXiv Detail & Related papers (2020-03-13T18:47:42Z)
- Can We Read Speech Beyond the Lips? Rethinking RoI Selection for Deep Visual Speech Recognition [90.61063126619182]
We evaluate the effects of different facial regions with state-of-the-art visual speech recognition models.
We find that incorporating information from extraoral facial regions, even the upper face, consistently benefits VSR performance.
arXiv Detail & Related papers (2020-03-06T13:52:46Z)
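As referenced in the LiRA entry above, here is a rough sketch of that style of self-supervised objective: a visual encoder is trained to regress frame-level acoustic features extracted from the unlabelled audio track. The MLP encoder (a stand-in for the ResNet+Conformer), the feature dimensions, and the L1 loss are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Sketch of a LiRA-style self-supervised objective under stated assumptions:
# predict per-frame acoustic features (computed from the audio track) from
# lip-region video frames, so no transcriptions are required.

class VisualToAcoustic(nn.Module):
    def __init__(self, frame_dim=96 * 96, acoustic_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(  # stand-in for ResNet + Conformer
            nn.Linear(frame_dim, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
        )
        self.head = nn.Linear(512, acoustic_dim)

    def forward(self, lip_frames):
        # lip_frames: (batch, time, frame_dim) flattened lip crops
        return self.head(self.encoder(lip_frames))

model = VisualToAcoustic()
frames = torch.randn(4, 75, 96 * 96)        # 4 clips of 75 video frames each
acoustic_targets = torch.randn(4, 75, 256)  # features computed from audio
loss = nn.functional.l1_loss(model(frames), acoustic_targets)
loss.backward()  # gradients train the visual encoder
```

Because the targets come from the audio itself, the visual features can be learned without any labels and later fine-tuned for lip reading.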
This list is automatically generated from the titles and abstracts of the papers on this site.