Lip reading using external viseme decoding
- URL: http://arxiv.org/abs/2104.04784v1
- Date: Sat, 10 Apr 2021 14:49:11 GMT
- Title: Lip reading using external viseme decoding
- Authors: Javad Peymanfard, Mohammad Reza Mohammadi, Hossein Zeinali and Nasser Mozayani
- Abstract summary: This paper shows how to use external text data (for viseme-to-character mapping) by dividing video-to-character conversion into two stages.
Our proposed method improves the word error rate by 4% compared to a standard sequence-to-sequence lip-reading model on the BBC-Oxford Lip Reading Sentences 2 dataset.
- Score: 4.728757318184405
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Lip-reading is the task of recognizing speech from lip movements. This is difficult because the lip movements for many words look similar when they are pronounced. A viseme is the visual unit used to describe lip movements during a conversation. This paper shows how to use external text data (for viseme-to-character mapping) by dividing the video-to-character task into two stages, namely converting video to visemes and then converting visemes to characters, using separate models. Our proposed method improves the word error rate by 4% compared to a standard sequence-to-sequence lip-reading model on the BBC-Oxford Lip Reading Sentences 2 (LRS2) dataset.
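To make the two-stage idea concrete, below is a minimal, hypothetical Python sketch (not the authors' code): stage 1 (video to viseme) is stubbed out, and stage 2 (viseme to word) is a toy decoder built only from external text. It illustrates both why several words collapse to the same viseme sequence and how external text data can disambiguate them. The character-to-viseme grouping and the tiny corpus are illustrative assumptions, not the mapping or data used in the paper.

```python
# Minimal sketch of the two-stage pipeline described above (hypothetical,
# not the authors' code): stage 1 (video -> visemes) is stubbed out, and
# stage 2 (visemes -> words) is learned purely from external text.
from collections import Counter, defaultdict

# Hypothetical character-level viseme grouping: characters that look alike
# on the lips share one viseme symbol (e.g. p/b/m are indistinguishable).
# This is an illustrative approximation, not the paper's mapping.
VISEME_GROUPS = {
    "B": "pbm", "F": "fv", "T": "tdnszl", "K": "kg",
    "A": "a", "E": "eiy", "O": "ou", "W": "wr",
}
CHAR_TO_VISEME = {c: v for v, chars in VISEME_GROUPS.items() for c in chars}

def word_to_visemes(word):
    """Collapse a word into its (lossy) viseme sequence."""
    return tuple(CHAR_TO_VISEME.get(c, "?") for c in word.lower())

def video_to_visemes(video_frames):
    """Stage 1 stub: a real model would predict visemes from lip frames."""
    return ("B", "A", "T")  # hypothetical output, e.g. for 'pat'/'bat'/'mat'

def build_viseme_decoder(external_text):
    """Stage 2: map viseme sequences to word candidates using text only."""
    table = defaultdict(Counter)
    for word in external_text.split():
        table[word_to_visemes(word)][word] += 1
    return table

def decode(viseme_seq, table):
    """Pick the most frequent word among all candidates for this sequence."""
    candidates = table.get(tuple(viseme_seq))
    return candidates.most_common(1)[0][0] if candidates else None

if __name__ == "__main__":
    # Tiny stand-in for the external text corpus (assumption for illustration).
    corpus = "pat the dog . bat and ball . bat bat bat . mat on the floor ."
    table = build_viseme_decoder(corpus)
    visemes = video_to_visemes(video_frames=None)
    print("viseme sequence:", visemes)
    print("candidate words:", dict(table[visemes]))   # pat/bat/mat collide
    print("decoded word:   ", decode(visemes, table)) # text stats pick 'bat'
```

In the paper both stages are sequence-to-sequence neural models; the point of the sketch is only that the viseme-to-character stage can be trained on abundant external text, whereas the video stage needs labelled video.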
Related papers
- Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a Short Video [91.92782707888618]
We present a decomposition-composition framework named Speech to Lip (Speech2Lip) that disentangles speech-sensitive and speech-insensitive motion/appearance.
We show that our model can be trained on a video only a few minutes long and achieves state-of-the-art performance in both visual quality and speech-visual synchronization.
arXiv Detail & Related papers (2023-09-09T14:52:39Z)
- Leveraging Visemes for Better Visual Speech Representation and Lip Reading [2.7836084563851284]
We propose a novel approach that leverages visemes, which are groups of phonetically similar lip shapes, to extract more discriminative and robust video features for lip reading.
The proposed method reduces the lip-reading word error rate (WER) by 9.1% relative to the best previous method.
arXiv Detail & Related papers (2023-07-19T17:38:26Z)
- Exploring Phonetic Context-Aware Lip-Sync For Talking Face Generation [58.72068260933836]
We propose a Context-Aware LipSync framework (CALS).
CALS comprises an Audio-to-Lip map module and a Lip-to-Face module.
arXiv Detail & Related papers (2023-05-31T04:50:32Z)
- Seeing What You Said: Talking Face Generation Guided by a Lip Reading Expert [89.07178484337865]
Talking face generation, also known as speech-to-lip generation, reconstructs lip-related facial motions from coherent speech input.
Previous studies revealed the importance of lip-speech synchronization and visual quality.
We propose using a lip-reading expert to improve the intelligibility of the generated lip regions.
arXiv Detail & Related papers (2023-03-29T07:51:07Z)
- LipFormer: Learning to Lipread Unseen Speakers based on Visual-Landmark Transformers [43.13868262922689]
State-of-the-art lipreading methods excel at interpreting overlapped speakers, i.e., speakers seen during training.
Generalizing these methods to unseen speakers incurs catastrophic performance degradation.
We develop a sentence-level lipreading framework based on visual-landmark transformers, namely LipFormer.
arXiv Detail & Related papers (2023-02-04T10:22:18Z)
- A Multimodal German Dataset for Automatic Lip Reading Systems and Transfer Learning [18.862801476204886]
We present the dataset GLips (German Lips) consisting of 250,000 publicly available videos of the faces of speakers of the Hessian Parliament.
The format is similar to that of the English language LRW (Lip Reading in the Wild) dataset, with each video encoding one word of interest in a context of 1.16 seconds duration.
By training a deep neural network, we investigate whether lip reading has language-independent features, so that datasets of different languages can be used to improve lip reading models.
arXiv Detail & Related papers (2022-02-27T17:37:35Z)
- Sub-word Level Lip Reading With Visual Attention [88.89348882036512]
We focus on the unique challenges encountered in lip reading and propose tailored solutions.
We obtain state-of-the-art results on the challenging LRS2 and LRS3 benchmarks when training on public datasets.
Our best model achieves 22.6% word error rate on the LRS2 dataset, a performance unprecedented for lip reading models.
arXiv Detail & Related papers (2021-10-14T17:59:57Z)
- SimulLR: Simultaneous Lip Reading Transducer with Attention-Guided Adaptive Memory [61.44510300515693]
We study the task of simultaneous lip reading and devise SimulLR, a simultaneous lip reading transducer with attention-guided adaptive memory.
The experiments show that SimulLR achieves a 9.10x translation speedup compared with state-of-the-art non-simultaneous methods.
arXiv Detail & Related papers (2021-08-31T05:54:16Z)
- Disentangling Homophemes in Lip Reading using Perplexity Analysis [10.262299768603894]
This paper proposes a new application of the Generative Pre-Training (GPT) transformer.
It serves as a language model that converts visual speech, in the form of visemes, into language in the form of words and sentences.
The network performs the viseme-to-word mapping by searching for the hypothesis with optimal perplexity; a minimal sketch of this idea appears after the list below.
arXiv Detail & Related papers (2020-11-28T12:12:17Z)
- A Study on Lip Localization Techniques used for Lip reading from a Video [0.0]
Lip reading is useful for Automatic Speech Recognition when the audio signal is absent or weak, with or without noise, in communication systems.
The techniques can be applied to asymmetric lips as well as to mouths with visible teeth, a visible tongue, or a moustache.
arXiv Detail & Related papers (2020-09-28T15:36:35Z)
- Deformation Flow Based Two-Stream Network for Lip Reading [90.61063126619182]
Lip reading is the task of recognizing the speech content by analyzing movements in the lip region when people are speaking.
We observe the continuity in adjacent frames in the speaking process, and the consistency of the motion patterns among different speakers when they pronounce the same phoneme.
We introduce a Deformation Flow Network (DFN) to learn the deformation flow between adjacent frames, which directly captures the motion information within the lip region.
The learned deformation flow is then combined with the original grayscale frames with a two-stream network to perform lip reading.
arXiv Detail & Related papers (2020-03-12T11:13:44Z)
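For the perplexity-based viseme-to-word mapping referenced in the "Disentangling Homophemes" entry above, here is a minimal, hypothetical Python sketch (not the paper's GPT-based model): candidate words that look identical on the lips are expanded into candidate sentences, and a tiny add-one-smoothed bigram language model, standing in for the transformer, selects the sentence with the lowest perplexity. The training text and homopheme sets are illustrative assumptions.

```python
# Hypothetical sketch of perplexity-based viseme-to-word decoding (a toy
# bigram LM stands in for the GPT language model used in the paper).
import itertools
import math
from collections import Counter

# Tiny stand-in for the external text used to train the language model.
TRAIN_TEXT = ("we met at the park . we bet on the game . "
              "they met at the station . i walked in the park .")

def train_bigram_lm(text):
    """Return an add-one-smoothed bigram log-probability function."""
    tokens = ["<s>"] + text.split()
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    vocab_size = len(unigrams)
    def logprob(prev, word):
        return math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))
    return logprob

def perplexity(sentence, logprob):
    """Per-token perplexity of a sentence under the bigram model."""
    tokens = ["<s>"] + sentence.split()
    total = sum(logprob(p, w) for p, w in zip(tokens, tokens[1:]))
    return math.exp(-total / (len(tokens) - 1))

if __name__ == "__main__":
    logprob = train_bigram_lm(TRAIN_TEXT)
    # Hypothetical homopheme sets: each slot lists words whose viseme
    # sequences are indistinguishable on the lips (e.g. met/bet, park/bark).
    slots = [["we"], ["met", "bet"], ["at"], ["the"], ["park", "bark"]]
    hypotheses = [" ".join(ws) for ws in itertools.product(*slots)]
    ranked = sorted(hypotheses, key=lambda s: perplexity(s, logprob))
    for s in ranked:
        print(f"{perplexity(s, logprob):8.2f}  {s}")
    print("selected:", ranked[0])  # lowest perplexity wins
```

In the actual paper a GPT language model scores the hypotheses; the bigram model here only keeps the example self-contained.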
This list is automatically generated from the titles and abstracts of the papers in this site.