A Study on Lip Localization Techniques used for Lip reading from a Video
- URL: http://arxiv.org/abs/2009.13420v1
- Date: Mon, 28 Sep 2020 15:36:35 GMT
- Title: A Study on Lip Localization Techniques used for Lip reading from a Video
- Authors: S.D. Lalitha, K.K. Thyagharajan
- Abstract summary: Lip reading is useful for Automatic Speech Recognition when the audio is absent, weak, or noisy in communication systems.
The techniques can be applied to asymmetric lips and to mouths with visible teeth, a visible tongue, or a moustache.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, some of the different techniques used to localize the lips
within the face are discussed and compared along with their processing steps.
Lip localization is the basic step needed to read the lips and extract visual
information from the video input. The techniques can be applied to asymmetric
lips as well as to mouths with visible teeth, a visible tongue, or a moustache.
The process of lip reading generally involves the following steps: first,
locating the lips in the first frame of the video input; then, tracking the
lips in the subsequent frames using the pixel points obtained in the initial
step; and finally, converting the tracked lip model to its corresponding
matched letter to yield the visual information. A new approach is also proposed
based on the discussed techniques. Lip reading is useful for Automatic Speech
Recognition when the audio is absent, weak, or noisy in communication systems.
Human-computer communication will also require speech recognition.
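To make the three-step pipeline described in the abstract concrete, the following is a minimal sketch, assuming dlib's 68-point facial landmark model (points 48-67 cover the mouth) and OpenCV for video I/O. The paper surveys several localization techniques and does not prescribe this toolchain, and the final letter-matching step is reduced here to a toy mouth-opening feature.

```python
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
# The landmark model file must be downloaded separately; the path is an assumption.
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def locate_lips(frame):
    """Step 1: locate the lip contour in a frame (landmarks 48-67 are the mouth)."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    return np.array([(shape.part(i).x, shape.part(i).y) for i in range(48, 68)])

def track_lips(video_path):
    """Step 2: track the lip points across the following frames
    (per-frame re-detection stands in for a dedicated tracker)."""
    cap = cv2.VideoCapture(video_path)
    lip_sequence = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        pts = locate_lips(frame)
        if pts is not None:
            lip_sequence.append(pts)
    cap.release()
    return lip_sequence

def lips_to_letter(lip_sequence):
    """Step 3: convert the tracked lip model to a matched letter.
    A real system would use a trained classifier; this placeholder only
    thresholds the inner-lip opening (landmarks 62 and 66)."""
    opening = [np.linalg.norm(pts[62 - 48] - pts[66 - 48]) for pts in lip_sequence]
    return "open vowel" if np.mean(opening) > 5.0 else "closed consonant"
```

A survey such as this paper would swap out `locate_lips` (colour-space segmentation, active contours, landmark models, and so on) while the tracking and letter-matching stages remain largely the same.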
Related papers
- Style-Preserving Lip Sync via Audio-Aware Style Reference [88.02195932723744]
Individuals exhibit distinct lip shapes when speaking the same utterance, owing to their unique speaking styles.
We develop an advanced Transformer-based model adept at predicting lip motion corresponding to the input audio, augmented by the style information aggregated through cross-attention layers from style reference video.
Experiments validate the efficacy of the proposed approach in achieving precise lip sync, preserving speaking styles, and generating high-fidelity, realistic talking face videos.
arXiv Detail & Related papers (2024-08-10T02:46:11Z)
- Enhancing Speech-Driven 3D Facial Animation with Audio-Visual Guidance from Lip Reading Expert [13.60808166889775]
We introduce a method for speech-driven 3D facial animation to generate accurate lip movements.
A loss derived from the lip reading expert guides the speech-driven 3D facial animators to generate plausible lip motions aligned with the spoken transcripts.
We validate the effectiveness of our approach through broad experiments, showing noticeable improvements in lip synchronization and lip readability performance.
arXiv Detail & Related papers (2024-07-01T07:39:28Z)
- Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a Short Video [91.92782707888618]
We present a decomposition-composition framework named Speech to Lip (Speech2Lip) that disentangles speech-sensitive and speech-insensitive motion/appearance.
We show that our model can be trained by a video of just a few minutes in length and achieve state-of-the-art performance in both visual quality and speech-visual synchronization.
arXiv Detail & Related papers (2023-09-09T14:52:39Z)
- Exploring Phonetic Context-Aware Lip-Sync For Talking Face Generation [58.72068260933836]
The proposed Context-Aware LipSync framework (CALS) is comprised of an Audio-to-Lip map module and a Lip-to-Face module.
arXiv Detail & Related papers (2023-05-31T04:50:32Z)
- Seeing What You Said: Talking Face Generation Guided by a Lip Reading Expert [89.07178484337865]
Talking face generation, also known as speech-to-lip generation, reconstructs the facial motions of the lip region from coherent speech input.
Previous studies revealed the importance of lip-speech synchronization and visual quality.
We propose using a lip-reading expert to improve the intelligibility of the generated lip regions.
arXiv Detail & Related papers (2023-03-29T07:51:07Z)
- LipFormer: Learning to Lipread Unseen Speakers based on Visual-Landmark Transformers [43.13868262922689]
State-of-the-art lipreading methods excel at interpreting overlapped speakers, i.e., speakers seen during training.
Generalizing these methods to unseen speakers incurs catastrophic performance degradation.
We develop a sentence-level lipreading framework based on visual-landmark transformers, namely LipFormer.
arXiv Detail & Related papers (2023-02-04T10:22:18Z)
- Is Lip Region-of-Interest Sufficient for Lipreading? [24.294559985408192]
We propose to adopt the entire face for lipreading with self-supervised learning.
AV-HuBERT, an audio-visual multi-modal self-supervised learning framework, was adopted in our experiments.
arXiv Detail & Related papers (2022-05-28T01:34:24Z)
- Lip reading using external viseme decoding [4.728757318184405]
This paper shows how to use external text data for viseme-to-character mapping by dividing the video-to-character task into two stages.
Our proposed method improves word error rate by 4% compared to the normal sequence to sequence lip-reading model on the BBC-Oxford Lip Reading Sentences 2 dataset.
arXiv Detail & Related papers (2021-04-10T14:49:11Z)
- Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis [37.37319356008348]
We explore the task of lip to speech synthesis, i.e., learning to generate natural speech given only the lip movements of a speaker.
We focus on learning accurate lip sequences to speech mappings for individual speakers in unconstrained, large vocabulary settings.
We propose a novel approach with key design choices to achieve accurate, natural lip to speech synthesis.
arXiv Detail & Related papers (2020-05-17T10:29:19Z)
- Deformation Flow Based Two-Stream Network for Lip Reading [90.61063126619182]
Lip reading is the task of recognizing the speech content by analyzing movements in the lip region when people are speaking.
We observe the continuity in adjacent frames in the speaking process, and the consistency of the motion patterns among different speakers when they pronounce the same phoneme.
We introduce a Deformation Flow Network (DFN) to learn the deformation flow between adjacent frames, which directly captures the motion information within the lip region.
The learned deformation flow is then combined with the original grayscale frames in a two-stream network to perform lip reading (a minimal two-stream sketch follows this list).
arXiv Detail & Related papers (2020-03-12T11:13:44Z)
- Can We Read Speech Beyond the Lips? Rethinking RoI Selection for Deep Visual Speech Recognition [90.61063126619182]
We evaluate the effects of different facial regions with state-of-the-art visual speech recognition models.
We find that incorporating information from extraoral facial regions, even the upper face, consistently benefits VSR performance.
arXiv Detail & Related papers (2020-03-06T13:52:46Z)
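As a rough illustration of the two-stream idea in the Deformation Flow Based Two-Stream Network entry above, here is a minimal PyTorch sketch, assuming a single-channel grayscale lip crop and a precomputed two-channel flow field; the layer sizes, the flow source, and the 26-way output are placeholders, not the authors' architecture.

```python
import torch
import torch.nn as nn

class TwoStreamLipReader(nn.Module):
    """Toy two-stream fusion: one stream over grayscale lip crops,
    one over a (dx, dy) deformation/flow field, fused for classification."""
    def __init__(self, num_classes=26):
        super().__init__()
        def stream(in_channels):
            return nn.Sequential(
                nn.Conv2d(in_channels, 16, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            )
        self.gray_stream = stream(1)   # appearance stream
        self.flow_stream = stream(2)   # motion stream
        self.classifier = nn.Linear(32 + 32, num_classes)

    def forward(self, gray, flow):
        fused = torch.cat([self.gray_stream(gray), self.flow_stream(flow)], dim=1)
        return self.classifier(fused)

# Example: a batch of 4 lip crops (64x64) and matching flow fields.
model = TwoStreamLipReader()
logits = model(torch.randn(4, 1, 64, 64), torch.randn(4, 2, 64, 64))
```

In the actual paper the motion stream consumes a learned deformation flow between adjacent frames and the model operates on frame sequences; this per-frame sketch only shows how the two feature streams are fused.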
This list is automatically generated from the titles and abstracts of the papers on this site.