Is Lip Region-of-Interest Sufficient for Lipreading?
- URL: http://arxiv.org/abs/2205.14295v1
- Date: Sat, 28 May 2022 01:34:24 GMT
- Title: Is Lip Region-of-Interest Sufficient for Lipreading?
- Authors: Jing-Xuan Zhang and Gen-Shun Wan and Jia Pan
- Abstract summary: We propose to adopt the entire face for lipreading with self-supervised learning.
AV-HuBERT, an audio-visual multi-modal self-supervised learning framework, was adopted in our experiments.
- Score: 24.294559985408192
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Lip region-of-interest (ROI) is conventionally used for visual input in the
lipreading task. Few works have adopted the entire face as visual input because
lip-excluded parts of the face are usually considered to be redundant and
irrelevant to visual speech recognition. However, faces contain much more
detailed information than lips, such as speakers' head pose, emotion, identity,
etc. We argue that such information might benefit visual speech recognition if
a powerful feature extractor employing the entire face is trained. In this
work, we propose to adopt the entire face for lipreading with self-supervised
learning. AV-HuBERT, an audio-visual multi-modal self-supervised learning
framework, was adopted in our experiments. Our experimental results showed that
adopting the entire face achieved a 16% relative word error rate (WER) reduction
on the lipreading task compared with the baseline method using the lip region as
visual input. Without self-supervised pretraining, the model with face input
achieved a higher WER than the model with lip input when training data was
limited (30 hours), and a slightly lower WER when a large amount of training
data (433 hours) was used.
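The recognition model itself is left unchanged; the only difference from a conventional lipreading pipeline is the crop fed to the visual front-end, i.e. the whole face rather than a tight mouth ROI, before AV-HuBERT pretraining and fine-tuning. Below is a minimal sketch of the two cropping strategies; the landmark layout, crop sizes, and function names are illustrative assumptions, not the authors' exact preprocessing.

```python
import numpy as np

# Minimal sketch of lip-ROI vs. full-face cropping for a lipreading front-end.
# The 68-point landmark layout and the 88/96-pixel crop sizes are illustrative
# assumptions, not the exact preprocessing used in the paper.

LIP_IDX = slice(48, 68)  # mouth landmarks in the common 68-point scheme


def crop_square(frame: np.ndarray, center_xy: np.ndarray, size: int) -> np.ndarray:
    """Cut a size x size patch around center_xy, clamped to the frame borders."""
    h, w = frame.shape[:2]
    cx, cy = int(center_xy[0]), int(center_xy[1])
    x0 = int(np.clip(cx - size // 2, 0, max(w - size, 0)))
    y0 = int(np.clip(cy - size // 2, 0, max(h - size, 0)))
    return frame[y0:y0 + size, x0:x0 + size]


def lip_roi(frame: np.ndarray, landmarks: np.ndarray, size: int = 88) -> np.ndarray:
    """Baseline visual input: a tight crop centered on the mouth."""
    return crop_square(frame, landmarks[LIP_IDX].mean(axis=0), size)


def face_roi(frame: np.ndarray, landmarks: np.ndarray, size: int = 96) -> np.ndarray:
    """Proposed visual input: a crop covering the whole face."""
    return crop_square(frame, landmarks.mean(axis=0), size)
```

Either crop would then typically be resized, grayscaled, and stacked over time to form the video stream for the visual encoder; in the paper that encoder is AV-HuBERT, pretrained with its audio-visual masked-prediction objective and fine-tuned for lipreading.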
Related papers
- Where Visual Speech Meets Language: VSP-LLM Framework for Efficient and Context-Aware Visual Speech Processing [56.71450690166821]
We propose a novel framework, namely Visual Speech Processing incorporated with LLMs (VSP-LLM).
VSP-LLM is designed to perform the multiple tasks of visual speech recognition and translation.
We show that VSP-LLM trained on just 30 hours of labeled data can more effectively translate lip movements.
arXiv Detail & Related papers (2024-02-23T07:21:32Z)
- Leveraging Visemes for Better Visual Speech Representation and Lip Reading [2.7836084563851284]
We propose a novel approach that leverages visemes, which are groups of phonetically similar lip shapes, to extract more discriminative and robust video features for lip reading.
The proposed method reduces the lip-reading word error rate (WER) by 9.1% relative to the best previous method.
arXiv Detail & Related papers (2023-07-19T17:38:26Z)
- Seeing What You Said: Talking Face Generation Guided by a Lip Reading Expert [89.07178484337865]
Talking face generation, also known as speech-to-lip generation, reconstructs lip-related facial motions from coherent speech input.
Previous studies revealed the importance of lip-speech synchronization and visual quality.
We propose using a lip-reading expert to improve the intelligibility of the generated lip regions.
arXiv Detail & Related papers (2023-03-29T07:51:07Z)
- Sub-word Level Lip Reading With Visual Attention [88.89348882036512]
We focus on the unique challenges encountered in lip reading and propose tailored solutions.
We obtain state-of-the-art results on the challenging LRS2 and LRS3 benchmarks when training on public datasets.
Our best model achieves 22.6% word error rate on the LRS2 dataset, a performance unprecedented for lip reading models.
arXiv Detail & Related papers (2021-10-14T17:59:57Z)
- LiRA: Learning Visual Speech Representations from Audio through Self-supervision [53.18768477520411]
We propose Learning visual speech Representations from Audio via self-supervision (LiRA).
Specifically, we train a ResNet+Conformer model to predict acoustic features from unlabelled visual speech.
We show that our approach significantly outperforms other self-supervised methods on the Lip Reading in the Wild dataset.
arXiv Detail & Related papers (2021-06-16T23:20:06Z)
- Learn an Effective Lip Reading Model without Pains [96.21025771586159]
Lip reading, also known as visual speech recognition, aims to recognize the speech content from videos by analyzing the lip dynamics.
Most existing methods obtained high performance by constructing a complex neural network.
We find that making proper use of a few simple training strategies can consistently bring improvements without changing much of the model.
arXiv Detail & Related papers (2020-11-15T15:29:19Z)
- A Study on Lip Localization Techniques used for Lip reading from a Video [0.0]
Lip reading is useful for Automatic Speech Recognition when audio is absent or degraded by low volume or noise in communication systems.
The techniques can be applied to asymmetric lips, as well as to mouths with visible teeth, a visible tongue, or a moustache.
arXiv Detail & Related papers (2020-09-28T15:36:35Z)
- Can We Read Speech Beyond the Lips? Rethinking RoI Selection for Deep Visual Speech Recognition [90.61063126619182]
We evaluate the effects of different facial regions with state-of-the-art visual speech recognition models.
We find that incorporating information from extraoral facial regions, even the upper face, consistently benefits VSR performance.
arXiv Detail & Related papers (2020-03-06T13:52:46Z)