Is Lip Region-of-Interest Sufficient for Lipreading?
- URL: http://arxiv.org/abs/2205.14295v1
- Date: Sat, 28 May 2022 01:34:24 GMT
- Title: Is Lip Region-of-Interest Sufficient for Lipreading?
- Authors: Jing-Xuan Zhang and Gen-Shun Wan and Jia Pan
- Abstract summary: We propose to adopt the entire face for lipreading with self-supervised learning.
AV-HuBERT, an audio-visual multi-modal self-supervised learning framework, was adopted in our experiments.
- Score: 24.294559985408192
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Lip region-of-interest (ROI) is conventionally used for visual input in the
lipreading task. Few works have adopted the entire face as visual input because
lip-excluded parts of the face are usually considered to be redundant and
irrelevant to visual speech recognition. However, faces contain much more
detailed information than lips, such as speakers' head pose, emotion, identity,
etc. We argue that such information might benefit visual speech recognition if
a powerful feature extractor employing the entire face is trained. In this
work, we propose to adopt the entire face for lipreading with self-supervised
learning. AV-HuBERT, an audio-visual multi-modal self-supervised learning
framework, was adopted in our experiments. Our experimental results showed that
adopting the entire face achieved a 16% relative word error rate (WER) reduction
on the lipreading task compared with the baseline method using the lip region as
visual input. Without self-supervised pretraining, the model with face input
achieved a higher WER than the model with lip input when training data was
limited (30 hours), and a slightly lower WER when a large amount of training
data (433 hours) was used.
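The recognition model itself is left unchanged; the only difference from a conventional lipreading pipeline is the crop fed to the visual front-end, i.e. the whole face rather than a tight mouth ROI, before AV-HuBERT pretraining and fine-tuning. Below is a minimal sketch of the two cropping strategies; the landmark layout, crop sizes, and function names are illustrative assumptions, not the authors' exact preprocessing.

```python
import numpy as np

# Minimal sketch of lip-ROI vs. full-face cropping for a lipreading front-end.
# The 68-point landmark layout and the 88/96-pixel crop sizes are illustrative
# assumptions, not the exact preprocessing used in the paper.

LIP_IDX = slice(48, 68)  # mouth landmarks in the common 68-point scheme


def crop_square(frame: np.ndarray, center_xy: np.ndarray, size: int) -> np.ndarray:
    """Cut a size x size patch around center_xy, clamped to the frame borders."""
    h, w = frame.shape[:2]
    cx, cy = int(center_xy[0]), int(center_xy[1])
    x0 = int(np.clip(cx - size // 2, 0, max(w - size, 0)))
    y0 = int(np.clip(cy - size // 2, 0, max(h - size, 0)))
    return frame[y0:y0 + size, x0:x0 + size]


def lip_roi(frame: np.ndarray, landmarks: np.ndarray, size: int = 88) -> np.ndarray:
    """Baseline visual input: a tight crop centered on the mouth."""
    return crop_square(frame, landmarks[LIP_IDX].mean(axis=0), size)


def face_roi(frame: np.ndarray, landmarks: np.ndarray, size: int = 96) -> np.ndarray:
    """Proposed visual input: a crop covering the whole face."""
    return crop_square(frame, landmarks.mean(axis=0), size)
```

Either crop would then typically be resized, grayscaled, and stacked over time to form the video stream for the visual encoder; in the paper that encoder is AV-HuBERT, pretrained with its audio-visual masked-prediction objective and fine-tuned for lipreading.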
Related papers
- Where Visual Speech Meets Language: VSP-LLM Framework for Efficient and Context-Aware Visual Speech Processing [56.71450690166821]
We propose a novel framework, namely Visual Speech Processing incorporated with LLMs (VSP-LLM).
VSP-LLM is designed to perform the multiple tasks of visual speech recognition and translation.
We show that VSP-LLM trained on just 30 hours of labeled data can more effectively translate lip movements.
arXiv Detail & Related papers (2024-02-23T07:21:32Z)
- Leveraging Visemes for Better Visual Speech Representation and Lip Reading [2.7836084563851284]
We propose a novel approach that leverages visemes, which are groups of phonetically similar lip shapes, to extract more discriminative and robust video features for lip reading.
The proposed method reduces the lip-reading word error rate (WER) by 9.1% relative to the best previous method.
arXiv Detail & Related papers (2023-07-19T17:38:26Z)
- Seeing What You Said: Talking Face Generation Guided by a Lip Reading Expert [89.07178484337865]
Talking face generation, also known as speech-to-lip generation, reconstructs lip-related facial motions from coherent speech input.
Previous studies revealed the importance of lip-speech synchronization and visual quality.
We propose using a lip-reading expert to improve the intelligibility of the generated lip regions.
arXiv Detail & Related papers (2023-03-29T07:51:07Z)
- Sub-word Level Lip Reading With Visual Attention [88.89348882036512]
We focus on the unique challenges encountered in lip reading and propose tailored solutions.
We obtain state-of-the-art results on the challenging LRS2 and LRS3 benchmarks when training on public datasets.
Our best model achieves 22.6% word error rate on the LRS2 dataset, a performance unprecedented for lip reading models.
arXiv Detail & Related papers (2021-10-14T17:59:57Z)
- LiRA: Learning Visual Speech Representations from Audio through Self-supervision [53.18768477520411]
We propose Learning visual speech Representations from Audio via self-supervision (LiRA).
Specifically, we train a ResNet+Conformer model to predict acoustic features from unlabelled visual speech.
We show that our approach significantly outperforms other self-supervised methods on the Lip Reading in the Wild dataset.
arXiv Detail & Related papers (2021-06-16T23:20:06Z)
- Learn an Effective Lip Reading Model without Pains [96.21025771586159]
Lip reading, also known as visual speech recognition, aims to recognize the speech content from videos by analyzing the lip dynamics.
Most existing methods obtained high performance by constructing a complex neural network.
We find that making proper use of a few simple training strategies can consistently bring improvements without changing much of the model.
arXiv Detail & Related papers (2020-11-15T15:29:19Z)
- A Study on Lip Localization Techniques used for Lip reading from a Video [0.0]
Lip reading is useful for Automatic Speech Recognition when audio is absent or degraded by low volume or noise in communication systems.
The techniques can be applied to asymmetric lips, as well as to mouths with visible teeth, a visible tongue, or a moustache.
arXiv Detail & Related papers (2020-09-28T15:36:35Z)
- Can We Read Speech Beyond the Lips? Rethinking RoI Selection for Deep Visual Speech Recognition [90.61063126619182]
We evaluate the effects of different facial regions with state-of-the-art visual speech recognition models.
We find that incorporating information from extraoral facial regions, even the upper face, consistently benefits VSR performance.
arXiv Detail & Related papers (2020-03-06T13:52:46Z)