Can We Read Speech Beyond the Lips? Rethinking RoI Selection for Deep
Visual Speech Recognition
- URL: http://arxiv.org/abs/2003.03206v2
- Date: Mon, 9 Mar 2020 06:06:20 GMT
- Title: Can We Read Speech Beyond the Lips? Rethinking RoI Selection for Deep
Visual Speech Recognition
- Authors: Yuanhang Zhang, Shuang Yang, Jingyun Xiao, Shiguang Shan, Xilin Chen
- Abstract summary: We evaluate the effects of different facial regions with state-of-the-art visual speech recognition models.
We find that incorporating information from extraoral facial regions, even the upper face, consistently benefits VSR performance.
- Score: 90.61063126619182
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in deep learning have heightened interest among researchers
in the field of visual speech recognition (VSR). Currently, most existing
methods equate VSR with automatic lip reading, which attempts to recognise
speech by analysing lip motion. However, human experience and psychological
studies suggest that we do not always fix our gaze at each other's lips during
a face-to-face conversation, but rather scan the whole face repetitively. This
inspires us to revisit a fundamental yet somehow overlooked problem: can VSR
models benefit from reading extraoral facial regions, i.e. beyond the lips? In
this paper, we perform a comprehensive study to evaluate the effects of
different facial regions with state-of-the-art VSR models, including the mouth,
the whole face, the upper face, and even the cheeks. Experiments are conducted
on both word-level and sentence-level benchmarks with different
characteristics. We find that despite the complex variations of the data,
incorporating information from extraoral facial regions, even the upper face,
consistently benefits VSR performance. Furthermore, we introduce a simple yet
effective method based on Cutout to learn more discriminative features for
face-based VSR, hoping to maximise the utility of information encoded in
different facial regions. Our experiments show obvious improvements over
existing state-of-the-art methods that use only the lip region as inputs, a
result we believe would probably provide the VSR community with some new and
exciting insights.
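As a rough illustration of the Cutout-based idea described in the abstract, the sketch below masks out a random square region in every frame of an aligned face clip, so a face-based VSR model cannot rely on any single facial region. This is a generic Cutout-style routine written for this summary, not the authors' code; the mask size, its random placement, and keeping the same mask position across all frames of a clip are assumptions for illustration only.

```python
import numpy as np

def cutout_face_clip(frames, mask_size=40, rng=None):
    """Cutout-style augmentation for a face clip of shape (T, H, W, C).

    A single square region is zeroed at the same location in every frame,
    encouraging a face-based VSR model to use information from all
    facial regions rather than only the lips.
    """
    rng = rng or np.random.default_rng()
    _, h, w, _ = frames.shape
    cy = int(rng.integers(0, h))          # random mask centre (row)
    cx = int(rng.integers(0, w))          # random mask centre (column)
    y0, y1 = max(cy - mask_size // 2, 0), min(cy + mask_size // 2, h)
    x0, x1 = max(cx - mask_size // 2, 0), min(cx + mask_size // 2, w)
    out = frames.copy()
    out[:, y0:y1, x0:x1, :] = 0           # zero the patch in every frame
    return out

# Example: augment a dummy 29-frame, 112x112 face clip.
clip = np.random.rand(29, 112, 112, 3).astype(np.float32)
augmented = cutout_face_clip(clip, mask_size=40)
```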
Related papers
- Learning Separable Hidden Unit Contributions for Speaker-Adaptive Lip-Reading [73.59525356467574]
A speaker's own characteristics can be captured well by a few of his/her facial images, or even a single image, using shallow networks.
Fine-grained dynamic features associated with the speech content expressed by the talking face, by contrast, require deep sequential networks.
Our approach consistently outperforms existing methods.
arXiv Detail & Related papers (2023-10-08T07:48:25Z)
- Leveraging Visemes for Better Visual Speech Representation and Lip Reading [2.7836084563851284]
We propose a novel approach that leverages visemes, which are groups of phonetically similar lip shapes, to extract more discriminative and robust video features for lip reading.
The proposed method reduces the lip-reading word error rate (WER) by 9.1% relative to the best previous method.
arXiv Detail & Related papers (2023-07-19T17:38:26Z)
- SynthVSR: Scaling Up Visual Speech Recognition With Synthetic Supervision [60.54020550732634]
We study the potential of leveraging synthetic visual data for visual speech recognition (VSR).
The key idea is to leverage a speech-driven lip animation model that generates lip movements conditioned on the input speech.
We evaluate the performance of our approach on the largest public VSR benchmark, Lip Reading Sentences 3 (LRS3).
arXiv Detail & Related papers (2023-03-30T07:43:27Z)
- Seeing What You Said: Talking Face Generation Guided by a Lip Reading Expert [89.07178484337865]
Talking face generation, also known as speech-to-lip generation, reconstructs facial motions concerning lips given coherent speech input.
Previous studies revealed the importance of lip-speech synchronization and visual quality.
We propose using a lip-reading expert to improve the intelligibility of the generated lip regions.
arXiv Detail & Related papers (2023-03-29T07:51:07Z)
- CIAO! A Contrastive Adaptation Mechanism for Non-Universal Facial Expression Recognition [80.07590100872548]
We propose Contrastive Inhibitory Adaptation (CIAO), a mechanism that adapts the last layer of facial encoders to depict specific affective characteristics on different datasets.
CIAO improves facial expression recognition performance over six different datasets with very distinct affective representations.
arXiv Detail & Related papers (2022-08-10T15:46:05Z)
- Is Lip Region-of-Interest Sufficient for Lipreading? [24.294559985408192]
We propose to adopt the entire face for lipreading with self-supervised learning.
AV-HuBERT, an audio-visual multi-modal self-supervised learning framework, was adopted in our experiments.
arXiv Detail & Related papers (2022-05-28T01:34:24Z)
- Visualizing Automatic Speech Recognition -- Means for a Better Understanding? [0.1868368163807795]
We show how attribution methods, which we import from image recognition and suitably adapt to handle audio data, can help to clarify the working of ASR.
Taking Deep Speech, an end-to-end model for ASR, as a case study, we show how these techniques help to visualize which features of the input are the most influential in determining the output.
arXiv Detail & Related papers (2022-02-01T13:35:08Z)
- Sub-word Level Lip Reading With Visual Attention [88.89348882036512]
We focus on the unique challenges encountered in lip reading and propose tailored solutions.
We obtain state-of-the-art results on the challenging LRS2 and LRS3 benchmarks when training on public datasets.
Our best model achieves 22.6% word error rate on the LRS2 dataset, a performance unprecedented for lip reading models.
arXiv Detail & Related papers (2021-10-14T17:59:57Z)
- Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis [37.37319356008348]
We explore the task of lip to speech synthesis, i.e., learning to generate natural speech given only the lip movements of a speaker.
We focus on learning accurate lip sequences to speech mappings for individual speakers in unconstrained, large vocabulary settings.
We propose a novel approach with key design choices to achieve accurate, natural lip to speech synthesis.
arXiv Detail & Related papers (2020-05-17T10:29:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.