Learn an Effective Lip Reading Model without Pains
- URL: http://arxiv.org/abs/2011.07557v1
- Date: Sun, 15 Nov 2020 15:29:19 GMT
- Title: Learn an Effective Lip Reading Model without Pains
- Authors: Dalu Feng, Shuang Yang, Shiguang Shan, Xilin Chen
- Abstract summary: Lip reading, also known as visual speech recognition, aims to recognize the speech content from videos by analyzing the lip dynamics.
Most existing methods obtain high performance by constructing complex neural networks, together with several customized training strategies.
We find that making proper use of these strategies can consistently bring exciting improvements without changing much of the model.
- Score: 96.21025771586159
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Lip reading, also known as visual speech recognition, aims to recognize the speech content of a video by analyzing the lip dynamics. There has been appealing progress in recent years, benefiting greatly from rapidly developed deep learning techniques and recent large-scale lip-reading datasets. Most existing methods obtain high performance by constructing a complex neural network, together with several customized training strategies that are usually described only briefly or shown only in the source code. We find that making proper use of these strategies can consistently bring exciting improvements without changing much of the model. Considering the non-negligible effect of these strategies and the difficulty of training an effective lip reading model, we perform the first comprehensive quantitative study and comparative analysis of the effects of several different choices for lip reading. By introducing only some easy-to-get refinements to the baseline pipeline, we obtain a clear improvement in performance, from 83.7% to 88.4% and from 38.2% to 55.7% on the two largest publicly available lip reading datasets, LRW and LRW-1000, respectively. These results are comparable to, and even surpass, the existing state-of-the-art results.
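The refinements the paper credits for these gains are training-level choices rather than architectural changes. As a rough illustration of the kind of easy-to-get refinements it refers to, the sketch below adds label smoothing, a cosine learning-rate schedule, and MixUp to a generic video-classification training loop; the model, data loader, and hyperparameter values are placeholder assumptions, not the authors' released recipe.

```python
# Illustrative sketch only: label smoothing, a cosine LR schedule, and MixUp
# are typical of the "easy-to-get" training refinements the paper discusses;
# the model, loader, and hyperparameters below are placeholder assumptions.
import torch
import torch.nn as nn
import numpy as np

def mixup(clips, labels, alpha=0.2):
    """Convexly mix pairs of video clips and return both label sets."""
    lam = np.random.beta(alpha, alpha)
    index = torch.randperm(clips.size(0), device=clips.device)
    mixed = lam * clips + (1.0 - lam) * clips[index]
    return mixed, labels, labels[index], lam

def train(model, loader, num_epochs=80, device="cuda"):
    criterion = nn.CrossEntropyLoss(label_smoothing=0.1)        # label smoothing
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=num_epochs)                             # cosine schedule
    model.to(device).train()
    for epoch in range(num_epochs):
        for clips, labels in loader:   # clips: (B, T, 1, H, W) grayscale mouth crops
            clips, labels = clips.to(device), labels.to(device)
            mixed, y_a, y_b, lam = mixup(clips, labels)
            logits = model(mixed)
            loss = lam * criterion(logits, y_a) + (1.0 - lam) * criterion(logits, y_b)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()
```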
Related papers
- Leveraging Visemes for Better Visual Speech Representation and Lip Reading [2.7836084563851284]
We propose a novel approach that leverages visemes, which are groups of phonetically similar lip shapes, to extract more discriminative and robust video features for lip reading.
The proposed method reduces the lip-reading word error rate (WER) by 9.1% relative to the best previous method.
arXiv Detail & Related papers (2023-07-19T17:38:26Z)
- Training Strategies for Improved Lip-reading [61.661446956793604]
We investigate the performance of state-of-the-art data augmentation approaches, temporal models and other training strategies.
A combination of all the methods results in a classification accuracy of 93.4%, which is an absolute improvement of 4.6% over the current state-of-the-art performance.
An error analysis of the various training strategies reveals that the performance improves by increasing the classification accuracy of hard-to-recognise words.
arXiv Detail & Related papers (2022-09-03T09:38:11Z)
- Sub-word Level Lip Reading With Visual Attention [88.89348882036512]
We focus on the unique challenges encountered in lip reading and propose tailored solutions.
We obtain state-of-the-art results on the challenging LRS2 and LRS3 benchmarks when training on public datasets.
Our best model achieves 22.6% word error rate on the LRS2 dataset, a performance unprecedented for lip reading models.
arXiv Detail & Related papers (2021-10-14T17:59:57Z)
- LiRA: Learning Visual Speech Representations from Audio through Self-supervision [53.18768477520411]
We propose Learning visual speech Representations from Audio via self-supervision (LiRA)
Specifically, we train a ResNet+Conformer model to predict acoustic features from unlabelled visual speech.
We show that our approach significantly outperforms other self-supervised methods on the Lip Reading in the Wild dataset.
arXiv Detail & Related papers (2021-06-16T23:20:06Z)
- Lip-reading with Hierarchical Pyramidal Convolution and Self-Attention [98.52189797347354]
We introduce multi-scale processing into the spatial feature extraction for lip-reading.
We merge information across all time steps of the sequence by utilizing self-attention; a minimal sketch of this kind of attention-based temporal aggregation follows this list.
Our proposed model has achieved 86.83% accuracy, yielding 1.53% absolute improvement over the current state-of-the-art.
arXiv Detail & Related papers (2020-12-28T16:55:51Z)
- Mutual Information Maximization for Effective Lip Reading [99.11600901751673]
We propose to introduce mutual information constraints at both the local feature level and the global sequence level.
By combining these two constraints, the proposed method is expected to be both discriminative and robust for effective lip reading.
arXiv Detail & Related papers (2020-03-13T18:47:42Z)
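One of the related papers above (Lip-reading with Hierarchical Pyramidal Convolution and Self-Attention) merges information across time steps with self-attention. The sketch below shows one common way to implement that kind of attention-based temporal aggregation over per-frame features; the `AttentivePool` name, feature dimension, and head count are illustrative assumptions, not the architecture described in that paper.

```python
# Illustrative sketch of attention-based temporal aggregation over per-frame
# features; the AttentivePool name and dimensions are assumptions, not the
# architecture from the cited paper.
import torch
import torch.nn as nn

class AttentivePool(nn.Module):
    """Aggregate a (B, T, D) sequence of frame features into a (B, D) clip vector."""
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.query = nn.Parameter(torch.randn(1, 1, dim))  # learned pooling query

    def forward(self, frame_feats):                          # frame_feats: (B, T, D)
        q = self.query.expand(frame_feats.size(0), -1, -1)
        pooled, _ = self.attn(q, frame_feats, frame_feats)   # attend over all time steps
        return pooled.squeeze(1)                             # (B, D)

# Usage: feats = frontend(clips)  # (B, T, 512); logits = classifier(AttentivePool()(feats))
```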
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.