Lip-reading with Hierarchical Pyramidal Convolution and Self-Attention
- URL: http://arxiv.org/abs/2012.14360v1
- Date: Mon, 28 Dec 2020 16:55:51 GMT
- Title: Lip-reading with Hierarchical Pyramidal Convolution and Self-Attention
- Authors: Hang Chen, Jun Du, Yu Hu, Li-Rong Dai, Chin-Hui Lee, Bao-Cai Yin
- Abstract summary: We introduce multi-scale processing into the spatial feature extraction for lip-reading.
We merge information in all time steps of the sequence by utilizing self-attention.
Our proposed model has achieved 86.83% accuracy, yielding 1.53% absolute improvement over the current state-of-the-art.
- Score: 98.52189797347354
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose a novel deep learning architecture to improve
word-level lip-reading. On the one hand, we introduce multi-scale processing
into the spatial feature extraction for lip-reading. Specifically, we propose
hierarchical pyramidal convolution (HPConv) to replace the standard
convolution in the original module, improving the model's ability to discover
fine-grained lip movements. On the other hand, we merge information across all
time steps of the sequence using self-attention, so that the model pays more
attention to the relevant frames. These two advantages are combined to further
enhance the model's classification power. Experiments on the Lip Reading in
the Wild (LRW) dataset show that our proposed model achieves 86.83% accuracy,
a 1.53% absolute improvement over the current state-of-the-art. We also
conducted extensive experiments to better understand the behavior of the
proposed model.
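The two ideas in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the kernel sizes, random weights, and the simple attention-weighted pooling below are illustrative assumptions standing in for HPConv's pyramid of kernel sizes and the paper's self-attention over time steps.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d_same(x, k):
    """Naive single-channel 'same' 2D convolution, for illustration only."""
    kh, kw = k.shape
    xp = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

def pyramidal_conv(x, kernel_sizes=(3, 5, 7)):
    """Multi-scale spatial filtering: one feature map per kernel size,
    stacked along a channel axis (a stand-in for a pyramid of kernel
    sizes inside a single convolutional layer)."""
    maps = []
    for ks in kernel_sizes:
        k = rng.normal(size=(ks, ks)) / ks  # hypothetical random weights
        maps.append(conv2d_same(x, k))
    return np.stack(maps, axis=0)  # (n_scales, H, W)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def self_attention_pool(frames, w):
    """Merge all time steps with attention weights so that relevant
    frames contribute more to the clip-level feature."""
    scores = softmax(frames @ w)   # (T,) one weight per frame
    return scores @ frames, scores  # weighted sum over time

# Toy clip: T frames, each reduced to one feature per spatial scale.
T, H, W = 5, 16, 16
feats = np.stack([pyramidal_conv(rng.normal(size=(H, W))).mean(axis=(1, 2))
                  for _ in range(T)])       # (T, 3)
w = rng.normal(size=feats.shape[1])         # hypothetical attention weights
pooled, att = self_attention_pool(feats, w)
# pooled has shape (3,); the attention weights over frames sum to ~1
```

In the real model the per-frame features would come from a deep front-end rather than a single random filter bank, but the structure is the same: multi-scale spatial filters per frame, then attention-weighted merging across the whole sequence instead of uniform averaging.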
Related papers
- Minusformer: Improving Time Series Forecasting by Progressively Learning Residuals [14.741951369068877]
We find that ubiquitous time series (TS) forecasting models are prone to severe overfitting.
We introduce a dual-stream and subtraction mechanism, which is a deep Boosting ensemble learning method.
The proposed method outperforms existing state-of-the-art methods, yielding an average performance improvement of 11.9% across various datasets.
arXiv Detail & Related papers (2024-02-04T03:54:31Z) - Attention-Based Lip Audio-Visual Synthesis for Talking Face Generation in the Wild [17.471128300990244]
In this paper, an AttnWav2Lip model is proposed by incorporating a spatial attention module and a channel attention module into the lip-syncing strategy.
To the best of our knowledge, this is the first attempt to introduce an attention mechanism into the scheme of talking face generation.
arXiv Detail & Related papers (2022-03-08T10:18:25Z) - UnitedQA: A Hybrid Approach for Open Domain Question Answering [70.54286377610953]
We apply novel techniques to enhance both extractive and generative readers built upon recent pretrained neural language models.
Our approach outperforms previous state-of-the-art models by 3.3 and 2.7 points in exact match on NaturalQuestions and TriviaQA respectively.
arXiv Detail & Related papers (2021-01-01T06:36:16Z) - Learn an Effective Lip Reading Model without Pains [96.21025771586159]
Lip reading, also known as visual speech recognition, aims to recognize the speech content from videos by analyzing the lip dynamics.
Most existing methods obtain high performance by constructing a complex neural network.
We find that making proper use of these strategies could always bring exciting improvements without changing much of the model.
arXiv Detail & Related papers (2020-11-15T15:29:19Z) - Towards Practical Lipreading with Distilled and Efficient Models [57.41253104365274]
Lipreading has witnessed a lot of progress due to the resurgence of neural networks.
Recent works have placed emphasis on aspects such as improving performance by finding the optimal architecture or improving generalization.
There is still a significant gap between the current methodologies and the requirements for an effective deployment of lipreading in practical scenarios.
We propose a series of innovations that significantly bridge that gap: first, using self-distillation, we raise the state-of-the-art performance by a wide margin on LRW and LRW-1000, to 88.5% and 46.6%, respectively.
arXiv Detail & Related papers (2020-07-13T16:56:27Z) - Mutual Information Maximization for Effective Lip Reading [99.11600901751673]
We propose to introduce mutual information constraints at both the local feature level and the global sequence level.
By combining these two advantages together, the proposed method is expected to be both discriminative and robust for effective lip reading.
arXiv Detail & Related papers (2020-03-13T18:47:42Z) - Lipreading using Temporal Convolutional Networks [57.41253104365274]
The current model for recognition of isolated words in the wild consists of a residual network and Bidirectional Gated Recurrent Unit (BGRU) layers.
We address the limitations of this model and we propose changes which further improve its performance.
Our proposed model results in an absolute improvement of 1.2% and 3.2%, respectively, on these datasets.
arXiv Detail & Related papers (2020-01-23T17:49:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.