Mutual Information Maximization for Effective Lip Reading
- URL: http://arxiv.org/abs/2003.06439v1
- Date: Fri, 13 Mar 2020 18:47:42 GMT
- Title: Mutual Information Maximization for Effective Lip Reading
- Authors: Xing Zhao and Shuang Yang and Shiguang Shan and Xilin Chen
- Abstract summary: We propose to introduce mutual information constraints at both the local feature level and the global sequence level.
By combining these two advantages, the proposed method is expected to be both discriminative and robust for effective lip reading.
- Score: 99.11600901751673
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Lip reading has received increasing research interest in recent years due to the rapid development of deep learning and its wide range of potential applications. A key factor in obtaining good lip reading performance is how effectively the representation captures lip movement information while resisting the noise caused by changes in pose, lighting conditions, speaker appearance, and so on. Towards this goal, we propose to introduce mutual information constraints at both the local feature level and the global sequence level to strengthen the relation between the features and the speech content. On the one hand, we constrain the features generated at each time step to carry a strong relation to the speech content by imposing a local mutual information maximization constraint (LMIM), which improves the model's ability to discover fine-grained lip movements and the fine-grained differences among words with similar pronunciation, such as ``spend'' and ``spending''. On the other hand, we introduce a mutual information maximization constraint at the global sequence level (GMIM), so that the model pays more attention to key frames related to the speech content and less to the various noises that appear during speaking. By combining these two advantages, the proposed method is expected to be both discriminative and robust for effective lip reading. To verify the method, we evaluate it on two large-scale benchmarks. We perform a detailed analysis and comparison on several aspects, including the comparison of LMIM and GMIM with the baseline and the visualization of the learned representations. The results not only prove the effectiveness of the proposed method but also report new state-of-the-art performance on both benchmarks.
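The abstract describes LMIM and GMIM only at a high level and does not reproduce the paper's exact formulation. The sketch below (PyTorch; the InfoNCE-style lower bound, the feature shapes, and names such as local_mi_lower_bound and global_weighted_pool are illustrative assumptions, not the authors' released code) shows one way a local mutual-information term between per-frame features and a word embedding, combined with a global learned weighting over frames, could be wired into a word-level lip reading loss.

    # Hypothetical sketch of LMIM/GMIM-style auxiliary objectives (not the authors' code).
    # Assumes PyTorch; shapes and the InfoNCE-style estimator are illustrative choices.
    import torch
    import torch.nn.functional as F

    def local_mi_lower_bound(frame_feats, content_emb, temperature=0.07):
        """InfoNCE-style lower bound on the mutual information between per-frame
        features (B, T, D) and a per-clip content embedding (B, D)."""
        B, T, _ = frame_feats.shape
        f = F.normalize(frame_feats, dim=-1)
        c = F.normalize(content_emb, dim=-1)
        # Score every frame of every clip against every clip's content embedding.
        logits = torch.einsum('btd,kd->btk', f, c) / temperature         # (B, T, B)
        targets = torch.arange(B, device=f.device).repeat_interleave(T)  # clip b is the positive
        return -F.cross_entropy(logits.reshape(B * T, B), targets)       # negate CE => bound to maximize

    def global_weighted_pool(frame_feats, scorer):
        """GMIM-flavoured pooling: a learned scorer weights frames over time,
        so frames unrelated to the speech content can be down-weighted."""
        w = torch.softmax(scorer(frame_feats), dim=1)   # (B, T, 1) attention over time
        return (w * frame_feats).sum(dim=1)             # (B, D) sequence representation

    # Usage: word classification loss plus a weighted MI term.
    B, T, D, V = 4, 29, 256, 500                        # V: vocabulary size (e.g. 500 words in LRW)
    frame_feats = torch.randn(B, T, D)                  # stand-in for front-end outputs
    scorer, classifier = torch.nn.Linear(D, 1), torch.nn.Linear(D, V)
    word_emb = torch.nn.Embedding(V, D)
    labels = torch.randint(0, V, (B,))

    cls_loss = F.cross_entropy(classifier(global_weighted_pool(frame_feats, scorer)), labels)
    mi_bound = local_mi_lower_bound(frame_feats, word_emb(labels))
    total_loss = cls_loss - 0.1 * mi_bound              # the 0.1 trade-off weight is arbitrary
    total_loss.backward()

In this reading, maximizing the local bound pulls each frame's feature toward the spoken word, while the global weighting lets the classifier lean on informative frames and down-weight noisy ones.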
Related papers
- Learning Separable Hidden Unit Contributions for Speaker-Adaptive Lip-Reading [73.59525356467574]
A speaker's own characteristics can be captured well by a few of his/her facial images, or even a single image, using shallow networks.
The fine-grained dynamic features associated with the speech content expressed by the talking face require deep sequential networks.
Our approach consistently outperforms existing methods.
arXiv Detail & Related papers (2023-10-08T07:48:25Z) - Leveraging Visemes for Better Visual Speech Representation and Lip
Reading [2.7836084563851284]
We propose a novel approach that leverages visemes, which are groups of phonetically similar lip shapes, to extract more discriminative and robust video features for lip reading.
The proposed method reduces the lip-reading word error rate (WER) by 9.1% relative to the best previous method.
arXiv Detail & Related papers (2023-07-19T17:38:26Z) - Seeing What You Said: Talking Face Generation Guided by a Lip Reading
Expert [89.07178484337865]
Talking face generation, also known as speech-to-lip generation, reconstructs the facial motions of the lip region given coherent speech input.
Previous studies revealed the importance of lip-speech synchronization and visual quality.
We propose using a lip-reading expert to improve the intelligibility of the generated lip regions.
arXiv Detail & Related papers (2023-03-29T07:51:07Z) - Multi-Modal Multi-Correlation Learning for Audio-Visual Speech
Separation [38.75352529988137]
We propose a multi-modal multi-correlation learning framework targeting at the task of audio-visual speech separation.
We define two key correlations: (1) identity correlation (between timbre and facial attributes) and (2) phonetic correlation.
For implementation, a contrastive learning or adversarial training approach is applied to maximize these two correlations.
arXiv Detail & Related papers (2022-07-04T04:53:39Z) - Attention-Based Lip Audio-Visual Synthesis for Talking Face Generation
in the Wild [17.471128300990244]
In this paper, an AttnWav2Lip model is proposed that incorporates a spatial attention module and a channel attention module into the lip-syncing strategy.
To the best of our knowledge, this is the first attempt to introduce an attention mechanism into the talking face generation scheme.
arXiv Detail & Related papers (2022-03-08T10:18:25Z) - Sub-word Level Lip Reading With Visual Attention [88.89348882036512]
We focus on the unique challenges encountered in lip reading and propose tailored solutions.
We obtain state-of-the-art results on the challenging LRS2 and LRS3 benchmarks when training on public datasets.
Our best model achieves 22.6% word error rate on the LRS2 dataset, a performance unprecedented for lip reading models.
arXiv Detail & Related papers (2021-10-14T17:59:57Z) - LiRA: Learning Visual Speech Representations from Audio through
Self-supervision [53.18768477520411]
We propose Learning visual speech Representations from Audio via self-supervision (LiRA).
Specifically, we train a ResNet+Conformer model to predict acoustic features from unlabelled visual speech (a sketch of this kind of pretext objective appears after this list).
We show that our approach significantly outperforms other self-supervised methods on the Lip Reading in the Wild dataset.
arXiv Detail & Related papers (2021-06-16T23:20:06Z) - Lip-reading with Hierarchical Pyramidal Convolution and Self-Attention [98.52189797347354]
We introduce multi-scale processing into the spatial feature extraction for lip-reading.
We merge information in all time steps of the sequence by utilizing self-attention.
Our proposed model has achieved 86.83% accuracy, yielding 1.53% absolute improvement over the current state-of-the-art.
arXiv Detail & Related papers (2020-12-28T16:55:51Z) - The effectiveness of unsupervised subword modeling with autoregressive
and cross-lingual phone-aware networks [36.24509775775634]
We propose a two-stage learning framework that combines self-supervised learning and cross-lingual knowledge transfer.
Experiments on the ABX subword discriminability task conducted with the Libri-light and ZeroSpeech 2017 databases showed that our approach is competitive or superior to state-of-the-art studies.
arXiv Detail & Related papers (2020-12-17T12:33:49Z)
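For the LiRA entry above, the summary only states that the visual front-end is trained to predict acoustic features from unlabelled visual speech. A minimal sketch of such an audio-prediction pretext task is given below (PyTorch; the GRU stand-in for the ResNet+Conformer front-end, the 80-dimensional acoustic targets, and the L1 objective are assumptions for illustration, not the LiRA implementation).

    # Hypothetical sketch of a LiRA-style pretext task: regress acoustic features from
    # visual speech features without any text labels.  Not the LiRA implementation.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class VisualToAudioPretext(nn.Module):
        def __init__(self, vis_dim=512, hid_dim=256, acoustic_dim=80):
            super().__init__()
            # Stand-in temporal model; LiRA itself uses a ResNet+Conformer front-end.
            self.temporal = nn.GRU(vis_dim, hid_dim, batch_first=True, bidirectional=True)
            self.head = nn.Linear(2 * hid_dim, acoustic_dim)

        def forward(self, visual_feats):                 # (B, T, vis_dim)
            h, _ = self.temporal(visual_feats)           # (B, T, 2 * hid_dim)
            return self.head(h)                          # (B, T, acoustic_dim)

    # One self-supervised step: only paired (video, audio-feature) clips are needed.
    model = VisualToAudioPretext()
    visual_feats = torch.randn(2, 29, 512)               # e.g. 29 video frames per clip
    acoustic_targets = torch.randn(2, 29, 80)             # per-frame acoustic feature targets
    loss = F.l1_loss(model(visual_feats), acoustic_targets)
    loss.backward()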