Deformation Flow Based Two-Stream Network for Lip Reading
- URL: http://arxiv.org/abs/2003.05709v2
- Date: Fri, 13 Mar 2020 00:54:46 GMT
- Title: Deformation Flow Based Two-Stream Network for Lip Reading
- Authors: Jingyun Xiao, Shuang Yang, Yuanhang Zhang, Shiguang Shan, Xilin Chen
- Abstract summary: Lip reading is the task of recognizing the speech content by analyzing movements in the lip region when people are speaking.
We observe the continuity between adjacent frames in the speaking process and the consistency of motion patterns among different speakers when they pronounce the same phoneme.
We introduce a Deformation Flow Network (DFN) to learn the deformation flow between adjacent frames, which directly captures the motion information within the lip region.
The learned deformation flow is then combined with the original grayscale frames with a two-stream network to perform lip reading.
- Score: 90.61063126619182
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Lip reading is the task of recognizing the speech content by analyzing
movements in the lip region when people are speaking. Observing the continuity
between adjacent frames in the speaking process and the consistency of
the motion patterns among different speakers when they pronounce the same
phoneme, we model the lip movements in the speaking process as a sequence of
apparent deformations in the lip region. Specifically, we introduce a
Deformation Flow Network (DFN) to learn the deformation flow between adjacent
frames, which directly captures the motion information within the lip region.
The learned deformation flow is then combined with the original grayscale
frames with a two-stream network to perform lip reading. Different from
previous two-stream networks, we make the two streams learn from each other
during training by introducing a bidirectional knowledge distillation loss to
train the two branches jointly. Owing to the complementary cues provided by the
different branches, the two-stream network shows a substantial improvement over
using either branch alone. A thorough experimental evaluation on two
large-scale lip reading benchmarks is presented with detailed analysis. The
results accord with our motivation, and show that our method achieves
state-of-the-art or comparable performance on these two challenging datasets.
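The abstract names two components without giving their exact formulations: a Deformation Flow Network that predicts dense motion between adjacent frames, and a bidirectional knowledge distillation loss that couples the grayscale and flow branches. The PyTorch sketch below is one plausible reading under stated assumptions: the flow is applied by bilinear warping, the distillation term is a symmetric, temperature-softened KL divergence between the two branches' word-level posteriors, and all function names, the temperature, and the loss weight are hypothetical rather than taken from the paper.

```python
# Minimal sketch (assumptions, not the paper's exact formulation):
# (1) applying a dense deformation flow to warp one grayscale frame toward the
#     next, and (2) a bidirectional knowledge-distillation loss that lets the
#     grayscale branch and the deformation-flow branch learn from each other.
import torch
import torch.nn.functional as F


def warp_by_flow(frame, flow):
    """Warp grayscale frames (N, 1, H, W) by a dense deformation flow
    (N, 2, H, W) given as pixel offsets, using bilinear sampling."""
    n, _, h, w = frame.shape
    ys, xs = torch.meshgrid(
        torch.linspace(-1.0, 1.0, h, device=frame.device),
        torch.linspace(-1.0, 1.0, w, device=frame.device),
        indexing="ij",
    )
    base = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(n, -1, -1, -1)
    # Convert pixel offsets to the normalized [-1, 1] grid used by grid_sample.
    norm_flow = torch.stack(
        (flow[:, 0] / ((w - 1) / 2.0), flow[:, 1] / ((h - 1) / 2.0)), dim=-1
    )
    return F.grid_sample(frame, base + norm_flow, align_corners=True)


def bidirectional_kd_loss(logits_gray, logits_flow, temperature=2.0):
    """Symmetric, temperature-softened KL divergence between the posteriors of
    the two branches; each branch is distilled toward the other's detached
    prediction. The temperature value is an illustrative choice."""
    log_p_gray = F.log_softmax(logits_gray / temperature, dim=-1)
    log_p_flow = F.log_softmax(logits_flow / temperature, dim=-1)
    kd_gray = F.kl_div(log_p_gray, log_p_flow.detach().exp(), reduction="batchmean")
    kd_flow = F.kl_div(log_p_flow, log_p_gray.detach().exp(), reduction="batchmean")
    return (kd_gray + kd_flow) * temperature ** 2


def two_stream_loss(logits_gray, logits_flow, targets, kd_weight=1.0):
    """Word-level classification loss on each branch plus the mutual
    distillation term (kd_weight is a hypothetical hyperparameter)."""
    ce = F.cross_entropy(logits_gray, targets) + F.cross_entropy(logits_flow, targets)
    return ce + kd_weight * bidirectional_kd_loss(logits_gray, logits_flow)
```

Detaching the opposite branch's prediction in each KL term treats it as a fixed teacher for that direction, which is one common way to keep mutual distillation stable.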
Related papers
- Cross-modal Audio-visual Co-learning for Text-independent Speaker
Verification [55.624946113550195]
This paper proposes a cross-modal speech co-learning paradigm.
Two cross-modal boosters are introduced based on an audio-visual pseudo-siamese structure to learn the modality-transformed correlation.
Experimental results on the LRSLip3, GridLip, LomGridLip, and VoxLip datasets demonstrate that the proposed method achieves average relative performance improvements of 60% and 20%.
arXiv Detail & Related papers (2023-02-22T10:06:37Z) - LipFormer: Learning to Lipread Unseen Speakers based on Visual-Landmark
Transformers [43.13868262922689]
State-of-the-art lipreading methods excel at interpreting overlapped speakers, i.e., speakers that also appear in the training set.
Generalizing these methods to unseen speakers incurs catastrophic performance degradation.
We develop a sentence-level lipreading framework based on visual-landmark transformers, namely LipFormer.
arXiv Detail & Related papers (2023-02-04T10:22:18Z) - COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for
Cross-Modal Retrieval [59.15034487974549]
We propose a novel COllaborative Two-Stream vision-language pretraining model termed COTS for image-text retrieval.
Our COTS achieves the highest performance among all two-stream methods and performance comparable to the latest single-stream methods while being 10,800X faster in inference.
Importantly, our COTS is also applicable to text-to-video retrieval, yielding a new state-of-the-art on the widely-used MSR-VTT dataset.
arXiv Detail & Related papers (2022-04-15T12:34:47Z) - SimulLR: Simultaneous Lip Reading Transducer with Attention-Guided
Adaptive Memory [61.44510300515693]
We study the task of simultaneous lip reading and devise SimulLR, a simultaneous lip reading transducer with attention-guided adaptive memory.
The experiments show that SimulLR achieves a translation speedup of 9.10 times compared with state-of-the-art non-simultaneous methods.
arXiv Detail & Related papers (2021-08-31T05:54:16Z) - Improving Ultrasound Tongue Image Reconstruction from Lip Images Using
Self-supervised Learning and Attention Mechanism [1.52292571922932]
Given an observable image sequence of the lips, can we picture the corresponding tongue motion?
We formulate this problem as a self-supervised learning task and employ a two-stream convolutional network and a long short-term memory network with an attention mechanism.
The results show that our model is able to generate images close to the real ultrasound tongue images and enables matching between the two imaging modalities.
arXiv Detail & Related papers (2021-06-20T10:51:23Z) - LiRA: Learning Visual Speech Representations from Audio through
Self-supervision [53.18768477520411]
We propose Learning visual speech Representations from Audio via self-supervision (LiRA).
Specifically, we train a ResNet+Conformer model to predict acoustic features from unlabelled visual speech.
We show that our approach significantly outperforms other self-supervised methods on the Lip Reading in the Wild dataset.
arXiv Detail & Related papers (2021-06-16T23:20:06Z) - Lip reading using external viseme decoding [4.728757318184405]
This paper shows how to use external text data (for viseme-to-character mapping) by dividing the video-to-character task into two stages.
The proposed method improves the word error rate by 4% compared to a standard sequence-to-sequence lip-reading model on the BBC-Oxford Lip Reading Sentences 2 dataset.
arXiv Detail & Related papers (2021-04-10T14:49:11Z) - Mutual Information Maximization for Effective Lip Reading [99.11600901751673]
We propose to introduce mutual information constraints at both the local feature level and the global sequence level.
By combining these two constraints, the proposed method is expected to be both discriminative and robust for effective lip reading (an illustrative sketch of one common formulation follows this list).
arXiv Detail & Related papers (2020-03-13T18:47:42Z)
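The last entry above mentions mutual information constraints at the local feature level and the global sequence level without specifying the estimator. Purely as an illustration, an InfoNCE-style lower bound is one common way to impose such constraints; the sketch below assumes hypothetical paired feature and label-embedding tensors and is not necessarily the loss actually used in that paper.

```python
# Illustration only: an InfoNCE-style lower bound is one common way to impose
# mutual-information constraints at a local (per-frame) level and a global
# (sequence) level. All tensor names are hypothetical and this is not
# necessarily the loss used in the paper summarized above.
import torch
import torch.nn.functional as F


def info_nce(queries, keys, temperature=0.1):
    """queries, keys: (N, D) paired representations; row i of `keys` is the
    positive for row i of `queries`, all other rows serve as negatives."""
    q = F.normalize(queries, dim=-1)
    k = F.normalize(keys, dim=-1)
    logits = q @ k.t() / temperature
    targets = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, targets)


def mi_constrained_loss(frame_feats, frame_label_emb, seq_feat, seq_label_emb):
    # Local level: each frame-level feature against an embedding of its label.
    local = info_nce(frame_feats, frame_label_emb)
    # Global level: the pooled sequence feature against the label embedding.
    global_term = info_nce(seq_feat, seq_label_emb)
    return local + global_term
```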