SimulLR: Simultaneous Lip Reading Transducer with Attention-Guided
Adaptive Memory
- URL: http://arxiv.org/abs/2108.13630v1
- Date: Tue, 31 Aug 2021 05:54:16 GMT
- Title: SimulLR: Simultaneous Lip Reading Transducer with Attention-Guided
Adaptive Memory
- Authors: Zhijie Lin, Zhou Zhao, Haoyuan Li, Jinglin Liu, Meng Zhang, Xingshan
Zeng, Xiaofei He
- Abstract summary: We study the task of simultaneous lip reading and devise SimulLR, a simultaneous lip reading transducer with attention-guided adaptive memory.
The experiments show that SimulLR achieves a translation speedup of 9.10 times compared with the state-of-the-art non-simultaneous methods.
- Score: 61.44510300515693
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Lip reading, aiming to recognize spoken sentences according to the given
video of lip movements without relying on the audio stream, has attracted great
interest due to its application in many scenarios. Although prior works that
explore lip reading have obtained salient achievements, they are all trained in
a non-simultaneous manner where the predictions are generated requiring access
to the full video. To break through this constraint, we study the task of
simultaneous lip reading and devise SimulLR, a simultaneous lip reading
transducer with attention-guided adaptive memory from three aspects: (1) To
address the challenge of monotonic alignments while considering the syntactic
structure of the generated sentences under the simultaneous setting, we build a
transducer-based model and design several effective training strategies
including CTC pre-training, model warm-up and curriculum learning to promote
the training of the lip reading transducer. (2) To learn better spatio-temporal
representations for the simultaneous encoder, we construct a truncated 3D
convolution and time-restricted self-attention layer to perform
frame-to-frame interaction within a video segment containing a fixed number of
frames. (3) The history information is always limited by storage constraints in
real-time scenarios, especially for massive video data. Therefore, we devise a
novel attention-guided adaptive memory to organize semantic information of
history segments and enhance the visual representations with acceptable
computation-aware latency. The experiments show that SimulLR achieves a
translation speedup of 9.10$\times$ compared with the state-of-the-art
non-simultaneous methods, and also obtains competitive results, which indicates
the effectiveness of our proposed methods.
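The three aspects above describe training and architecture choices that can be made concrete with short sketches. First, a minimal, hypothetical illustration of the CTC pre-training mentioned in aspect (1): the visual encoder is first optimized with a frame-level CTC objective before transducer training. The encoder, vocabulary size, and tensor shapes below are illustrative assumptions, not the paper's actual configuration.

```python
# Hypothetical sketch of CTC pre-training for the visual encoder (aspect (1)).
# The GRU encoder, vocabulary size, and shapes are placeholders, not the
# architecture used by SimulLR.
import torch

vocab_size, blank_id = 30, 0
encoder = torch.nn.GRU(input_size=256, hidden_size=256, batch_first=True)
proj = torch.nn.Linear(256, vocab_size)
ctc_loss = torch.nn.CTCLoss(blank=blank_id, zero_infinity=True)

frames = torch.randn(2, 40, 256)                     # (batch, time, feature) visual features
targets = torch.randint(1, vocab_size, (2, 12))      # dummy token labels
hidden, _ = encoder(frames)
log_probs = proj(hidden).log_softmax(-1).transpose(0, 1)  # CTC expects (time, batch, vocab)
loss = ctc_loss(log_probs, targets,
                torch.full((2,), 40), torch.full((2,), 12))
loss.backward()                                      # pre-train the encoder before transducer training
```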
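Second, a sketch of the time-restricted self-attention from aspect (2): frames attend only to frames within the same fixed-length segment, which keeps the encoder usable in a streaming setting. The mask construction and segment length are assumptions for illustration.

```python
# Hypothetical sketch of segment-restricted (time-restricted) self-attention
# for the simultaneous encoder (aspect (2)); `seg_len` and the layer sizes are
# illustrative, not taken from the paper.
import torch

def segment_mask(num_frames: int, seg_len: int) -> torch.Tensor:
    """Boolean (num_frames, num_frames) mask; True blocks attention across segments."""
    seg_ids = torch.arange(num_frames) // seg_len    # segment index of every frame
    return seg_ids.unsqueeze(0) != seg_ids.unsqueeze(1)

frames = torch.randn(1, 8, 256)                      # (batch, time, feature)
attn = torch.nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)
out, _ = attn(frames, frames, frames, attn_mask=segment_mask(8, 4))
# Each frame now only interacts with the other frames of its own 4-frame segment.
```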
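Third, one possible reading of the attention-guided adaptive memory from aspect (3): finished segments are folded into a fixed number of memory slots by attention, so storage does not grow with video length, and the current segment attends over those slots to enhance its visual representation. The slot count, update rule, and module names are assumptions, not the released implementation.

```python
# Hypothetical sketch of an attention-guided adaptive memory (aspect (3));
# the slot count, dimensions, and residual updates are illustrative choices.
import torch

class AdaptiveMemory(torch.nn.Module):
    def __init__(self, dim: int = 256, num_slots: int = 8):
        super().__init__()
        self.init_memory = torch.nn.Parameter(torch.randn(num_slots, dim) * 0.02)
        self.write = torch.nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.read = torch.nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def update(self, memory, segment):
        # Write: memory slots attend to the finished segment, compressing its
        # semantic content at a fixed cost regardless of history length.
        written, _ = self.write(memory, segment, segment)
        return memory + written

    def enhance(self, memory, segment):
        # Read: current-segment features attend over the compact memory.
        read, _ = self.read(segment, memory, memory)
        return segment + read

module = AdaptiveMemory()
memory = module.init_memory.unsqueeze(0)             # (1, slots, dim)
for segment in torch.randn(3, 1, 4, 256):            # three 4-frame segments
    enhanced = module.enhance(memory, segment)       # enhance current segment with history
    memory = module.update(memory, segment)          # then fold the segment into memory
```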
Related papers
- MLLM as Video Narrator: Mitigating Modality Imbalance in Video Moment Retrieval [53.417646562344906]
Video Moment Retrieval (VMR) aims to localize a specific temporal segment within an untrimmed long video given a natural language query.
Existing methods often suffer from inadequate training annotations, i.e., the sentence typically matches with a fraction of the prominent video content in the foreground with limited wording diversity.
This intrinsic modality imbalance leaves a considerable portion of visual information remaining unaligned with text.
In this work, we take an MLLM as a video narrator to generate plausible textual descriptions of the video, thereby mitigating the modality imbalance and boosting the temporal localization.
arXiv Detail & Related papers (2024-06-25T18:39:43Z) - Lip2Vec: Efficient and Robust Visual Speech Recognition via
Latent-to-Latent Visual to Audio Representation Mapping [4.271091833712731]
We propose a simple approach, named Lip2Vec that is based on learning a prior model.
The proposed model compares favorably with fully-supervised learning methods on the LRS3 dataset, achieving 26% WER.
We believe that reprogramming the VSR as an ASR task narrows the performance gap between the two and paves the way for more flexible formulations of lip reading.
arXiv Detail & Related papers (2023-08-11T12:59:02Z) - SelfTalk: A Self-Supervised Commutative Training Diagram to Comprehend
3D Talking Faces [28.40393487247833]
Speech-driven 3D face animation is extending its applications to various multimedia fields.
Previous research has generated promising realistic lip movements and facial expressions from audio signals.
We propose a novel framework, SelfTalk, which involves self-supervision in a cross-modal network system to learn 3D talking faces.
arXiv Detail & Related papers (2023-06-19T09:39:10Z) - Seeing What You Said: Talking Face Generation Guided by a Lip Reading
Expert [89.07178484337865]
Talking face generation, also known as speech-to-lip generation, reconstructs lip-related facial motions from coherent speech input.
Previous studies revealed the importance of lip-speech synchronization and visual quality.
We propose using a lip-reading expert to improve the intelligibility of the generated lip regions.
arXiv Detail & Related papers (2023-03-29T07:51:07Z) - Boundary-aware Self-supervised Learning for Video Scene Segmentation [20.713635723315527]
Video scene segmentation is a task of temporally localizing scene boundaries in a video.
We introduce three novel boundary-aware pretext tasks: Shot-Scene Matching, Contextual Group Matching and Pseudo-boundary Prediction.
We achieve the new state-of-the-art on the MovieNet-SSeg benchmark.
arXiv Detail & Related papers (2022-01-14T02:14:07Z) - Sub-word Level Lip Reading With Visual Attention [88.89348882036512]
We focus on the unique challenges encountered in lip reading and propose tailored solutions.
We obtain state-of-the-art results on the challenging LRS2 and LRS3 benchmarks when training on public datasets.
Our best model achieves 22.6% word error rate on the LRS2 dataset, a performance unprecedented for lip reading models.
arXiv Detail & Related papers (2021-10-14T17:59:57Z) - Long-Short Temporal Contrastive Learning of Video Transformers [62.71874976426988]
Self-supervised pretraining of video transformers on video-only datasets can lead to action recognition results on par or better than those obtained with supervised pretraining on large-scale image datasets.
Our approach, named Long-Short Temporal Contrastive Learning, enables video transformers to learn an effective clip-level representation by predicting temporal context captured from a longer temporal extent.
arXiv Detail & Related papers (2021-06-17T02:30:26Z) - LiRA: Learning Visual Speech Representations from Audio through
Self-supervision [53.18768477520411]
We propose Learning visual speech Representations from Audio via self-supervision (LiRA).
Specifically, we train a ResNet+Conformer model to predict acoustic features from unlabelled visual speech.
We show that our approach significantly outperforms other self-supervised methods on the Lip Reading in the Wild dataset.
arXiv Detail & Related papers (2021-06-16T23:20:06Z)