Advances and Challenges in Deep Lip Reading
- URL: http://arxiv.org/abs/2110.07879v1
- Date: Fri, 15 Oct 2021 06:18:26 GMT
- Title: Advances and Challenges in Deep Lip Reading
- Authors: Marzieh Oghbaie, Arian Sabaghi, Kooshan Hashemifard, and Mohammad Akbari
- Abstract summary: This paper provides a comprehensive survey of the state-of-the-art deep learning based Visual Speech Recognition research.
We focus on data challenges, task-specific complications, and the corresponding solutions.
We also discuss the main modules of a VSR pipeline and the influential datasets.
- Score: 2.930266486910376
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Driven by deep learning techniques and large-scale datasets, recent years
have witnessed a paradigm shift in automatic lip reading. While the main thrust
of Visual Speech Recognition (VSR) was improving accuracy of Audio Speech
Recognition systems, other potential applications, such as biometric
identification, and the promised gains of VSR systems, have motivated extensive
efforts on developing the lip reading technology. This paper provides a
comprehensive survey of the state-of-the-art deep learning based VSR research
with a focus on data challenges, task-specific complications, and the
corresponding solutions. Advancements in these directions will expedite the
transformation of silent speech interface from theory to practice. We also
discuss the main modules of a VSR pipeline and the influential datasets.
Finally, we introduce some typical VSR application concerns and impediments to
real-world scenarios as well as future research directions.
Related papers
- Automatic Speech Recognition using Advanced Deep Learning Approaches: A survey [2.716339075963185]
Recent advancements in deep learning (DL) have posed a significant challenge for automatic speech recognition (ASR).
ASR relies on extensive training datasets, including confidential ones, and demands substantial computational and storage resources.
Advanced DL techniques like deep transfer learning (DTL), federated learning (FL), and reinforcement learning (RL) address these issues.
arXiv Detail & Related papers (2024-03-02T16:25:42Z)
- What to Remember: Self-Adaptive Continual Learning for Audio Deepfake Detection [53.063161380423715]
Existing detection models have shown remarkable success in discriminating known deepfake audio, but struggle when encountering new attack types.
We propose a continual learning approach called Radian Weight Modification (RWM) for audio deepfake detection.
arXiv Detail & Related papers (2023-12-15T09:52:17Z)
- AV-RIR: Audio-Visual Room Impulse Response Estimation [49.469389715876915]
Accurate estimation of Room Impulse Response (RIR) is important for speech processing and AR/VR applications.
We propose AV-RIR, a novel multi-modal multi-task learning approach to accurately estimate the RIR from a given reverberant speech signal and visual cues of its corresponding environment.
arXiv Detail & Related papers (2023-11-30T22:58:30Z)
- Radio Frequency Fingerprinting via Deep Learning: Challenges and Opportunities [4.800138615859937]
Radio Frequency Fingerprinting (RFF) techniques promise to authenticate wireless devices at the physical layer based on inherent hardware imperfections introduced during manufacturing.
Recent advances in Machine Learning, particularly in Deep Learning (DL), have improved the ability of RFF systems to extract and learn complex features that make up the device-specific fingerprint.
This paper systematically identifies and analyzes the essential considerations and challenges encountered in the creation of DL-based RFF systems.
arXiv Detail & Related papers (2023-10-25T06:45:49Z)
- Lip2Vec: Efficient and Robust Visual Speech Recognition via Latent-to-Latent Visual to Audio Representation Mapping [4.271091833712731]
We propose a simple approach, named Lip2Vec that is based on learning a prior model.
The proposed model compares favorably with fully-supervised learning methods on the LRS3 dataset, achieving a WER of 26%.
We believe that reprogramming the VSR as an ASR task narrows the performance gap between the two and paves the way for more flexible formulations of lip reading.
arXiv Detail & Related papers (2023-08-11T12:59:02Z)
- Automated Speaker Independent Visual Speech Recognition: A Comprehensive Survey [0.0]
Speaker-independent VSR is a complex task that involves identifying spoken words or phrases from video recordings of a speaker's facial movements.
This survey provides an in-depth analysis of speaker-independent VSR systems evolution from 1990 to 2023.
arXiv Detail & Related papers (2023-06-14T07:33:43Z)
- NPVForensics: Jointing Non-critical Phonemes and Visemes for Deepfake Detection [50.33525966541906]
Existing multimodal detection methods capture audio-visual inconsistencies to expose Deepfake videos.
We propose a novel Deepfake detection method to mine the correlation between Non-critical Phonemes and Visemes, termed NPVForensics.
Our model can be easily adapted to the downstream Deepfake datasets with fine-tuning.
arXiv Detail & Related papers (2023-06-12T06:06:05Z)
- Visualizing Automatic Speech Recognition -- Means for a Better Understanding? [0.1868368163807795]
We show how attribution methods, which we import from image recognition and suitably adapt to handle audio data, can help to clarify the workings of ASR.
Taking DeepSpeech, an end-to-end model for ASR, as a case study, we show how these techniques help to visualize which features of the input are the most influential in determining the output.
arXiv Detail & Related papers (2022-02-01T13:35:08Z)
- Deep Recurrent Encoder: A scalable end-to-end network to model brain signals [122.1055193683784]
We propose an end-to-end deep learning architecture trained to predict the brain responses of multiple subjects at once.
We successfully test this approach on a large cohort of magnetoencephalography (MEG) recordings acquired during a one-hour reading task.
arXiv Detail & Related papers (2021-03-03T11:39:17Z)
- Video Super Resolution Based on Deep Learning: A Comprehensive Survey [87.30395002197344]
We comprehensively investigate 33 state-of-the-art video super-resolution (VSR) methods based on deep learning.
We propose a taxonomy and classify the methods into six sub-categories according to the ways of utilizing inter-frame information.
We summarize and compare the performance of representative VSR methods on several benchmark datasets.
arXiv Detail & Related papers (2020-07-25T13:39:54Z)
- Improving Readability for Automatic Speech Recognition Transcription [50.86019112545596]
We propose a novel NLP task called ASR post-processing for readability (APR).
APR aims to transform the noisy ASR output into a readable text for humans and downstream tasks while maintaining the semantic meaning of the speaker.
We compare fine-tuned models based on several open-sourced and adapted pre-trained models with the traditional pipeline method.
arXiv Detail & Related papers (2020-04-09T09:26:42Z)