Automated Speaker Independent Visual Speech Recognition: A Comprehensive
Survey
- URL: http://arxiv.org/abs/2306.08314v1
- Date: Wed, 14 Jun 2023 07:33:43 GMT
- Authors: Praneeth Nemani, G. Sai Krishna, Supriya Kundrapu
- Abstract summary: Speaker-independent VSR is a complex task that involves identifying spoken words or phrases from video recordings of a speaker's facial movements.
This survey provides an in-depth analysis of the evolution of speaker-independent VSR systems from 1990 to 2023.
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Speaker-independent VSR is a complex task that involves identifying spoken
words or phrases from video recordings of a speaker's facial movements. Over
the years, there has been a considerable amount of research in the field of VSR
involving different algorithms and datasets to evaluate system performance.
These efforts have resulted in significant progress in developing effective VSR
models, creating new opportunities for further research in this area. This
survey provides a detailed examination of the progression of VSR over the past
three decades, with a particular emphasis on the transition from
speaker-dependent to speaker-independent systems. We also provide a
comprehensive overview of the various datasets used in VSR research and the
preprocessing techniques employed to achieve speaker independence. The survey
covers works published from 1990 to 2023, thoroughly analyzing each and
comparing them on various parameters. It traces the development of VSR systems
over time and highlights the need for end-to-end pipelines for
speaker-independent VSR. The pictorial
representation offers a clear and concise overview of the techniques used in
speaker-independent VSR, thereby aiding in the comprehension and analysis of
the various methodologies. The survey also highlights the strengths and
limitations of each technique and provides insights into developing novel
approaches for analyzing visual speech cues. Overall, this comprehensive review
offers insight into the current state of the art in speaker-independent VSR and
highlights potential areas for future research.
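The abstract emphasizes preprocessing as a key step toward speaker independence. As a purely illustrative sketch, and not a technique taken from this paper, one common normalization applied to mouth-region crops before they reach a recognizer is per-sequence mean/variance normalization, which removes speaker-specific brightness and contrast differences:

```python
def normalize_roi_sequence(frames):
    """Zero-mean, unit-variance normalization of a sequence of
    grayscale mouth-ROI crops, given as a list of 2D lists of
    pixel intensities. Removes per-speaker brightness/contrast
    differences before the frames feed a temporal model."""
    pixels = [p for frame in frames for row in frame for p in row]
    mu = sum(pixels) / len(pixels)
    var = sum((p - mu) ** 2 for p in pixels) / len(pixels)
    sigma = var ** 0.5 or 1e-8  # guard against constant input
    return [[[(p - mu) / sigma for p in row] for row in frame]
            for frame in frames]

# Toy example: two 2x2 "frames" at different brightness levels
seq = [[[10, 20], [30, 40]], [[50, 60], [70, 80]]]
norm = normalize_roi_sequence(seq)
```

In practice, face detection and lip-landmark cropping would precede this step; the function above only illustrates the normalization itself, not a full pipeline.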
Related papers
- Retrieval-Augmented Audio Deepfake Detection [27.13059118273849]
We propose a retrieval-augmented detection framework that augments test samples with similar retrieved samples for enhanced detection.
Experiments show the superior performance of the proposed RAD framework over baseline methods.
arXiv Detail & Related papers (2024-04-22T05:46:40Z)
- AV-RIR: Audio-Visual Room Impulse Response Estimation [49.469389715876915]
Accurate estimation of Room Impulse Response (RIR) is important for speech processing and AR/VR applications.
We propose AV-RIR, a novel multi-modal multi-task learning approach to accurately estimate the RIR from a given reverberant speech signal and visual cues of its corresponding environment.
arXiv Detail & Related papers (2023-11-30T22:58:30Z)
- A Survey on Interpretable Cross-modal Reasoning [64.37362731950843]
Cross-modal reasoning (CMR) has emerged as a pivotal area with applications spanning from multimedia analysis to healthcare diagnostics.
This survey delves into the realm of interpretable cross-modal reasoning (I-CMR).
This survey presents a comprehensive overview of the typical methods with a three-level taxonomy for I-CMR.
arXiv Detail & Related papers (2023-09-05T05:06:48Z)
- HEAR 2021: Holistic Evaluation of Audio Representations [55.324557862041985]
The HEAR 2021 NeurIPS challenge is to develop a general-purpose audio representation that provides a strong basis for learning.
HEAR 2021 evaluates audio representations using a benchmark suite across a variety of domains, including speech, environmental sound, and music.
Twenty-nine models by thirteen external teams were evaluated on nineteen diverse downstream tasks derived from sixteen datasets.
arXiv Detail & Related papers (2022-03-06T18:13:09Z)
- Advances and Challenges in Deep Lip Reading [2.930266486910376]
This paper provides a comprehensive survey of the state-of-the-art deep learning based Visual Speech Recognition research.
We focus on data challenges, task-specific complications, and the corresponding solutions.
We also discuss the main modules of a VSR pipeline and the influential datasets.
arXiv Detail & Related papers (2021-10-15T06:18:26Z)
- Self-supervised Text-independent Speaker Verification using Prototypical Momentum Contrastive Learning [58.14807331265752]
We show that better speaker embeddings can be learned by momentum contrastive learning.
We generalize the self-supervised framework to a semi-supervised scenario where only a small portion of the data is labeled.
arXiv Detail & Related papers (2020-12-13T23:23:39Z)
- Video Super Resolution Based on Deep Learning: A Comprehensive Survey [87.30395002197344]
We comprehensively investigate 33 state-of-the-art video super-resolution (VSR) methods based on deep learning.
We propose a taxonomy and classify the methods into six sub-categories according to the ways of utilizing inter-frame information.
We summarize and compare the performance of representative VSR methods on several benchmark datasets.
arXiv Detail & Related papers (2020-07-25T13:39:54Z)
- Active Speakers in Context [88.22935329360618]
Current methods for active speaker detection focus on modeling short-term audiovisual information from a single speaker.
This paper introduces the Active Speaker Context, a novel representation that models relationships between multiple speakers over long time horizons.
Our experiments show that a structured feature ensemble already benefits the active speaker detection performance.
arXiv Detail & Related papers (2020-05-20T01:14:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.