Automated Speaker Independent Visual Speech Recognition: A Comprehensive
Survey
- URL: http://arxiv.org/abs/2306.08314v1
- Date: Wed, 14 Jun 2023 07:33:43 GMT
- Authors: Praneeth Nemani, G. Sai Krishna, Supriya Kundrapu
- Abstract summary: Speaker-independent VSR is a complex task that involves identifying spoken words or phrases from video recordings of a speaker's facial movements.
This survey provides an in-depth analysis of the evolution of speaker-independent VSR systems from 1990 to 2023.
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Speaker-independent VSR is a complex task that involves identifying spoken
words or phrases from video recordings of a speaker's facial movements. Over
the years, there has been a considerable amount of research in the field of VSR
involving different algorithms and datasets to evaluate system performance.
These efforts have resulted in significant progress in developing effective VSR
models, creating new opportunities for further research in this area. This
survey provides a detailed examination of the progression of VSR over the past
three decades, with a particular emphasis on the transition from
speaker-dependent to speaker-independent systems. We also provide a
comprehensive overview of the various datasets used in VSR research and the
preprocessing techniques employed to achieve speaker independence. The survey
covers works published from 1990 to 2023, thoroughly analyzing each and
comparing them on various parameters. It traces the development of VSR systems
over time and highlights the need for end-to-end pipelines for
speaker-independent VSR. The pictorial
representation offers a clear and concise overview of the techniques used in
speaker-independent VSR, thereby aiding in the comprehension and analysis of
the various methodologies. The survey also highlights the strengths and
limitations of each technique and provides insights into developing novel
approaches for analyzing visual speech cues. Overall, this comprehensive review
offers insight into the current state of the art in speaker-independent VSR and
highlights potential areas for future research.
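The abstract emphasizes preprocessing as a key step toward speaker independence. As a purely illustrative sketch, and not a technique taken from this paper, one common normalization applied to mouth-region crops before they reach a recognizer is per-sequence mean/variance normalization, which removes speaker-specific brightness and contrast differences:

```python
def normalize_roi_sequence(frames):
    """Zero-mean, unit-variance normalization of a sequence of
    grayscale mouth-ROI crops, given as a list of 2D lists of
    pixel intensities. Removes per-speaker brightness/contrast
    differences before the frames feed a temporal model."""
    pixels = [p for frame in frames for row in frame for p in row]
    mu = sum(pixels) / len(pixels)
    var = sum((p - mu) ** 2 for p in pixels) / len(pixels)
    sigma = var ** 0.5 or 1e-8  # guard against constant input
    return [[[(p - mu) / sigma for p in row] for row in frame]
            for frame in frames]

# Toy example: two 2x2 "frames" at different brightness levels
seq = [[[10, 20], [30, 40]], [[50, 60], [70, 80]]]
norm = normalize_roi_sequence(seq)
```

In practice, face detection and lip-landmark cropping would precede this step; the function above only illustrates the normalization itself, not a full pipeline.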
Related papers
- Retrieval-Augmented Audio Deepfake Detection [27.13059118273849]
We propose a retrieval-augmented detection framework that augments test samples with similar retrieved samples for enhanced detection.
Experiments show the superior performance of the proposed RAD framework over baseline methods.
arXiv Detail & Related papers (2024-04-22T05:46:40Z)
- AV-RIR: Audio-Visual Room Impulse Response Estimation [49.469389715876915]
Accurate estimation of Room Impulse Response (RIR) is important for speech processing and AR/VR applications.
We propose AV-RIR, a novel multi-modal multi-task learning approach to accurately estimate the RIR from a given reverberant speech signal and visual cues of its corresponding environment.
arXiv Detail & Related papers (2023-11-30T22:58:30Z)
- A Survey on Interpretable Cross-modal Reasoning [64.37362731950843]
Cross-modal reasoning (CMR) has emerged as a pivotal area with applications spanning from multimedia analysis to healthcare diagnostics.
This survey delves into the realm of interpretable cross-modal reasoning (I-CMR).
This survey presents a comprehensive overview of the typical methods with a three-level taxonomy for I-CMR.
arXiv Detail & Related papers (2023-09-05T05:06:48Z)
- HEAR 2021: Holistic Evaluation of Audio Representations [55.324557862041985]
The HEAR 2021 NeurIPS challenge is to develop a general-purpose audio representation that provides a strong basis for learning.
HEAR 2021 evaluates audio representations using a benchmark suite across a variety of domains, including speech, environmental sound, and music.
Twenty-nine models by thirteen external teams were evaluated on nineteen diverse downstream tasks derived from sixteen datasets.
arXiv Detail & Related papers (2022-03-06T18:13:09Z)
- Advances and Challenges in Deep Lip Reading [2.930266486910376]
This paper provides a comprehensive survey of the state-of-the-art deep learning based Visual Speech Recognition research.
We focus on data challenges, task-specific complications, and the corresponding solutions.
We also discuss the main modules of a VSR pipeline and the influential datasets.
arXiv Detail & Related papers (2021-10-15T06:18:26Z)
- Self-supervised Text-independent Speaker Verification using Prototypical Momentum Contrastive Learning [58.14807331265752]
We show that better speaker embeddings can be learned by momentum contrastive learning.
We generalize the self-supervised framework to a semi-supervised scenario where only a small portion of the data is labeled.
arXiv Detail & Related papers (2020-12-13T23:23:39Z)
- Video Super Resolution Based on Deep Learning: A Comprehensive Survey [87.30395002197344]
We comprehensively investigate 33 state-of-the-art video super-resolution (VSR) methods based on deep learning.
We propose a taxonomy and classify the methods into six sub-categories according to the ways of utilizing inter-frame information.
We summarize and compare the performance of representative VSR methods on several benchmark datasets.
arXiv Detail & Related papers (2020-07-25T13:39:54Z)
- Active Speakers in Context [88.22935329360618]
Current methods for active speaker detection focus on modeling short-term audiovisual information from a single speaker.
This paper introduces the Active Speaker Context, a novel representation that models relationships between multiple speakers over long time horizons.
Our experiments show that a structured feature ensemble already benefits the active speaker detection performance.
arXiv Detail & Related papers (2020-05-20T01:14:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.