Deep Learning for Visual Speech Analysis: A Survey
- URL: http://arxiv.org/abs/2205.10839v2
- Date: Thu, 14 Mar 2024 05:06:18 GMT
- Title: Deep Learning for Visual Speech Analysis: A Survey
- Authors: Changchong Sheng, Gangyao Kuang, Liang Bai, Chenping Hou, Yulan Guo, Xin Xu, Matti Pietikäinen, Li Liu,
- Abstract summary: This paper presents a review of recent progress in deep learning methods on visual speech analysis.
We cover different aspects of visual speech, including fundamental problems, challenges, benchmark datasets, a taxonomy of existing methods, and state-of-the-art performance.
- Score: 54.53032361204449
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual speech, referring to the visual domain of speech, has attracted increasing attention due to its wide applications, such as public security, medical treatment, military defense, and film entertainment. As a powerful AI strategy, deep learning techniques have extensively promoted the development of visual speech learning. Over the past five years, numerous deep learning based methods have been proposed to address various problems in this area, especially automatic visual speech recognition and generation. To push forward future research on visual speech, this paper aims to present a comprehensive review of recent progress in deep learning methods on visual speech analysis. We cover different aspects of visual speech, including fundamental problems, challenges, benchmark datasets, a taxonomy of existing methods, and state-of-the-art performance. Besides, we also identify gaps in current research and discuss inspiring future research directions.
Related papers
- Learning in Audio-visual Context: A Review, Analysis, and New
Perspective [88.40519011197144]
This survey aims to systematically organize and analyze studies of the audio-visual field.
We introduce several key findings that have inspired our computational studies.
We propose a new perspective on audio-visual scene understanding, then discuss and analyze the feasible future direction of the audio-visual learning area.
arXiv Detail & Related papers (2022-08-20T02:15:44Z) - Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z) - E-ffective: A Visual Analytic System for Exploring the Emotion and
Effectiveness of Inspirational Speeches [57.279044079196105]
E-ffective is a visual analytic system allowing speaking experts and novices to analyze both the role of speech factors and their contribution in effective speeches.
Two novel visualizations include E-spiral (that shows the emotional shifts in speeches in a visually compact way) and E-script (that connects speech content with key speech delivery information.
arXiv Detail & Related papers (2021-10-28T06:14:27Z) - From Show to Tell: A Survey on Image Captioning [48.98681267347662]
Connecting Vision and Language plays an essential role in Generative Intelligence.
Research in image captioning has not reached a conclusive answer yet.
This work aims at providing a comprehensive overview and categorization of image captioning approaches.
arXiv Detail & Related papers (2021-07-14T18:00:54Z) - Leveraging Recent Advances in Deep Learning for Audio-Visual Emotion
Recognition [2.1485350418225244]
Spontaneous multi-modal emotion recognition has been extensively studied for human behavior analysis.
We propose a new deep learning-based approach for audio-visual emotion recognition.
arXiv Detail & Related papers (2021-03-16T15:49:15Z) - An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and
Separation [57.68765353264689]
Speech enhancement and speech separation are two related tasks.
Traditionally, these tasks have been tackled using signal processing and machine learning techniques.
Deep learning has been exploited to achieve strong performance.
arXiv Detail & Related papers (2020-08-21T17:24:09Z) - Deep learning for scene recognition from visual data: a survey [2.580765958706854]
This work aims to be a review of the state-of-the-art in scene recognition with deep learning models from visual data.
Scene recognition is still an emerging field in computer vision, which has been addressed from a single image and dynamic image perspective.
arXiv Detail & Related papers (2020-07-03T16:53:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.