Learning in Audio-visual Context: A Review, Analysis, and New
Perspective
- URL: http://arxiv.org/abs/2208.09579v1
- Date: Sat, 20 Aug 2022 02:15:44 GMT
- Title: Learning in Audio-visual Context: A Review, Analysis, and New
Perspective
- Authors: Yake Wei, Di Hu, Yapeng Tian, Xuelong Li
- Abstract summary: This survey aims to systematically organize and analyze studies of the audio-visual field.
We introduce several key findings that have inspired our computational studies.
We propose a new perspective on audio-visual scene understanding, then discuss and analyze feasible future directions of the audio-visual learning area.
- Score: 88.40519011197144
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sight and hearing are two senses that play a vital role in human
communication and scene understanding. To mimic human perception ability,
audio-visual learning, aimed at developing computational approaches to learn
from both audio and visual modalities, has been a flourishing field in recent
years. A comprehensive survey that systematically organizes and analyzes
studies in the audio-visual field is therefore needed. Starting from an analysis of
audio-visual cognition foundations, we introduce several key findings that have
inspired our computational studies. Then, we systematically review the recent
audio-visual learning studies and divide them into three categories:
audio-visual boosting, cross-modal perception and audio-visual collaboration.
Through our analysis, we find that the consistency of audio-visual data
across the semantic, spatial, and temporal dimensions supports these studies. To revisit the
current development of the audio-visual learning field from a more macro view,
we further propose a new perspective on audio-visual scene understanding, then
discuss and analyze feasible future directions of the audio-visual learning
area. Overall, this survey reviews the current audio-visual learning field from
multiple perspectives and offers an outlook on its development. We hope it can
give researchers a better understanding of this area. A website with a
continually updated version of this survey is available: \url{https://gewu-lab.github.io/audio-visual-learning/}.
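
As an illustration of the semantic consistency the survey identifies (this sketch is ours, not code from the paper), the snippet below shows a symmetric cross-modal contrastive objective of the kind that underlies many audio-visual boosting and cross-modal perception methods; the batch size, embedding dimension, and temperature are placeholder assumptions.

```python
# Illustrative sketch only: a symmetric InfoNCE loss that aligns paired
# audio and video embeddings, exploiting their semantic consistency.
import torch
import torch.nn.functional as F

def audio_visual_infonce(audio_emb, video_emb, temperature=0.07):
    """audio_emb, video_emb: (batch, dim) tensors; row i of each comes from
    the same clip, so the diagonal of the similarity matrix holds the
    positive pairs and all off-diagonal entries act as negatives."""
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    logits = a @ v.t() / temperature        # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0))       # positives lie on the diagonal
    loss_a2v = F.cross_entropy(logits, targets)
    loss_v2a = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_a2v + loss_v2a)

# Toy usage: random embeddings stand in for real encoder outputs.
audio = torch.randn(8, 256)
video = torch.randn(8, 256)
print(audio_visual_infonce(audio, video).item())
```
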
Related papers
- Quantitative Analysis of Audio-Visual Tasks: An Information-Theoretic Perspective [12.178918299455898]
This paper presents a quantitative analysis based on information theory, focusing on information intersection between different modalities.
Our results show that this analysis is valuable for understanding the difficulties of audio-visual processing tasks as well as the benefits that could be obtained by modality integration.
arXiv Detail & Related papers (2024-09-29T06:30:46Z)
- Estimating Visual Information From Audio Through Manifold Learning [14.113590443352495]
We propose a new framework for extracting visual information about a scene only using audio signals.
Our framework is based on Manifold Learning and consists of two steps.
We show that our method is able to produce meaningful images from audio using a publicly available audio/visual dataset.
arXiv Detail & Related papers (2022-08-03T20:47:11Z)
- Deep Learning for Visual Speech Analysis: A Survey [54.53032361204449]
This paper presents a review of recent progress in deep learning methods on visual speech analysis.
We cover different aspects of visual speech, including fundamental problems, challenges, benchmark datasets, a taxonomy of existing methods, and state-of-the-art performance.
arXiv Detail & Related papers (2022-05-22T14:44:53Z)
- Recent Advances and Challenges in Deep Audio-Visual Correlation Learning [7.273353828127817]
This paper focuses on state-of-the-art (SOTA) models used to learn correlations between audio and video.
We also discuss task definitions and paradigms applied in AI multimedia.
arXiv Detail & Related papers (2022-02-28T10:43:01Z)
- Learning Representations from Audio-Visual Spatial Alignment [76.29670751012198]
We introduce a novel self-supervised pretext task for learning representations from audio-visual content.
The advantages of the proposed pretext task are demonstrated on a variety of audio and visual downstream tasks.
arXiv Detail & Related papers (2020-11-03T16:20:04Z)
- An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation [57.68765353264689]
Speech enhancement and speech separation are two related tasks.
Traditionally, they have been tackled with signal processing and machine learning techniques; more recently, deep learning has been exploited to achieve strong performance.
arXiv Detail & Related papers (2020-08-21T17:24:09Z)
- Deep Audio-Visual Learning: A Survey [53.487938108404244]
We divide the current audio-visual learning tasks into four different subfields.
We discuss state-of-the-art methods as well as the remaining challenges of each subfield.
We summarize the commonly used datasets and performance metrics.
arXiv Detail & Related papers (2020-01-14T13:11:21Z)
- Visually Guided Self Supervised Learning of Speech Representations [62.23736312957182]
We propose a framework for learning audio representations guided by the visual modality in the context of audiovisual speech.
We employ a generative audio-to-video training scheme in which we animate a still image corresponding to a given audio clip and optimize the generated video to be as close as possible to the real video of the speech segment.
We achieve state-of-the-art results for emotion recognition and competitive results for speech recognition.
arXiv Detail & Related papers (2020-01-13T14:53:22Z)
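
The generative audio-to-video scheme described in the last entry above can be sketched roughly as follows. This is a hypothetical, minimal rendering of the training-loop shape (animate a still image from audio, then pull the generated video toward the real one); the `Generator` module, tensor shapes, and L1 reconstruction loss are our placeholder assumptions, not the authors' architecture.

```python
# Hedged sketch of a generative audio-to-video training step.
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Toy stand-in: maps (still image, audio features) to T video frames."""
    def __init__(self, audio_dim=128, frames=16):
        super().__init__()
        self.frames = frames
        self.audio_proj = nn.Linear(audio_dim, 3 * 64 * 64)

    def forward(self, still_image, audio_feats):
        # still_image: (B, 3, 64, 64); audio_feats: (B, T, audio_dim)
        motion = self.audio_proj(audio_feats)            # (B, T, 3*64*64)
        motion = motion.view(-1, self.frames, 3, 64, 64)
        # Add audio-driven motion to the identity frame for every time step.
        return still_image.unsqueeze(1) + motion         # (B, T, 3, 64, 64)

gen = Generator()
opt = torch.optim.Adam(gen.parameters(), lr=1e-4)
recon = nn.L1Loss()

still = torch.randn(2, 3, 64, 64)       # still image of the speaker
audio = torch.randn(2, 16, 128)         # per-frame audio features
real_video = torch.randn(2, 16, 3, 64, 64)

fake_video = gen(still, audio)
loss = recon(fake_video, real_video)    # push generated video toward the real one
opt.zero_grad()
loss.backward()
opt.step()
print(loss.item())
```
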