Deep Audio-Visual Learning: A Survey
- URL: http://arxiv.org/abs/2001.04758v1
- Date: Tue, 14 Jan 2020 13:11:21 GMT
- Title: Deep Audio-Visual Learning: A Survey
- Authors: Hao Zhu, Mandi Luo, Rui Wang, Aihua Zheng, and Ran He
- Abstract summary: We divide the current audio-visual learning tasks into four different subfields.
We discuss state-of-the-art methods as well as the remaining challenges of each subfield.
We summarize the commonly used datasets and performance metrics.
- Score: 53.487938108404244
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audio-visual learning, aimed at exploiting the relationship between the audio and visual modalities, has drawn considerable attention since deep learning began to achieve widespread success. Researchers tend to leverage these two modalities
either to improve the performance of previously considered single-modality
tasks or to address new challenging problems. In this paper, we provide a
comprehensive survey of recent audio-visual learning development. We divide the
current audio-visual learning tasks into four different subfields: audio-visual
separation and localization, audio-visual correspondence learning, audio-visual
generation, and audio-visual representation learning. State-of-the-art methods
as well as the remaining challenges of each subfield are further discussed.
Finally, we summarize the commonly used datasets and performance metrics.
Related papers
- Meta-Learning in Audio and Speech Processing: An End to End Comprehensive Review [0.0]
We present a systematic review of meta-learning methodologies in audio processing.
This includes audio-specific discussions on data augmentation, feature extraction, preprocessing techniques, meta-learners, and task selection strategies.
We aim to provide valuable insights and identify future research directions in the intersection of meta-learning and audio processing.
arXiv Detail & Related papers (2024-08-19T18:11:59Z)
- AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models [92.92233932921741]
We propose the AV-SUPERB benchmark that enables general-purpose evaluation of unimodal audio/visual and bimodal fusion representations.
We evaluate 5 recent self-supervised models and show that none of these models generalize to all tasks.
We show that representations may be improved with intermediate-task fine-tuning, and that audio event classification on AudioSet serves as a strong intermediate task.
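As a rough illustration of this intermediate-task fine-tuning recipe (not the AV-SUPERB code: the encoder, data, and dimensions below are synthetic stand-ins, and PyTorch is assumed):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class Encoder(nn.Module):
    """Stand-in for a pretrained self-supervised audio encoder."""
    def __init__(self, in_dim=128, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, dim), nn.ReLU(), nn.Linear(dim, dim))
    def forward(self, x):
        return self.net(x)

def finetune(encoder, head, batches, loss_fn, lr=1e-4):
    """One pass of joint encoder+head fine-tuning over a list of (x, y) batches."""
    opt = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=lr)
    for x, y in batches:
        loss = loss_fn(head(encoder(x)), y)
        opt.zero_grad(); loss.backward(); opt.step()

encoder = Encoder()

# Stage 1: intermediate task -- multi-label audio event classification
# (AudioSet defines 527 classes); synthetic stand-in data for illustration.
audioset = [(torch.randn(32, 128), torch.randint(0, 2, (32, 527)).float())]
finetune(encoder, nn.Linear(256, 527), audioset, nn.BCEWithLogitsLoss())

# Stage 2: the target task reuses the adapted encoder (e.g., an 8-class task).
target = [(torch.randn(32, 128), torch.randint(0, 8, (32,)))]
finetune(encoder, nn.Linear(256, 8), target, nn.CrossEntropyLoss())
```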
arXiv Detail & Related papers (2023-09-19T17:35:16Z)
- Looking Similar, Sounding Different: Leveraging Counterfactual Cross-Modal Pairs for Audiovisual Representation Learning [3.6204417068568424]
We use dubbed versions of movies and television shows to augment cross-modal contrastive learning.
Our approach learns to represent alternate audio tracks, which differ only in their speech content, similarly to the shared video.
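A minimal sketch of how dubbed tracks could serve as extra positives in a cross-modal contrastive objective (a multi-positive InfoNCE variant; the shapes and temperature are illustrative assumptions, not the paper's implementation):

```python
import torch
import torch.nn.functional as F

def dubbed_pair_contrastive_loss(video_emb, audio_embs, temperature=0.07):
    """
    video_emb:  (B, D) embeddings of B video clips.
    audio_embs: (B, K, D) K audio tracks per clip (original + K-1 dubs),
                all treated as positives for their own clip.
    """
    B, K, D = audio_embs.shape
    v = F.normalize(video_emb, dim=-1)                     # (B, D)
    a = F.normalize(audio_embs.reshape(B * K, D), dim=-1)  # (B*K, D)
    logits = v @ a.t() / temperature                       # (B, B*K)
    # Positive mask: each video matches all K tracks of its own clip.
    pos = torch.zeros_like(logits, dtype=torch.bool)
    for i in range(B):
        pos[i, i * K:(i + 1) * K] = True
    # Multi-positive InfoNCE: average log-likelihood over the positives.
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    return -(log_prob[pos].reshape(B, K)).mean()

loss = dubbed_pair_contrastive_loss(torch.randn(4, 256), torch.randn(4, 3, 256))
```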
arXiv Detail & Related papers (2023-04-12T04:17:45Z)
- Language-Guided Audio-Visual Source Separation via Trimodal Consistency [64.0580750128049]
A key challenge in this task is learning to associate the linguistic description of a sound-emitting object to its visual features and the corresponding components of the audio waveform.
We adapt off-the-shelf vision-language foundation models to provide pseudo-target supervision via two novel loss functions.
We demonstrate the effectiveness of our self-supervised approach on three audio-visual separation datasets.
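A toy sketch of a trimodal consistency term, assuming precomputed text, video, and separated-audio embeddings already live in a shared space (a simple cosine-agreement loss standing in for the paper's two loss functions, which it does not reproduce):

```python
import torch
import torch.nn.functional as F

def trimodal_consistency_loss(text_emb, video_emb, sep_audio_emb):
    """Pull the separated audio's embedding toward both the language
    description and the visual features of the sounding object, so all
    three modalities agree for matched examples."""
    t = F.normalize(text_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    a = F.normalize(sep_audio_emb, dim=-1)
    # Maximize cosine similarity for matched (audio, text) and (audio, video) pairs.
    return (1 - (a * t).sum(-1)).mean() + (1 - (a * v).sum(-1)).mean()

loss = trimodal_consistency_loss(
    torch.randn(8, 512), torch.randn(8, 512), torch.randn(8, 512))
```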
arXiv Detail & Related papers (2023-03-28T22:45:40Z)
- Learning in Audio-visual Context: A Review, Analysis, and New Perspective [88.40519011197144]
This survey aims to systematically organize and analyze studies of the audio-visual field.
We introduce several key findings that have inspired our computational studies.
We propose a new perspective on audio-visual scene understanding, then discuss and analyze feasible future directions for the audio-visual learning area.
arXiv Detail & Related papers (2022-08-20T02:15:44Z)
- An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation [57.68765353264689]
Speech enhancement and speech separation are two closely related tasks.
Traditionally, these tasks have been tackled using signal processing and machine learning techniques; more recently, deep learning has been exploited to achieve strong performance.
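A generic mask-based audio-visual enhancement model of the kind covered by such overviews, sketched in PyTorch (the architecture, dimensions, and frame-aligned visual features are illustrative assumptions, not any specific surveyed system):

```python
import torch
import torch.nn as nn

class AVMaskEstimator(nn.Module):
    """Fuse audio and visual streams, predict a time-frequency mask,
    and apply it to the noisy magnitude spectrogram."""
    def __init__(self, n_freq=257, vis_dim=512, hidden=256):
        super().__init__()
        self.a_proj = nn.Linear(n_freq, hidden)
        self.v_proj = nn.Linear(vis_dim, hidden)
        self.rnn = nn.GRU(2 * hidden, hidden, batch_first=True)
        self.mask = nn.Sequential(nn.Linear(hidden, n_freq), nn.Sigmoid())

    def forward(self, noisy_mag, lip_feats):
        # noisy_mag: (B, T, F) magnitude spectrogram; lip_feats: (B, T, vis_dim)
        # visual features assumed pre-aligned to the audio frame rate.
        fused = torch.cat([self.a_proj(noisy_mag), self.v_proj(lip_feats)], dim=-1)
        h, _ = self.rnn(fused)
        return self.mask(h) * noisy_mag  # enhanced magnitude

model = AVMaskEstimator()
enhanced = model(torch.rand(2, 100, 257), torch.rand(2, 100, 512))
```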
arXiv Detail & Related papers (2020-08-21T17:24:09Z)
- Visually Guided Self Supervised Learning of Speech Representations [62.23736312957182]
We propose a framework for learning audio representations guided by the visual modality in the context of audiovisual speech.
We employ a generative audio-to-video training scheme in which we animate a still image corresponding to a given audio clip and optimize the generated video to be as close as possible to the real video of the speech segment.
We achieve state-of-the-art results for emotion recognition and competitive results for speech recognition.
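A bare-bones sketch of the generative audio-to-video training scheme: a reconstruction loss on the generated frame back-propagates into the audio encoder, shaping its representation (all module shapes and the L1 objective are illustrative assumptions, not the authors' architecture):

```python
import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    """The representation we actually want; trained via the generation objective."""
    def __init__(self, in_dim=128, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, dim), nn.ReLU(), nn.Linear(dim, dim))
    def forward(self, a):
        return self.net(a)

class FrameGenerator(nn.Module):
    """Animates a still face image conditioned on the audio embedding."""
    def __init__(self, dim=256, img_dim=3 * 64 * 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + img_dim, 1024), nn.ReLU(), nn.Linear(1024, img_dim))
    def forward(self, audio_emb, still):
        return self.net(torch.cat([audio_emb, still.flatten(1)], dim=-1)).view_as(still)

enc, gen = AudioEncoder(), FrameGenerator()
opt = torch.optim.Adam(list(enc.parameters()) + list(gen.parameters()), lr=1e-4)

audio = torch.randn(4, 128)             # per-frame audio features
still = torch.rand(4, 3, 64, 64)        # identity image to animate
real_frame = torch.rand(4, 3, 64, 64)   # ground-truth video frame

# Reconstruction objective: make the generated frame match the real one;
# gradients flow into the audio encoder.
loss = nn.functional.l1_loss(gen(enc(audio), still), real_frame)
opt.zero_grad(); loss.backward(); opt.step()
```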
arXiv Detail & Related papers (2020-01-13T14:53:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.