Related papers: Recent Advances and Challenges in Deep Audio-Visual Correlation Learning

URL: http://arxiv.org/abs/2202.13673v1
Date: Mon, 28 Feb 2022 10:43:01 GMT
Title: Recent Advances and Challenges in Deep Audio-Visual Correlation Learning
Authors: Luís Vilaça, Yi Yu and Paula Viana
Abstract summary: This paper focuses on state-of-the-art (SOTA) models used to learn correlations between audio and video. We also discuss task definitions and paradigms applied in AI multimedia.
Score: 7.273353828127817
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Audio-visual correlation learning aims to capture essential correspondences and understand natural phenomena between audio and video. With the rapid growth of deep learning, an increasing amount of attention has been paid to this emerging research issue. Over the past few years, various methods and datasets have been proposed for audio-visual correlation learning, which motivate us to conduct a comprehensive survey. This survey paper focuses on state-of-the-art (SOTA) models used to learn correlations between audio and video, and also discusses task definitions and paradigms applied in AI multimedia. In addition, we investigate some objective functions frequently used for optimizing audio-visual correlation learning models and discuss how audio-visual data is exploited in the optimization process. Most importantly, we provide an extensive comparison and summarization of the recent progress of SOTA audio-visual correlation learning and discuss future research directions.

Related papers

Exploiting Temporal Audio-Visual Correlation Embedding for Audio-Driven One-Shot Talking Head Animation [62.218932509432314]
Inherently, the temporal relationship of adjacent audio clips is highly correlated with that of the corresponding adjacent video frames. We learn audio-visual correlations and integrate the correlations to help enhance feature representation and regularize final generation.
arXiv Detail & Related papers (2025-04-08T07:23:28Z)
A Survey of Recent Advances and Challenges in Deep Audio-Visual Correlation Learning [6.595840767689357]
Audio-visual correlation learning aims to capture and understand natural phenomena between audio and visual data. The rapid growth of Deep Learning propelled the development of proposals that process audio-visual data. We provide a summarization of the recent progress of Audio-Visual Correlation Learning and discuss the future research directions.
arXiv Detail & Related papers (2024-11-24T03:26:34Z)
Meta-Learning in Audio and Speech Processing: An End to End Comprehensive Review [0.0]
We present a systematic review of meta-learning methodologies in audio processing. This includes audio-specific discussions on data augmentation, feature extraction, preprocessing techniques, meta-learners, and task selection strategies. We aim to provide valuable insights and identify future research directions in the intersection of meta-learning and audio processing.
arXiv Detail & Related papers (2024-08-19T18:11:59Z)
Sequential Contrastive Audio-Visual Learning [12.848371604063168]
We propose sequential contrastive audio-visual learning (SCAV), which contrasts examples based on their non-aggregated representation space using sequential distances. Retrieval experiments with the VGGSound and Music datasets demonstrate the effectiveness of SCAV. We also show that models trained with SCAV exhibit a high degree of flexibility regarding the metric employed for retrieval, allowing them to operate on a spectrum of efficiency-accuracy trade-offs.
arXiv Detail & Related papers (2024-07-08T09:45:20Z)
STELLA: Continual Audio-Video Pre-training with Spatio-Temporal Localized Alignment [61.83340833859382]
Continuously learning a variety of audio-video semantics over time is crucial for audio-related reasoning tasks. This is a nontrivial problem and poses two critical challenges: sparse spatio-temporal correlation between audio-video pairs and multimodal correlation overwriting that forgets audio-video relations. We propose a continual audio-video pre-training method with two novel ideas.
arXiv Detail & Related papers (2023-10-12T10:50:21Z)
AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models [92.92233932921741]
We propose the AV-SUPERB benchmark that enables general-purpose evaluation of unimodal audio/visual and bimodal fusion representations. We evaluate 5 recent self-supervised models and show that none of these models generalize to all tasks. We show that representations may be improved with intermediate-task fine-tuning and audio event classification with AudioSet serves as a strong intermediate task.
arXiv Detail & Related papers (2023-09-19T17:35:16Z)
Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework. First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes. Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
Improving Natural-Language-based Audio Retrieval with Transfer Learning and Audio & Text Augmentations [7.817685358710508]
We propose a system to project recordings and textual descriptions into a shared audio-caption space. Our results show that the augmentation strategies used reduce overfitting and improve retrieval performance. We further show that pre-training the system on the AudioCaps dataset leads to additional improvements.
arXiv Detail & Related papers (2022-08-24T11:54:42Z)
Learning in Audio-visual Context: A Review, Analysis, and New Perspective [88.40519011197144]
This survey aims to systematically organize and analyze studies of the audio-visual field. We introduce several key findings that have inspired our computational studies. We propose a new perspective on audio-visual scene understanding, then discuss and analyze the feasible future direction of the audio-visual learning area.
arXiv Detail & Related papers (2022-08-20T02:15:44Z)
Multi-Modal Multi-Correlation Learning for Audio-Visual Speech Separation [38.75352529988137]
We propose a multi-modal multi-correlation learning framework targeting the task of audio-visual speech separation. We define two key correlations: (1) identity correlation (between timbre and facial attributes); (2) phonetic correlation. For implementation, a contrastive learning or adversarial training approach is applied to maximize these two correlations.
arXiv Detail & Related papers (2022-07-04T04:53:39Z)
An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation [57.68765353264689]
Speech enhancement and speech separation are two related tasks. Traditionally, these tasks have been tackled using signal processing and machine learning techniques. Deep learning has been exploited to achieve strong performance.
arXiv Detail & Related papers (2020-08-21T17:24:09Z)
Deep Audio-Visual Learning: A Survey [53.487938108404244]
We divide the current audio-visual learning tasks into four different subfields. We discuss state-of-the-art methods as well as the remaining challenges of each subfield. We summarize the commonly used datasets and performance metrics.
arXiv Detail & Related papers (2020-01-14T13:11:21Z)

This list is automatically generated from the titles and abstracts of the papers on this site.