Continual Audio-Visual Sound Separation
- URL: http://arxiv.org/abs/2411.02860v1
- Date: Tue, 05 Nov 2024 07:09:14 GMT
- Title: Continual Audio-Visual Sound Separation
- Authors: Weiguo Pian, Yiyang Nan, Shijian Deng, Shentong Mo, Yunhui Guo, Yapeng Tian,
- Abstract summary: We introduce a novel continual audio-visual sound separation task, aiming to continuously separate sound sources for new classes.
We propose a novel Cross-modal Similarity Distillation Constraint (CrossSDC) to uphold the cross-modal semantic similarity through incremental tasks.
Experiments demonstrate that ContAV-Sep can effectively mitigate catastrophic forgetting and achieve significantly better performance compared to other continual learning baselines.
- Score: 35.06195539944879
- License:
- Abstract: In this paper, we introduce a novel continual audio-visual sound separation task, aiming to continuously separate sound sources for new classes while preserving performance on previously learned classes, with the aid of visual guidance. This problem is crucial for practical visually guided auditory perception as it can significantly enhance the adaptability and robustness of audio-visual sound separation models, making them more applicable for real-world scenarios where encountering new sound sources is commonplace. The task is inherently challenging as our models must not only effectively utilize information from both modalities in current tasks but also preserve their cross-modal association in old tasks to mitigate catastrophic forgetting during audio-visual continual learning. To address these challenges, we propose a novel approach named ContAV-Sep (\textbf{Cont}inual \textbf{A}udio-\textbf{V}isual Sound \textbf{Sep}aration). ContAV-Sep presents a novel Cross-modal Similarity Distillation Constraint (CrossSDC) to uphold the cross-modal semantic similarity through incremental tasks and retain previously acquired knowledge of semantic similarity in old models, mitigating the risk of catastrophic forgetting. The CrossSDC can seamlessly integrate into the training process of different audio-visual sound separation frameworks. Experiments demonstrate that ContAV-Sep can effectively mitigate catastrophic forgetting and achieve significantly better performance compared to other continual learning baselines for audio-visual sound separation. Code is available at: \url{https://github.com/weiguoPian/ContAV-Sep_NeurIPS2024}.
Related papers
- Label-anticipated Event Disentanglement for Audio-Visual Video Parsing [61.08434062821899]
We introduce a new decoding paradigm, underlinelabel sunderlineemunderlineantic-based underlineprojection (LEAP)
LEAP works by iteratively projecting encoded latent features of audio/visual segments onto semantically independent label embeddings.
To facilitate the LEAP paradigm, we propose a semantic-aware optimization strategy, which includes a novel audio-visual semantic similarity loss function.
arXiv Detail & Related papers (2024-07-11T01:57:08Z) - Progressive Confident Masking Attention Network for Audio-Visual Segmentation [8.591836399688052]
A challenging problem known as Audio-Visual has emerged, intending to produce segmentation maps for sounding objects within a scene.
We introduce a novel Progressive Confident Masking Attention Network (PMCANet)
It leverages attention mechanisms to uncover the intrinsic correlations between audio signals and visual frames.
arXiv Detail & Related papers (2024-06-04T14:21:41Z) - Bootstrapping Audio-Visual Segmentation by Strengthening Audio Cues [75.73217916395386]
We propose a Bidirectional Audio-Visual Decoder (BAVD) with integrated bidirectional bridges.
This interaction narrows the modality imbalance, facilitating more effective learning of integrated audio-visual representations.
We also present a strategy for audio-visual frame-wise synchrony as fine-grained guidance of BAVD.
arXiv Detail & Related papers (2024-02-04T03:02:35Z) - Prompting Segmentation with Sound Is Generalizable Audio-Visual Source
Localizer [22.846623384472377]
We introduce the encoder-prompt-decoder paradigm to decode localization from the fused audio-visual feature.
Specifically, we first propose to construct Semantic-aware Audio Prompt (SAP) to help the visual foundation model focus on sounding objects.
We develop a Correlation Adapter (ColA) to keep minimal training efforts as well as maintain adequate knowledge of the visual foundation model.
arXiv Detail & Related papers (2023-09-13T05:43:35Z) - Leveraging Foundation models for Unsupervised Audio-Visual Segmentation [49.94366155560371]
Audio-Visual (AVS) aims to precisely outline audible objects in a visual scene at the pixel level.
Existing AVS methods require fine-grained annotations of audio-mask pairs in supervised learning fashion.
We introduce unsupervised audio-visual segmentation with no need for task-specific data annotations and model training.
arXiv Detail & Related papers (2023-09-13T05:05:47Z) - Class-Incremental Grouping Network for Continual Audio-Visual Learning [42.284785756540806]
We propose a class-incremental grouping network (CIGN) that can learn category-wise semantic features to achieve continual audio-visual learning.
We conduct extensive experiments on VGGSound-Instruments, VGGSound-100, and VGG-Sound Sources benchmarks.
Our experimental results demonstrate that the CIGN achieves state-of-the-art audio-visual class-incremental learning performance.
arXiv Detail & Related papers (2023-09-11T07:36:16Z) - DiffSED: Sound Event Detection with Denoising Diffusion [70.18051526555512]
We reformulate the SED problem by taking a generative learning perspective.
Specifically, we aim to generate sound temporal boundaries from noisy proposals in a denoising diffusion process.
During training, our model learns to reverse the noising process by converting noisy latent queries to the groundtruth versions.
arXiv Detail & Related papers (2023-08-14T17:29:41Z) - Improving Audio-Visual Speech Recognition by Lip-Subword Correlation
Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z) - Visual Sound Localization in the Wild by Cross-Modal Interference
Erasing [90.21476231683008]
In real-world scenarios, audios are usually contaminated by off-screen sound and background noise.
We propose the Interference Eraser (IEr) framework, which tackles the problem of audio-visual sound source localization in the wild.
arXiv Detail & Related papers (2022-02-13T21:06:19Z) - Learning Audio-Visual Correlations from Variational Cross-Modal
Generation [35.07257471319274]
We learn the audio-visual correlations from the perspective of cross-modal generation in a self-supervised manner.
The learned correlations can be readily applied in multiple downstream tasks such as the audio-visual cross-modal localization and retrieval.
arXiv Detail & Related papers (2021-02-05T21:27:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.