Taming Modality Entanglement in Continual Audio-Visual Segmentation
- URL: http://arxiv.org/abs/2510.17234v1
- Date: Mon, 20 Oct 2025 07:23:36 GMT
- Title: Taming Modality Entanglement in Continual Audio-Visual Segmentation
- Authors: Yuyang Hong, Qi Yang, Tao Zhang, Zili Wang, Zhaojin Fu, Kun Ding, Bin Fan, Shiming Xiang
- Abstract summary: We introduce a novel Continual Audio-Visual Segmentation (CAVS) task, aiming to continuously segment new classes guided by audio. Two critical challenges are identified: 1) multi-modal semantic drift and 2) co-occurrence confusion. A Collision-based Multi-modal Rehearsal framework is designed to address these challenges.
- Score: 30.143320890304366
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, significant progress has been made in multi-modal continual learning, which aims to learn new tasks sequentially in multi-modal settings while preserving performance on previously learned ones. However, existing methods mainly focus on coarse-grained tasks and have limitations in addressing modality entanglement in fine-grained continual learning settings. To bridge this gap, we introduce a novel Continual Audio-Visual Segmentation (CAVS) task, aiming to continuously segment new classes guided by audio. Through comprehensive analysis, two critical challenges are identified: 1) multi-modal semantic drift, where a sounding object is labeled as background in sequential tasks; 2) co-occurrence confusion, where frequently co-occurring classes tend to be confused. In this work, a Collision-based Multi-modal Rehearsal (CMR) framework is designed to address these challenges. Specifically, for multi-modal semantic drift, a Multi-modal Sample Selection (MSS) strategy is proposed to select samples with high modal consistency for rehearsal. Meanwhile, for co-occurrence confusion, a Collision-based Sample Rehearsal (CSR) mechanism is designed, which increases the rehearsal frequency of confusable classes during training. Moreover, we construct three audio-visual incremental scenarios to verify the effectiveness of our method. Comprehensive experiments demonstrate that our method significantly outperforms single-modal continual learning methods.
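The collision-based rehearsal idea described in the abstract can be illustrated with a small sketch: keep a per-class confusion ("collision") count and sample rehearsal exemplars with weights proportional to it, so frequently confused classes are replayed more often. This is only a toy illustration of the general idea, not the paper's implementation; the class, method names, and the `1 + count` weighting are assumptions.

```python
import random
from collections import defaultdict

class CollisionRehearsalBuffer:
    """Toy rehearsal buffer: classes that are often confused ("collide")
    with others are sampled for replay more frequently.

    Illustrative sketch only; names and weighting are assumptions,
    not the CMR framework's actual implementation.
    """

    def __init__(self, seed=0):
        self.exemplars = defaultdict(list)   # class -> stored samples
        self.collisions = defaultdict(int)   # class -> confusion count
        self.rng = random.Random(seed)

    def add_exemplar(self, cls, sample):
        self.exemplars[cls].append(sample)

    def record_collision(self, predicted_cls, true_cls):
        # A misprediction between two classes marks both as confusable.
        if predicted_cls != true_cls:
            self.collisions[predicted_cls] += 1
            self.collisions[true_cls] += 1

    def sample(self, k):
        # Weight each class by 1 + its collision count, so confusable
        # classes appear more often in the rehearsal batch.
        classes = list(self.exemplars)
        weights = [1 + self.collisions[c] for c in classes]
        return [
            (cls, self.rng.choice(self.exemplars[cls]))
            for cls in self.rng.choices(classes, weights=weights, k=k)
        ]
```

For example, after nine recorded dog/cat confusions, "dog" and "cat" each carry weight 10 versus 1 for an unconfused class, so their exemplars dominate the rehearsal batches.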
Related papers
- Multi-task Learning with Extended Temporal Shift Module for Temporal Action Localization [1.38120109831448]
We present our solution to the BinEgo-360 Challenge, which focuses on temporal action localization (TAL) in multi-perspective and multi-modal video settings. Our approach is built on the Temporal Shift Module (TSM), which we extend to handle TAL by introducing a background class and classifying fixed-length non-overlapping intervals. Our method is ranked first in both the initial and extended rounds of the competition, demonstrating the effectiveness of combining multi-task learning, an efficient backbone, and ensemble learning for TAL.
arXiv Detail & Related papers (2025-12-12T00:34:51Z) - HuMo: Human-Centric Video Generation via Collaborative Multi-Modal Conditioning [33.868900473146496]
We present HuMo, a framework for collaborative multimodal control. HuMo surpasses specialized state-of-the-art methods in sub-tasks.
arXiv Detail & Related papers (2025-09-10T11:54:29Z) - Cross-Modal Few-Shot Learning: a Generative Transfer Learning Framework [58.362064122489166]
This paper introduces the Cross-modal Few-Shot Learning task, which aims to recognize instances across multiple modalities while relying on scarce labeled data. We propose a Generative Transfer Learning (GTL) framework by simulating how humans abstract and generalize concepts. We show that GTL achieves state-of-the-art performance on seven multi-modal datasets spanning RGB-Sketch, RGB-Infrared, and RGB-Depth.
arXiv Detail & Related papers (2024-10-14T16:09:38Z) - A Practitioner's Guide to Continual Multimodal Pretraining [83.63894495064855]
Multimodal foundation models serve numerous applications at the intersection of vision and language. To keep models updated, research into continual pretraining mainly explores scenarios with either infrequent, indiscriminate updates on large-scale new data, or frequent, sample-level updates. We introduce FoMo-in-Flux, a continual multimodal pretraining benchmark with realistic compute constraints and practical deployment requirements.
arXiv Detail & Related papers (2024-08-26T17:59:01Z) - Multi-Stage Multi-Modal Pre-Training for Automatic Speech Recognition [10.36399200974439]
We introduce a novel method combining multi-modal and multi-task unsupervised pre-training with a translation-based supervised mid-training approach.
We empirically demonstrate that such a multi-stage approach leads to relative word error rate (WER) improvements of up to 38.45% over baselines on both Librispeech and SUPERB.
arXiv Detail & Related papers (2024-03-28T20:23:39Z) - A Multi-label Continual Learning Framework to Scale Deep Learning Approaches for Packaging Equipment Monitoring [57.5099555438223]
We study multi-label classification in the continual scenario for the first time.
We propose an efficient approach that has a logarithmic complexity with regard to the number of tasks.
We validate our approach on a real-world multi-label forecasting problem from the packaging industry.
arXiv Detail & Related papers (2022-08-08T15:58:39Z) - On Steering Multi-Annotations per Sample for Multi-Task Learning [79.98259057711044]
The study of multi-task learning has drawn great attention from the community.
Despite the remarkable progress, the challenge of optimally learning different tasks simultaneously remains to be explored.
Previous works attempt to modify the gradients from different tasks, yet these methods rely on a subjective assumption about the relationship between tasks, and the modified gradients may be less accurate.
In this paper, we introduce Stochastic Task Allocation (STA), a mechanism that addresses this issue through a task allocation approach, in which each sample is randomly allocated a subset of tasks.
For further progress, we propose Interleaved Stochastic Task Allocation (ISTA) to iteratively allocate all
arXiv Detail & Related papers (2022-03-06T11:57:18Z) - Channel Exchanging Networks for Multimodal and Multitask Dense Image Prediction [125.18248926508045]
We propose Channel-Exchanging-Network (CEN) which is self-adaptive, parameter-free, and more importantly, applicable for both multimodal fusion and multitask learning.
CEN dynamically exchanges channels between sub-networks of different modalities.
For the application of dense image prediction, the validity of CEN is tested by four different scenarios.
arXiv Detail & Related papers (2021-12-04T05:47:54Z) - An Investigation of Replay-based Approaches for Continual Learning [79.0660895390689]
Continual learning (CL) is a major challenge of machine learning (ML) and describes the ability to learn several tasks sequentially without catastrophic forgetting (CF).
Several solution classes have been proposed, of which so-called replay-based approaches seem very promising due to their simplicity and robustness.
We empirically investigate replay-based approaches of continual learning and assess their potential for applications.
arXiv Detail & Related papers (2021-08-15T15:05:02Z) - Learning Invariant Representation for Continual Learning [5.979373021392084]
A key challenge in continual learning is catastrophic forgetting of previously learned tasks when the agent faces a new one.
We propose a new pseudo-rehearsal-based method, named Learning Invariant Representation for Continual Learning (IRCL).
Disentangling the shared invariant representation helps to learn continually a sequence of tasks, while being more robust to forgetting and having better knowledge transfer.
arXiv Detail & Related papers (2021-01-15T15:12:51Z) - Cross-Modal Generalization: Learning in Low Resource Modalities via Meta-Alignment [99.29153138760417]
Cross-modal generalization is a learning paradigm to train a model that can quickly perform new tasks in a target modality.
We study a key research question: how can we ensure generalization across modalities despite using separate encoders for different source and target modalities?
Our solution is based on meta-alignment, a novel method to align representation spaces using strongly and weakly paired cross-modal data.
arXiv Detail & Related papers (2020-12-04T19:27:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.