Class-Incremental Grouping Network for Continual Audio-Visual Learning
- URL: http://arxiv.org/abs/2309.05281v1
- Date: Mon, 11 Sep 2023 07:36:16 GMT
- Title: Class-Incremental Grouping Network for Continual Audio-Visual Learning
- Authors: Shentong Mo, Weiguo Pian, Yapeng Tian
- Abstract summary: We propose a class-incremental grouping network (CIGN) that can learn category-wise semantic features to achieve continual audio-visual learning.
We conduct extensive experiments on VGGSound-Instruments, VGGSound-100, and VGG-Sound Sources benchmarks.
Our experimental results demonstrate that the CIGN achieves state-of-the-art audio-visual class-incremental learning performance.
- Score: 42.284785756540806
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Continual learning is a challenging problem in which models need to be
trained on non-stationary data across sequential tasks for class-incremental
learning. While previous methods have focused on using either regularization or
rehearsal-based frameworks to alleviate catastrophic forgetting in image
classification, they are limited to a single modality and cannot learn compact
class-aware cross-modal representations for continual audio-visual learning. To
address this gap, we propose a novel class-incremental grouping network (CIGN)
that can learn category-wise semantic features to achieve continual
audio-visual learning. Our CIGN leverages learnable audio-visual class tokens
and audio-visual grouping to continually aggregate class-aware features.
Additionally, it utilizes class-token distillation and continual grouping to
prevent forgetting of parameters learned from previous tasks, thereby improving
the model's ability to capture discriminative audio-visual categories. We
conduct extensive experiments on VGGSound-Instruments, VGGSound-100, and
VGG-Sound Sources benchmarks. Our experimental results demonstrate that the
CIGN achieves state-of-the-art audio-visual class-incremental learning
performance. Code is available at https://github.com/stoneMo/CIGN.
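The abstract only sketches the mechanism, so the snippet below illustrates one way learnable class tokens can group fused audio-visual features via cross-attention and be regularized with a token-distillation term, as described above. This is a minimal sketch under assumed module names, feature shapes, and hyperparameters (ClassTokenGrouping, dim=256, four attention heads are all illustrative choices), not the authors' released implementation; see the linked repository for the actual code.

```python
# Minimal sketch (not the authors' code): learnable class tokens that group
# audio-visual features via cross-attention, plus a simple distillation loss
# that keeps previously learned tokens close to their frozen copies.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ClassTokenGrouping(nn.Module):
    """Aggregate audio-visual features into per-class slots via cross-attention."""

    def __init__(self, num_classes: int, dim: int = 256):
        super().__init__()
        # One learnable token per class seen so far (grown as new tasks arrive).
        self.class_tokens = nn.Parameter(torch.randn(num_classes, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, av_feats: torch.Tensor) -> torch.Tensor:
        # av_feats: (B, N, dim) fused audio + visual token features.
        B = av_feats.size(0)
        queries = self.class_tokens.unsqueeze(0).repeat(B, 1, 1)      # (B, C, dim)
        grouped, _ = self.attn(queries, av_feats, av_feats)           # (B, C, dim)
        return self.norm(grouped + queries)                           # class-aware slots


def token_distillation_loss(new_tokens: torch.Tensor,
                            old_tokens: torch.Tensor) -> torch.Tensor:
    # Penalize drift of tokens belonging to previously learned classes,
    # a common way to limit forgetting across incremental tasks.
    c_old = old_tokens.size(0)
    return F.mse_loss(new_tokens[:c_old], old_tokens.detach())


if __name__ == "__main__":
    model = ClassTokenGrouping(num_classes=10, dim=256)
    feats = torch.randn(2, 32, 256)           # fake fused audio-visual tokens
    slots = model(feats)                       # (2, 10, 256) class-aware features
    frozen = model.class_tokens[:5].clone()    # tokens saved after a previous task
    loss = token_distillation_loss(model.class_tokens, frozen)
    print(slots.shape, loss.item())
```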
Related papers
- Audio-visual Generalized Zero-shot Learning the Easy Way [20.60905505473906]
We introduce EZ-AVGZL, which aligns audio-visual embeddings with transformed text representations.
We conduct extensive experiments on VGGSound-GZSL, UCF-GZSL, and ActivityNet-GZSL benchmarks.
arXiv Detail & Related papers (2024-07-18T01:57:16Z)
- Audio-Visual Generalized Zero-Shot Learning using Pre-Trained Large Multi-Modal Models [53.48409081555687]
In this work, we explore such large pre-trained models to obtain features, i.e., CLIP for visual features and CLAP for audio features.
We propose a simple yet effective model that only relies on feed-forward neural networks.
Our framework achieves state-of-the-art performance on VGGSound-GZSL, UCF-GZSL, and ActivityNet-GZSL.
arXiv Detail & Related papers (2024-04-09T13:39:37Z)
- Boosting Audio-visual Zero-shot Learning with Large Language Models [32.533844163120875]
We introduce a framework called KnowleDge-Augmented audio-visual learning (KDA).
Our proposed KDA can outperform state-of-the-art methods on three popular audio-visual zero-shot learning datasets.
arXiv Detail & Related papers (2023-11-21T01:18:23Z)
- Text-to-feature diffusion for audio-visual few-shot learning [59.45164042078649]
Few-shot learning from video data is a challenging and underexplored, yet much cheaper, setup.
We introduce a unified audio-visual few-shot video classification benchmark on three datasets.
We show that AV-DIFF obtains state-of-the-art performance on our proposed benchmark for audio-visual few-shot learning.
arXiv Detail & Related papers (2023-09-07T17:30:36Z)
- Audio-Visual Class-Incremental Learning [43.5426465012738]
We introduce audio-visual class-incremental learning, a class-incremental learning scenario for audio-visual video recognition.
Our experiments on AVE-CI, K-S-CI, and VS100-CI demonstrate that AV-CIL significantly outperforms existing class-incremental learning methods.
arXiv Detail & Related papers (2023-08-21T22:43:47Z)
- Continual Learning for On-Device Speech Recognition using Disentangled Conformers [54.32320258055716]
We introduce a continual learning benchmark for speaker-specific domain adaptation derived from LibriVox audiobooks.
We propose a novel compute-efficient continual learning algorithm called DisentangledCL.
Our experiments show that the DisConformer models significantly outperform baselines on general ASR.
arXiv Detail & Related papers (2022-12-02T18:58:51Z)
- SLICER: Learning universal audio representations using low-resource self-supervised pre-training [53.06337011259031]
We present a new Self-Supervised Learning approach to pre-train encoders on unlabeled audio data.
Our primary aim is to learn audio representations that can generalize across a large variety of speech and non-speech tasks.
arXiv Detail & Related papers (2022-11-02T23:45:33Z)
- Temporal and cross-modal attention for audio-visual zero-shot learning [38.02396786726476]
Generalised zero-shot learning for video classification requires understanding the relations between the audio and visual information.
We propose a multi-modal and temporal cross-attention framework for audio-visual generalised zero-shot learning.
We show that our proposed framework that ingests temporal features yields state-of-the-art performance on the UCF, VGGSound, and ActivityNet benchmarks for (generalised) zero-shot learning.
arXiv Detail & Related papers (2022-07-20T15:19:30Z)
- Audio-visual Generalised Zero-shot Learning with Cross-modal Attention and Language [38.02396786726476]
We propose to learn multi-modal representations from audio-visual data using cross-modal attention.
In our generalised audio-visual zero-shot learning setting, we include all the training classes in the test-time search space.
Due to the lack of a unified benchmark in this domain, we introduce a (generalised) zero-shot learning benchmark on three audio-visual datasets.
arXiv Detail & Related papers (2022-03-07T18:52:13Z)
- vCLIMB: A Novel Video Class Incremental Learning Benchmark [53.90485760679411]
We introduce vCLIMB, a novel video continual learning benchmark.
vCLIMB is a standardized test-bed to analyze catastrophic forgetting of deep models in video continual learning.
We propose a temporal consistency regularization that can be applied on top of memory-based continual learning methods.
arXiv Detail & Related papers (2022-01-23T22:14:17Z)