Audio-Visual Class-Incremental Learning
- URL: http://arxiv.org/abs/2308.11073v3
- Date: Sun, 15 Oct 2023 00:05:28 GMT
- Title: Audio-Visual Class-Incremental Learning
- Authors: Weiguo Pian, Shentong Mo, Yunhui Guo, Yapeng Tian
- Abstract summary: We introduce audio-visual class-incremental learning, a class-incremental learning scenario for audio-visual video recognition.
Our experiments on AVE-CI, K-S-CI, and VS100-CI demonstrate that AV-CIL significantly outperforms existing class-incremental learning methods.
- Score: 43.5426465012738
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we introduce audio-visual class-incremental learning, a
class-incremental learning scenario for audio-visual video recognition. We
demonstrate that joint audio-visual modeling can improve class-incremental
learning, but current methods fail to preserve semantic similarity between
audio and visual features as the number of incremental steps grows. Furthermore, we observe
that audio-visual correlations learned in previous tasks can be forgotten as
incremental steps progress, leading to poor performance. To overcome these
challenges, we propose AV-CIL, which incorporates Dual-Audio-Visual Similarity
Constraint (D-AVSC) to maintain both instance-aware and class-aware semantic
similarity between audio-visual modalities and Visual Attention Distillation
(VAD) to retain previously learned audio-guided visual attentive ability. We
create three audio-visual class-incremental datasets, AVE-Class-Incremental
(AVE-CI), Kinetics-Sounds-Class-Incremental (K-S-CI), and
VGGSound100-Class-Incremental (VS100-CI) based on the AVE, Kinetics-Sounds, and
VGGSound datasets, respectively. Our experiments on AVE-CI, K-S-CI, and
VS100-CI demonstrate that AV-CIL significantly outperforms existing
class-incremental learning methods in audio-visual class-incremental learning.
Code and data are available at: https://github.com/weiguoPian/AV-CIL_ICCV2023.
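As a concrete illustration of the Dual-Audio-Visual Similarity Constraint (D-AVSC) described above, the sketch below pairs an instance-aware term (each audio embedding matched to the visual embedding of the same clip) with a class-aware term (audio/visual pairs of the same class pulled together). The specific loss forms, temperature, and weighting are assumptions for illustration only, not the paper's exact formulation; see the linked repository for the authors' implementation.

```python
# Minimal sketch of a dual (instance-aware + class-aware) audio-visual
# similarity constraint. All design details here are assumptions.
import torch
import torch.nn.functional as F

def d_avsc_loss(audio_feats, visual_feats, labels, tau=0.07):
    """audio_feats, visual_feats: (B, D) embeddings; labels: (B,) class ids."""
    a = F.normalize(audio_feats, dim=-1)
    v = F.normalize(visual_feats, dim=-1)

    # Instance-aware term: each audio embedding should be most similar to
    # the visual embedding of the *same* clip (InfoNCE-style pairing).
    logits = a @ v.t() / tau                      # (B, B) cross-modal similarities
    targets = torch.arange(a.size(0), device=a.device)
    inst_loss = 0.5 * (F.cross_entropy(logits, targets)
                       + F.cross_entropy(logits.t(), targets))

    # Class-aware term: pull audio/visual pairs from the same class together.
    same_class = labels.unsqueeze(0).eq(labels.unsqueeze(1)).float()  # (B, B)
    cls_loss = ((1.0 - a @ v.t()) * same_class).sum() / same_class.sum().clamp(min=1)

    return inst_loss + cls_loss
```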
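Visual Attention Distillation (VAD) can similarly be sketched as keeping the current model's audio-guided visual attention close to the attention produced by the model frozen after the previous incremental step. The KL-divergence objective and the attention-map shapes below are assumptions, not the authors' exact implementation.

```python
# Minimal sketch of distilling audio-guided visual attention from the
# frozen previous-step model into the current model. Assumed design.
import torch
import torch.nn.functional as F

def vad_loss(curr_attn, prev_attn, tau=1.0):
    """curr_attn, prev_attn: (B, N) unnormalized attention scores over N
    visual regions; prev_attn comes from the frozen previous-step model."""
    p_old = F.softmax(prev_attn.detach() / tau, dim=-1)   # teacher distribution
    log_p_new = F.log_softmax(curr_attn / tau, dim=-1)    # student log-probs
    # KL(old || new): penalize drifting away from previously learned attention.
    return F.kl_div(log_p_new, p_old, reduction="batchmean") * tau * tau
```

In training, both terms would typically be added, with tuned weights, to the standard classification loss at each incremental step.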
Related papers
- Continual Audio-Visual Sound Separation [35.06195539944879]
We introduce a novel continual audio-visual sound separation task, aiming to continuously separate sound sources for new classes.
We propose a novel Cross-modal Similarity Distillation Constraint (CrossSDC) to uphold the cross-modal semantic similarity through incremental tasks.
Experiments demonstrate that ContAV-Sep can effectively mitigate catastrophic forgetting and achieve significantly better performance compared to other continual learning baselines.
arXiv Detail & Related papers (2024-11-05T07:09:14Z) - From Vision to Audio and Beyond: A Unified Model for Audio-Visual Representation and Generation [17.95017332858846]
We introduce a novel framework called Vision to Audio and Beyond (VAB) to bridge the gap between audio-visual representation learning and vision-to-audio generation.
VAB uses a pre-trained audio tokenizer and an image encoder to obtain audio tokens and visual features, respectively.
Our experiments showcase the efficiency of VAB in producing high-quality audio from video, and its capability to acquire semantic audio-visual features.
arXiv Detail & Related papers (2024-09-27T20:26:34Z) - Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer [59.57249127943914]
We present a multilingual Audio-Visual Speech Recognition model incorporating several enhancements to improve performance and audio noise robustness.
We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets.
Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching a WER of 0.8%.
arXiv Detail & Related papers (2024-03-14T01:16:32Z) - Bootstrapping Audio-Visual Segmentation by Strengthening Audio Cues [75.73217916395386]
We propose a Bidirectional Audio-Visual Decoder (BAVD) with integrated bidirectional bridges.
This interaction narrows the modality imbalance, facilitating more effective learning of integrated audio-visual representations.
We also present a strategy for audio-visual frame-wise synchrony as fine-grained guidance of BAVD.
arXiv Detail & Related papers (2024-02-04T03:02:35Z) - Class-Incremental Grouping Network for Continual Audio-Visual Learning [42.284785756540806]
We propose a class-incremental grouping network (CIGN) that can learn category-wise semantic features to achieve continual audio-visual learning.
We conduct extensive experiments on VGGSound-Instruments, VGGSound-100, and VGG-Sound Sources benchmarks.
Our experimental results demonstrate that the CIGN achieves state-of-the-art audio-visual class-incremental learning performance.
arXiv Detail & Related papers (2023-09-11T07:36:16Z) - AKVSR: Audio Knowledge Empowered Visual Speech Recognition by
Compressing Audio Knowledge of a Pretrained Model [53.492751392755636]
We propose an Audio Knowledge empowered Visual Speech Recognition framework (AKVSR) that complements the insufficient speech information of the visual modality with the audio modality.
We validate the effectiveness of the proposed method through extensive experiments and achieve new state-of-the-art performance on the widely used LRS3 dataset.
arXiv Detail & Related papers (2023-08-15T06:38:38Z) - Contrastive Audio-Visual Masked Autoencoder [85.53776628515561]
We propose the Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE), which combines contrastive learning and masked data modeling to learn a joint audio-visual representation.
Our fully self-supervised pretrained CAV-MAE achieves a new SOTA accuracy of 65.9% on VGGSound.
arXiv Detail & Related papers (2022-10-02T07:29:57Z) - Learning Speech Representations from Raw Audio by Joint Audiovisual
Self-Supervision [63.564385139097624]
We propose a method to learn self-supervised speech representations from the raw audio waveform.
We train a raw audio encoder by combining audio-only self-supervision (by predicting informative audio attributes) with visual self-supervision (by generating talking faces from audio).
Our results demonstrate the potential of multimodal self-supervision in audiovisual speech for learning good audio representations.
arXiv Detail & Related papers (2020-07-08T14:07:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.