Audio-Visual Class-Incremental Learning
- URL: http://arxiv.org/abs/2308.11073v3
- Date: Sun, 15 Oct 2023 00:05:28 GMT
- Title: Audio-Visual Class-Incremental Learning
- Authors: Weiguo Pian, Shentong Mo, Yunhui Guo, Yapeng Tian
- Abstract summary: We introduce audio-visual class-incremental learning, a class-incremental learning scenario for audio-visual video recognition.
Our experiments on AVE-CI, K-S-CI, and VS100-CI demonstrate that AV-CIL significantly outperforms existing class-incremental learning methods.
- Score: 43.5426465012738
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we introduce audio-visual class-incremental learning, a
class-incremental learning scenario for audio-visual video recognition. We
demonstrate that joint audio-visual modeling can improve class-incremental
learning, but current methods fail to preserve semantic similarity between
audio and visual features as the number of incremental steps grows. Furthermore, we observe
that audio-visual correlations learned in previous tasks can be forgotten as
incremental steps progress, leading to poor performance. To overcome these
challenges, we propose AV-CIL, which incorporates Dual-Audio-Visual Similarity
Constraint (D-AVSC) to maintain both instance-aware and class-aware semantic
similarity between audio-visual modalities and Visual Attention Distillation
(VAD) to retain previously learned audio-guided visual attentive ability. We
create three audio-visual class-incremental datasets, AVE-Class-Incremental
(AVE-CI), Kinetics-Sounds-Class-Incremental (K-S-CI), and
VGGSound100-Class-Incremental (VS100-CI) based on the AVE, Kinetics-Sounds, and
VGGSound datasets, respectively. Our experiments on AVE-CI, K-S-CI, and
VS100-CI demonstrate that AV-CIL significantly outperforms existing
class-incremental learning methods in audio-visual class-incremental learning.
Code and data are available at: https://github.com/weiguoPian/AV-CIL_ICCV2023.
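The authors' actual implementation is in the linked repository. Purely as a hedged illustration of what a dual (instance-aware plus class-aware) audio-visual similarity constraint in the spirit of D-AVSC could look like, here is a minimal pure-Python sketch; every function name and the weight `lam` are assumptions for illustration, not the paper's code:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def instance_loss(audio_feats, visual_feats):
    """Instance-aware term: pull each audio embedding toward the visual
    embedding of the same clip (mean of 1 - cosine similarity)."""
    pairs = zip(audio_feats, visual_feats)
    return sum(1.0 - cosine(a, v) for a, v in pairs) / len(audio_feats)

def class_loss(audio_feats, visual_feats, labels):
    """Class-aware term: align the per-class mean audio embedding with
    the per-class mean visual embedding."""
    dim = len(audio_feats[0])
    classes = sorted(set(labels))
    total = 0.0
    for c in classes:
        idx = [i for i, y in enumerate(labels) if y == c]
        a_mean = [sum(audio_feats[i][d] for i in idx) / len(idx) for d in range(dim)]
        v_mean = [sum(visual_feats[i][d] for i in idx) / len(idx) for d in range(dim)]
        total += 1.0 - cosine(a_mean, v_mean)
    return total / len(classes)

def dual_similarity_loss(audio_feats, visual_feats, labels, lam=0.5):
    """Weighted sum of the instance-aware and class-aware terms
    (lam is an illustrative weight, not a value from the paper)."""
    return (instance_loss(audio_feats, visual_feats)
            + lam * class_loss(audio_feats, visual_feats, labels))
```

In this sketch, perfectly aligned audio and visual embeddings give a loss of zero at both the instance level and the class level, while misaligned modalities are penalized twice: once per clip and once per class centroid.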
Related papers
- Siamese Vision Transformers are Scalable Audio-visual Learners [19.916919837694802]
We investigate using an audio-visual siamese network (AVSiam) for efficient and scalable audio-visual pretraining.
Our framework uses a single shared vision transformer backbone to process audio and visual inputs.
Our method can robustly handle audio, visual, and audio-visual inputs with a single shared ViT backbone.
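The core idea of the blurb above is parameter sharing: one backbone serves both modalities. As a toy, hedged sketch (the tokenizers, weights, and function names below are invented for illustration and are not AVSiam's implementation), the pattern looks like this:

```python
def linear(tokens, weights, bias):
    """Toy stand-in for a shared backbone layer: one parameter set
    applied identically to tokens from either modality."""
    return [sum(w * t for w, t in zip(weights, tok)) + bias for tok in tokens]

def tokenize_audio(spectrogram_rows):
    """Hypothetical audio tokenizer: each spectrogram row becomes a token."""
    return [list(row) for row in spectrogram_rows]

def tokenize_visual(frame_patches):
    """Hypothetical visual tokenizer: each image patch becomes a token."""
    return [list(p) for p in frame_patches]

# A single shared parameter set serves both modalities.
SHARED_W, SHARED_B = [0.5, -0.25], 0.1

def encode(modality_tokens):
    """Run either modality's tokens through the same shared 'backbone'."""
    return linear(modality_tokens, SHARED_W, SHARED_B)
```

Because `encode` holds the only parameters, audio and visual inputs of the same token shape are processed by identical weights, which is what makes such a design parameter-efficient.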
arXiv Detail & Related papers (2024-03-28T17:52:24Z)
- Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer [59.57249127943914]
We present a multilingual Audio-Visual Speech Recognition model incorporating several enhancements to improve performance and audio noise robustness.
We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets.
Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching WER of 0.8%.
arXiv Detail & Related papers (2024-03-14T01:16:32Z)
- Bootstrapping Audio-Visual Segmentation by Strengthening Audio Cues [75.73217916395386]
We propose a Bidirectional Audio-Visual Decoder (BAVD) with integrated bidirectional bridges.
This interaction narrows the modality imbalance, facilitating more effective learning of integrated audio-visual representations.
We also present a strategy for audio-visual frame-wise synchrony as fine-grained guidance of BAVD.
arXiv Detail & Related papers (2024-02-04T03:02:35Z)
- Class-Incremental Grouping Network for Continual Audio-Visual Learning [42.284785756540806]
We propose a class-incremental grouping network (CIGN) that can learn category-wise semantic features to achieve continual audio-visual learning.
We conduct extensive experiments on VGGSound-Instruments, VGGSound-100, and VGG-Sound Sources benchmarks.
Our experimental results demonstrate that the CIGN achieves state-of-the-art audio-visual class-incremental learning performance.
arXiv Detail & Related papers (2023-09-11T07:36:16Z)
- AKVSR: Audio Knowledge Empowered Visual Speech Recognition by Compressing Audio Knowledge of a Pretrained Model [53.492751392755636]
We propose an Audio Knowledge empowered Visual Speech Recognition framework (AKVSR) to complement the insufficient speech information of visual modality by using audio modality.
We validate the effectiveness of the proposed method through extensive experiments, and achieve new state-of-the-art performances on the widely-used LRS3 dataset.
arXiv Detail & Related papers (2023-08-15T06:38:38Z)
- Contrastive Audio-Visual Masked Autoencoder [85.53776628515561]
Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE)
Our fully self-supervised pretrained CAV-MAE achieves a new SOTA accuracy of 65.9% on VGGSound.
arXiv Detail & Related papers (2022-10-02T07:29:57Z)
- Towards Generalisable Audio Representations for Audio-Visual Navigation [18.738943602529805]
In audio-visual navigation (AVN), an intelligent agent needs to navigate to a constantly sound-making object in complex 3D environments.
We propose a contrastive learning-based method to tackle this challenge by regularising the audio encoder.
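The blurb above mentions regularising the audio encoder with contrastive learning. As a generic, hedged sketch of such an objective (an InfoNCE-style loss, not necessarily the paper's exact formulation; all names are illustrative), in plain Python:

```python
import math

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style contrastive loss: the anchor embedding should score
    higher (dot product) against its positive than against any negative.
    Returns the negative log-softmax of the positive's similarity."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    # Temperature-scaled similarity of the positive first, then negatives.
    logits = [dot(anchor, positive) / temperature]
    logits += [dot(anchor, n) / temperature for n in negatives]

    # Numerically stable log-sum-exp for the softmax denominator.
    m = max(logits)
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denom - logits[0]
```

The loss approaches zero when the anchor is far more similar to its positive than to any negative, and grows when a negative outranks the positive, which is the pressure that regularises the encoder toward discriminative, generalisable representations.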
arXiv Detail & Related papers (2022-06-01T11:00:07Z)
- Learning Speech Representations from Raw Audio by Joint Audiovisual Self-Supervision [63.564385139097624]
We propose a method to learn self-supervised speech representations from the raw audio waveform.
We train a raw audio encoder by combining audio-only self-supervision (by predicting informative audio attributes) with visual self-supervision (by generating talking faces from audio).
Our results demonstrate the potential of multimodal self-supervision in audiovisual speech for learning good audio representations.
arXiv Detail & Related papers (2020-07-08T14:07:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.