Audio-Visual Scene Classification Using A Transfer Learning Based Joint
Optimization Strategy
- URL: http://arxiv.org/abs/2204.11420v1
- Date: Mon, 25 Apr 2022 03:37:02 GMT
- Title: Audio-Visual Scene Classification Using A Transfer Learning Based Joint
Optimization Strategy
- Authors: Chengxin Chen, Meng Wang, Pengyuan Zhang
- Abstract summary: We propose a joint training framework, using the acoustic features and raw images directly as inputs for the AVSC task.
Specifically, we retrieve the bottom layers of pre-trained image models as the visual encoder, and jointly optimize the scene classifier and a 1D-CNN-based acoustic encoder during training.
- Score: 26.975596225131824
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, audio-visual scene classification (AVSC) has attracted increasing
attention from multidisciplinary communities. Previous studies tended to adopt
a pipeline training strategy, which uses well-trained visual and acoustic
encoders to extract high-level representations (embeddings) first, then
utilizes them to train the audio-visual classifier. In this way, the extracted
embeddings are well suited for uni-modal classifiers, but not necessarily
suited for multi-modal ones. In this paper, we propose a joint training
framework, using the acoustic features and raw images directly as inputs for
the AVSC task. Specifically, we retrieve the bottom layers of pre-trained image
models as the visual encoder, and jointly optimize the scene classifier and a
1D-CNN-based acoustic encoder during training. We evaluate the approach on the
development dataset of TAU Urban Audio-Visual Scenes 2021. The experimental
results show that our proposed approach achieves significant improvement over
the conventional pipeline training strategy. Moreover, our best single system
outperforms previous state-of-the-art methods, yielding a log loss of 0.1517
and accuracy of 94.59% on the official test fold.
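To make the joint optimization strategy concrete, the following is a minimal PyTorch sketch of the architecture described above. It is an illustration, not the authors' released code: the ResNet-50 backbone, the cut point after its second residual stage, and all layer sizes are assumptions (TAU Urban Audio-Visual Scenes has ten scene classes).

```python
import torch
import torch.nn as nn
import torchvision.models as models

class JointAVSC(nn.Module):
    """Sketch of the joint training idea: the bottom layers of a
    pre-trained image model serve as the visual encoder, a 1D-CNN
    encodes frame-level acoustic features, and one classification
    loss optimizes everything end to end."""
    def __init__(self, num_classes=10, audio_dim=64):
        super().__init__()
        # Bottom layers of a pre-trained image model (the backbone
        # and the cut point are illustrative assumptions).
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.visual_encoder = nn.Sequential(*list(resnet.children())[:6])
        self.visual_pool = nn.AdaptiveAvgPool2d(1)
        # 1D-CNN acoustic encoder over acoustic feature frames,
        # e.g. log-mel vectors of dimension audio_dim.
        self.acoustic_encoder = nn.Sequential(
            nn.Conv1d(audio_dim, 128, kernel_size=5, padding=2),
            nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, 256, kernel_size=5, padding=2),
            nn.BatchNorm1d(256), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        # Scene classifier over the concatenated embeddings
        # (the second residual stage of ResNet-50 outputs 512 channels).
        self.classifier = nn.Linear(512 + 256, num_classes)

    def forward(self, image, audio):
        # image: (B, 3, H, W); audio: (B, audio_dim, T)
        v = self.visual_pool(self.visual_encoder(image)).flatten(1)
        a = self.acoustic_encoder(audio).flatten(1)
        return self.classifier(torch.cat([v, a], dim=1))
```

The essential difference from pipeline training is that a single cross-entropy loss on the scene labels backpropagates through the classifier and both encoders, so the embeddings are shaped for the multi-modal task rather than frozen in their uni-modal form.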
Related papers
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation
Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) that devotes its main training parameters to multiple cross-modal attention layers, making full use of complementary information across modalities.
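A rough sketch of what one audio-guided cross-modal attention layer computes, with the audio stream querying the visual sequence; shapes, dimensions, and the residual placement are illustrative assumptions, not the paper's exact design:

```python
import torch.nn as nn

class AudioGuidedFusion(nn.Module):
    """Hypothetical audio-guided cross-modal attention layer:
    audio frames act as queries over visual keys/values."""
    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio_feats, visual_feats):
        # audio_feats: (B, Ta, dim); visual_feats: (B, Tv, dim)
        fused, _ = self.attn(audio_feats, visual_feats, visual_feats)
        return self.norm(audio_feats + fused)  # residual connection
```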
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
- Towards All-in-one Pre-training via Maximizing Multi-modal Mutual
Information [77.80071279597665]
We propose an all-in-one single-stage pre-training approach, named Maximizing Multi-modal Mutual Information Pre-training (M3I Pre-training).
Our approach achieves better performance than previous pre-training methods on various vision benchmarks, including ImageNet classification, object detection, LVIS long-tailed object detection, and ADE20k semantic segmentation.
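Mutual-information objectives of this kind are typically optimized through a tractable lower bound such as InfoNCE. The generic sketch below shows that bound, not the paper's exact M3I objective:

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.07):
    """Generic InfoNCE loss, a standard lower bound on the mutual
    information between two views/modalities. z1, z2: (B, D)
    embeddings; matching row indices are the positive pairs."""
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)
```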
arXiv Detail & Related papers (2022-11-17T18:59:49Z)
- Contrastive Audio-Visual Masked Autoencoder [85.53776628515561]
We propose the Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE), which combines contrastive learning and masked data modeling to learn joint audio-visual representations.
Our fully self-supervised pretrained CAV-MAE achieves a new SOTA accuracy of 65.9% on VGGSound.
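Schematically, the method sums a contrastive term that aligns paired audio and visual embeddings with a masked-reconstruction term. In this self-contained sketch, the loss weighting, temperature, and masking details are assumptions:

```python
import torch
import torch.nn.functional as F

def cav_mae_style_loss(a_emb, v_emb, recon, target, mask,
                       lam=0.01, temperature=0.07):
    """Schematic combination of a contrastive alignment loss and a
    masked-patch reconstruction loss; hyperparameters are assumed."""
    # Contrastive term: paired audio/visual embeddings are positives.
    a = F.normalize(a_emb, dim=1)
    v = F.normalize(v_emb, dim=1)
    logits = a @ v.t() / temperature
    labels = torch.arange(a.size(0), device=a.device)
    contrastive = F.cross_entropy(logits, labels)
    # Reconstruction term: MSE only over masked patches.
    # recon/target: (B, N, D); mask: (B, N) with 1 = masked.
    per_patch = F.mse_loss(recon, target, reduction='none').mean(-1)
    recon_loss = (per_patch * mask).sum() / mask.sum()
    return lam * contrastive + recon_loss
```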
arXiv Detail & Related papers (2022-10-02T07:29:57Z)
- Prompt-based Learning for Unpaired Image Captioning [86.44188293709307]
Unpaired Image Captioning (UIC) has been developed to learn image descriptions from unaligned vision-language samples.
Recent successes of Vision-Language Pre-Trained Models (VL-PTMs) have triggered the development of prompt-based learning.
In this paper, we present a novel prompt-based scheme to train the UIC model, making the best use of the powerful generalization ability of VL-PTMs.
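As a toy illustration of prompt-based conditioning (the template, object tags, and helper below are hypothetical, not the paper's actual prompts):

```python
def build_caption_prompt(objects):
    """Hypothetical prompt template: detected object tags condition
    the caption decoder of a vision-language pre-trained model."""
    return "A picture of " + ", ".join(objects) + ". Caption:"

# The VL-PTM would continue this prefix into a full caption.
prompt = build_caption_prompt(["a dog", "a frisbee"])
```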
arXiv Detail & Related papers (2022-05-26T03:13:43Z)
- A study on joint modeling and data augmentation of multi-modalities for
audio-visual scene classification [64.59834310846516]
We propose two techniques, namely joint modeling and data augmentation, to improve system performance for audio-visual scene classification (AVSC).
Our final system can achieve the best accuracy of 94.2% among all single AVSC systems submitted to DCASE 2021 Task 1b.
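One way to augment paired audio-visual data, shown purely as an illustration (the paper's actual augmentations may differ), is mixup with a shared coefficient so the audio-image correspondence is preserved:

```python
import torch

def av_mixup(audio, image, labels, alpha=0.4):
    """Illustrative paired mixup: the same mixing coefficient and
    permutation are applied to both modalities, keeping each mixed
    audio clip aligned with its mixed image."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(audio.size(0))
    mixed_audio = lam * audio + (1 - lam) * audio[perm]
    mixed_image = lam * image + (1 - lam) * image[perm]
    # Interpolate the loss between labels and labels[perm] with lam.
    return mixed_audio, mixed_image, labels, labels[perm], lam
```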
arXiv Detail & Related papers (2022-03-07T07:29:55Z)
- Unsupervised Discriminative Learning of Sounds for Audio Event
Classification [43.81789898864507]
Neural network-based audio event classification has shown the benefit of pre-training models on visual data such as ImageNet.
We show a fast and effective alternative that pre-trains the model unsupervised, only on audio data and yet delivers on-par performance with ImageNet pre-training.
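A common recipe for such audio-only unsupervised pre-training is instance discrimination over two augmented views of each clip; the sketch below is a generic stand-in, not necessarily this paper's exact procedure:

```python
import torch
import torch.nn.functional as F

def unsupervised_step(model, spec_batch, augment, optimizer,
                      temperature=0.07):
    """One instance-discrimination step on unlabeled audio: two
    augmented views of each clip are embedded and each view must
    identify its counterpart among the batch."""
    z1 = F.normalize(model(augment(spec_batch)), dim=1)
    z2 = F.normalize(model(augment(spec_batch)), dim=1)
    logits = z1 @ z2.t() / temperature
    labels = torch.arange(z1.size(0), device=z1.device)
    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```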
arXiv Detail & Related papers (2021-05-19T17:42:03Z)
- Single-Layer Vision Transformers for More Accurate Early Exits with Less
Overhead [88.17413955380262]
We introduce a novel architecture for early exiting based on the vision transformer architecture.
We show that our method works for both classification and regression problems.
We also introduce a novel method for integrating audio and visual modalities within early exits in audiovisual data analysis.
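A minimal sketch of an early-exit branch in this spirit: a single transformer layer plus a classifier attached at an intermediate depth, with a hypothetical confidence rule (all sizes are assumptions):

```python
import torch
import torch.nn as nn

class EarlyExitHead(nn.Module):
    """Sketch of an early-exit branch: one transformer encoder layer
    and a linear classifier over intermediate backbone tokens."""
    def __init__(self, dim=256, num_classes=10, num_heads=4):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, tokens):
        # tokens: (B, N, dim) intermediate features from the backbone
        return self.head(self.block(tokens).mean(dim=1))

def should_exit(logits, threshold=0.9):
    """Exit early when softmax confidence clears the threshold."""
    return torch.softmax(logits, dim=-1).max(dim=-1).values >= threshold
```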
arXiv Detail & Related papers (2021-05-19T13:30:34Z)
- SoundCLR: Contrastive Learning of Representations For Improved
Environmental Sound Classification [0.6767885381740952]
SoundCLR is a supervised contrastive learning method for environmental sound classification with state-of-the-art performance.
Due to the comparatively small sizes of the available environmental sound datasets, we propose and exploit a transfer learning and strong data augmentation pipeline.
Our experiments show that our masking based augmentation technique on the log-mel spectrograms can significantly improve the recognition performance.
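Such masking is typically SpecAugment-style; a minimal sketch, with mask widths that are illustrative and assumed smaller than the spectrogram:

```python
import torch

def mask_log_mel(spec, max_freq=12, max_time=24):
    """SpecAugment-style masking on a log-mel spectrogram of shape
    (freq_bins, time_frames): zero one random frequency band and
    one random time span."""
    spec = spec.clone()
    f = torch.randint(0, max_freq + 1, (1,)).item()
    f0 = torch.randint(0, spec.size(0) - f + 1, (1,)).item()
    spec[f0:f0 + f, :] = 0.0
    t = torch.randint(0, max_time + 1, (1,)).item()
    t0 = torch.randint(0, spec.size(1) - t + 1, (1,)).item()
    spec[:, t0:t0 + t] = 0.0
    return spec
```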
arXiv Detail & Related papers (2021-03-02T18:42:45Z)