A study on joint modeling and data augmentation of multi-modalities for
audio-visual scene classification
- URL: http://arxiv.org/abs/2203.04114v1
- Date: Mon, 7 Mar 2022 07:29:55 GMT
- Title: A study on joint modeling and data augmentation of multi-modalities for
audio-visual scene classification
- Authors: Qing Wang, Jun Du, Siyuan Zheng, Yunqing Li, Yajian Wang, Yuzhong Wu,
Hu Hu, Chao-Han Huck Yang, Sabato Marco Siniscalchi, Yannan Wang, Chin-Hui
Lee
- Abstract summary: We propose two techniques, namely joint modeling and data augmentation, to improve system performance for audio-visual scene classification (AVSC). Our final system achieves the best accuracy, 94.2%, among all single AVSC systems submitted to DCASE 2021 Task 1b.
- Score: 64.59834310846516
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose two techniques, namely joint modeling and data augmentation, to improve system performance for audio-visual scene classification (AVSC). We employ networks pre-trained only on image data sets to extract video embeddings, whereas the audio embedding models are trained from scratch. We explore different neural network architectures for joint modeling to effectively combine the video and audio modalities. Moreover, data augmentation strategies are investigated to increase the size of the audio-visual training set. For the video modality, the effectiveness of several operations in RandAugment is verified. An audio-video joint mixup scheme is proposed to further improve AVSC performance. Evaluated on the development set of TAU Urban Audio Visual Scenes 2021, our final system achieves the best accuracy (94.2%) among all single AVSC systems submitted to DCASE 2021 Task 1b.
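Neither the joint mixup scheme nor the exact augmentation settings are spelled out beyond the abstract, but both ideas are simple to sketch. The PyTorch snippet below shows torchvision's off-the-shelf RandAugment for video frames and a minimal joint mixup in which a single Beta-distributed weight is shared by the audio features, the video features, and the labels; all tensor names, shapes, and hyperparameters are illustrative assumptions, not the authors' implementation.

```python
import torch
from torchvision import transforms

# Off-the-shelf RandAugment for video frames (torchvision >= 0.11);
# the paper verifies a subset of its operations, not replicated here.
video_augment = transforms.RandAugment(num_ops=2, magnitude=9)

def joint_mixup(audio, video, labels, alpha=0.2):
    """Audio-video joint mixup: one Beta(alpha, alpha) weight shared by both modalities.

    audio:  (batch, ...) audio features or spectrograms (illustrative)
    video:  (batch, ...) video embeddings or frames (illustrative)
    labels: (batch, num_classes) one-hot scene labels
    """
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    index = torch.randperm(audio.size(0))  # random partner for each sample
    mixed_audio = lam * audio + (1.0 - lam) * audio[index]
    mixed_video = lam * video + (1.0 - lam) * video[index]
    mixed_labels = lam * labels + (1.0 - lam) * labels[index]
    return mixed_audio, mixed_video, mixed_labels
```

Sharing one mixing weight across modalities keeps a mixed example consistent: its audio and video come from the same pair of scenes, which is presumably what makes the scheme "joint" rather than per-modality mixup.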
Related papers
- AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models [92.92233932921741]
We propose the AV-SUPERB benchmark that enables general-purpose evaluation of unimodal audio/visual and bimodal fusion representations.
We evaluate 5 recent self-supervised models and show that none of these models generalize to all tasks.
We show that representations may be improved with intermediate-task fine-tuning, and that audio event classification with AudioSet serves as a strong intermediate task.
arXiv Detail & Related papers (2023-09-19T17:35:16Z)
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) network that devotes its main training parameters to multiple cross-modal attention layers.
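The summary above does not specify the CMFE internals; as a hedged illustration only, an audio-guided cross-modal attention layer can be built with PyTorch's nn.MultiheadAttention, with audio frames as queries attending over visual features. The dimensions and residual wiring are assumptions.

```python
import torch.nn as nn

class CrossModalFusionLayer(nn.Module):
    """One audio-guided cross-modal attention layer (illustrative, not the paper's CMFE)."""

    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio, visual):
        # audio:  (batch, T_a, dim) acoustic frames act as queries
        # visual: (batch, T_v, dim) lip-region features act as keys/values
        fused, _ = self.cross_attn(query=audio, key=visual, value=visual)
        return self.norm(audio + fused)  # residual keeps the audio stream dominant
```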
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
- Unraveling Instance Associations: A Closer Look for Audio-Visual Segmentation [18.001730255429347]
Audio-visual segmentation (AVS) is a challenging task that involves accurately segmenting sounding objects based on audio-visual cues.
We propose a new cost-effective strategy to build challenging and relatively unbiased high-quality audio-visual segmentation benchmarks.
Experiments conducted on existing AVS datasets and on our new benchmark show that our method achieves state-of-the-art (SOTA) segmentation accuracy.
arXiv Detail & Related papers (2023-04-06T09:54:06Z)
- AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR [79.21857972093332]
We present AVFormer, a method for augmenting audio-only models with visual information while performing lightweight domain adaptation.
We show that these additional modules can be trained on a small amount of weakly labelled video data with minimal extra training time and parameters.
We also introduce a simple curriculum scheme during training, which we show is crucial for enabling the model to jointly process audio and visual information effectively.
arXiv Detail & Related papers (2023-03-29T07:24:28Z)
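For the AVFormer entry above, the "lightweight adaptation" idea is commonly realized with bottleneck adapters: the pre-trained backbone stays frozen and only small residual modules train. The sketch below is a generic adapter of this kind, not AVFormer's actual architecture; all names and sizes are assumptions.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Small trainable module added to a frozen backbone (illustrative)."""

    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        # Residual form: if the adapter outputs ~0, frozen features pass through unchanged.
        return x + self.up(torch.relu(self.down(x)))

def freeze(module: nn.Module):
    """Freeze a backbone so only adapters (and any projections) receive gradients."""
    for p in module.parameters():
        p.requires_grad = False
```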
- AV-data2vec: Self-supervised Learning of Audio-Visual Speech Representations with Contextualized Target Representations [88.30635799280923]
We introduce AV-data2vec, which builds audio-visual representations by predicting contextualized target representations.
Results on LRS3 show that AV-data2vec consistently outperforms existing methods with the same amount of data and model size.
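A hedged sketch of the data2vec-style mechanic behind this: a teacher network, maintained as an exponential moving average (EMA) of the student, produces the contextualized targets that the student regresses from masked input. The decay value and wiring are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def ema_update(teacher: nn.Module, student: nn.Module, decay: float = 0.999):
    """Teacher parameters track an exponential moving average of the student's."""
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(decay).add_(s, alpha=1.0 - decay)

# Hypothetical training step: the student encodes masked audio-visual input,
# the EMA teacher encodes the unmasked input, and the student regresses the
# teacher's contextualized representations:
#   targets = teacher(inputs).detach()
#   loss = torch.nn.functional.mse_loss(student(masked_inputs), targets)
#   ema_update(teacher, student)
```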
arXiv Detail & Related papers (2023-02-10T02:55:52Z)
- Contrastive Audio-Visual Masked Autoencoder [85.53776628515561]
We propose the Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE), which combines contrastive learning and masked data modeling to learn joint audio-visual representations.
Our fully self-supervised pretrained CAV-MAE achieves a new SOTA accuracy of 65.9% on VGGSound.
arXiv Detail & Related papers (2022-10-02T07:29:57Z)
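For the CAV-MAE entry above, the contrastive half pairs each clip's audio and visual embeddings. A standard symmetric InfoNCE objective over a batch, shown below as a generic sketch rather than the authors' code, captures the idea; the temperature value is an assumption.

```python
import torch
import torch.nn.functional as F

def audio_visual_contrastive_loss(audio_emb, video_emb, temperature=0.07):
    """Symmetric InfoNCE: matching audio/video pairs attract, all others repel.

    audio_emb, video_emb: (batch, dim) clip-level embeddings from each modality.
    """
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    logits = a @ v.t() / temperature                    # (batch, batch) similarities
    targets = torch.arange(a.size(0), device=a.device)  # diagonal = true pairs
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```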
- Audio-Visual Scene Classification Using A Transfer Learning Based Joint Optimization Strategy [26.975596225131824]
We propose a joint training framework that uses acoustic features and raw images directly as inputs for the AVSC task.
Specifically, we reuse the bottom layers of pre-trained image models as the visual encoder, and jointly optimize the scene classifier and a 1D-CNN-based acoustic encoder during training.
arXiv Detail & Related papers (2022-04-25T03:37:02Z)
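For the transfer-learning entry above, the described pipeline (pre-trained image trunk as visual encoder, 1D-CNN acoustic encoder, jointly optimized classifier) can be sketched as follows. The class count, feature sizes, and ResNet-18 choice are illustrative assumptions, and the weights API requires torchvision >= 0.13.

```python
import torch
import torch.nn as nn
from torchvision import models

class JointAVSCModel(nn.Module):
    """Pre-trained visual trunk plus a 1D-CNN acoustic encoder, trained jointly
    with the scene classifier (illustrative sizes, not the paper's exact model)."""

    def __init__(self, num_classes=10, audio_dim=64):
        super().__init__()
        resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        self.visual_encoder = nn.Sequential(*list(resnet.children())[:-1])  # drop final fc
        self.acoustic_encoder = nn.Sequential(   # 1D CNN over e.g. log-mel frames
            nn.Conv1d(audio_dim, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.classifier = nn.Linear(512 + 128, num_classes)

    def forward(self, images, audio):
        # images: (batch, 3, H, W); audio: (batch, audio_dim, T)
        v = self.visual_encoder(images).flatten(1)    # (batch, 512)
        a = self.acoustic_encoder(audio).squeeze(-1)  # (batch, 128)
        return self.classifier(torch.cat([v, a], dim=1))
```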
- Squeeze-Excitation Convolutional Recurrent Neural Networks for Audio-Visual Scene Classification [4.191965713559235]
This paper presents a multi-modal model for automatic scene classification that simultaneously exploits auditory and visual information.
It has been shown to provide an excellent trade-off between prediction performance and system complexity.
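The squeeze-excitation mechanism named in the title is standard and easy to show: channels are globally pooled, passed through a small gating network, and used to rescale the feature map. The sketch below is the textbook SE block, with the reduction ratio as an assumption.

```python
import torch.nn as nn

class SqueezeExcitation(nn.Module):
    """Channel-wise squeeze-and-excitation block (standard formulation)."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        # x: (batch, channels, H, W) feature map from a conv layer
        w = x.mean(dim=(2, 3))          # squeeze: global average pool per channel
        w = self.fc(w)                  # excitation: learn channel-wise gates
        return x * w[:, :, None, None]  # rescale each channel of the input
```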
arXiv Detail & Related papers (2021-07-28T06:10:10Z)
- Multiresolution and Multimodal Speech Recognition with Transformers [22.995102995029576]
This paper presents an audio-visual automatic speech recognition (AV-ASR) system using a Transformer-based architecture.
We focus on the scene context provided by the visual information to ground the ASR.
Our results are comparable to state-of-the-art Listen, Attend and Spell-based architectures.
arXiv Detail & Related papers (2020-04-29T09:32:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences arising from its use.