Squeeze-Excitation Convolutional Recurrent Neural Networks for
Audio-Visual Scene Classification
- URL: http://arxiv.org/abs/2107.13180v1
- Date: Wed, 28 Jul 2021 06:10:10 GMT
- Title: Squeeze-Excitation Convolutional Recurrent Neural Networks for
Audio-Visual Scene Classification
- Authors: Javier Naranjo-Alcazar, Sergi Perez-Castanos, Aaron Lopez-Garcia,
Pedro Zuccarello, Maximo Cobos, Francesc J. Ferri
- Abstract summary: This paper presents a multi-modal model for automatic scene classification.
It simultaneously exploits auditory and visual information.
It has been shown to provide an excellent trade-off between prediction performance and system complexity.
- Score: 4.191965713559235
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: The use of multiple and semantically correlated sources can provide
complementary information to each other that may not be evident when working
with individual modalities on their own. In this context, multi-modal models
can help produce more accurate and robust predictions in machine learning
tasks where audio-visual data is available. This paper presents a multi-modal
model for automatic scene classification that simultaneously exploits auditory
and visual information. The proposed approach makes use of two separate
networks which are respectively trained in isolation on audio and visual data,
so that each network specializes in a given modality. The visual subnetwork is
a pre-trained VGG16 model followed by a bidirectional recurrent layer, while the
residual audio subnetwork is based on stacked squeeze-excitation convolutional
blocks trained from scratch. After training each subnetwork, the fusion of
information from the audio and visual streams is performed at two different
stages. The early fusion stage combines features resulting from the last
convolutional block of the respective subnetworks at different time steps to
feed a bidirectional recurrent structure. The late fusion stage combines the
output of the early fusion stage with the independent predictions provided by
the two subnetworks, resulting in the final prediction. We evaluate the method
using the recently published TAU Audio-Visual Urban Scenes 2021 dataset, which contains
synchronized audio and video recordings from 12 European cities in 10 different
scene classes. The proposed model has been shown to provide an excellent
trade-off between prediction performance (86.5%) and system complexity (15M
parameters) in the evaluation results of the DCASE 2021 Challenge.
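For readers who want a concrete picture of the architecture described in the abstract, the sketch below wires together its components in PyTorch: a VGG16-based visual branch with a bidirectional recurrent layer, an audio branch built from stacked residual squeeze-excitation convolutional blocks, an early-fusion bidirectional recurrent stage over concatenated per-time-step features, and a late fusion of the three predictions. The layer widths, the choice of GRU cells, the nearest-neighbour alignment of audio frames to video frames, and the plain averaging used for late fusion are illustrative assumptions, not the configuration reported in the paper.

```python
# Minimal PyTorch sketch of the two-stage fusion idea described in the abstract.
# Layer sizes, the GRU cells, the time alignment, and the averaging used for
# late fusion are illustrative assumptions, not the authors' exact configuration.
import torch
import torch.nn as nn
import torchvision.models as models


class SEConvBlock(nn.Module):
    """Residual convolutional block with squeeze-and-excitation recalibration."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )
        # Squeeze: global average pooling; excitation: bottleneck MLP + sigmoid.
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        y = self.conv(x)
        y = y * self.se(y)          # channel-wise recalibration
        return torch.relu(x + y)    # residual connection


class AVSceneClassifier(nn.Module):
    def __init__(self, num_classes=10, audio_channels=64, hidden=128):
        super().__init__()
        # Visual subnetwork: pre-trained VGG16 features + bidirectional GRU.
        self.visual_cnn = models.vgg16(weights="IMAGENET1K_V1").features
        self.visual_rnn = nn.GRU(512, hidden, batch_first=True, bidirectional=True)
        # Audio subnetwork: stem + stacked SE blocks, trained from scratch.
        self.audio_stem = nn.Conv2d(1, audio_channels, 3, padding=1)
        self.audio_blocks = nn.Sequential(*[SEConvBlock(audio_channels) for _ in range(3)])
        # Early fusion: concatenated per-time-step features feed a bidirectional GRU.
        self.fusion_rnn = nn.GRU(512 + audio_channels, hidden,
                                 batch_first=True, bidirectional=True)
        # Per-branch and fusion classifiers; late fusion averages their logits.
        self.visual_head = nn.Linear(2 * hidden, num_classes)
        self.audio_head = nn.Linear(audio_channels, num_classes)
        self.fusion_head = nn.Linear(2 * hidden, num_classes)

    def forward(self, frames, spectrogram):
        # frames: (B, T, 3, 224, 224); spectrogram: (B, 1, mels, time)
        B, T = frames.shape[:2]
        v = self.visual_cnn(frames.flatten(0, 1))             # (B*T, 512, h, w)
        v = v.mean(dim=(2, 3)).view(B, T, 512)                # per-frame embeddings
        a = self.audio_blocks(self.audio_stem(spectrogram))   # (B, C, mels, time')
        a_seq = a.mean(dim=2).transpose(1, 2)                 # (B, time', C)
        a_seq = nn.functional.interpolate(
            a_seq.transpose(1, 2), size=T).transpose(1, 2)    # align to T video steps
        # Independent predictions from each modality.
        v_logits = self.visual_head(self.visual_rnn(v)[0][:, -1])
        a_logits = self.audio_head(a_seq.mean(dim=1))
        # Early fusion of per-time-step features, then late fusion of predictions.
        fused, _ = self.fusion_rnn(torch.cat([v, a_seq], dim=-1))
        f_logits = self.fusion_head(fused[:, -1])
        return (v_logits + a_logits + f_logits) / 3
```

Note that, as described in the abstract, each subnetwork is first trained in isolation on its own modality before the fusion stages are added; the sketch only shows the resulting forward pass.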
Related papers
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation
Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
- Cross-modal Audio-visual Co-learning for Text-independent Speaker
Verification [55.624946113550195]
This paper proposes a cross-modal speech co-learning paradigm.
Two cross-modal boosters are introduced based on an audio-visual pseudo-siamese structure to learn the modality-transformed correlation.
Experimental results on the LRSLip3, GridLip, LomGridLip, and VoxLip datasets demonstrate that our proposed method achieves 60% and 20% average relative performance improvement.
arXiv Detail & Related papers (2023-02-22T10:06:37Z)
- MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and
Video Generation [70.74377373885645]
We propose the first joint audio-video generation framework that brings engaging watching and listening experiences simultaneously.
MM-Diffusion consists of a sequential multi-modal U-Net for a joint denoising process by design.
Experiments show superior results in unconditional audio-video generation, and zero-shot conditional tasks.
arXiv Detail & Related papers (2022-12-19T14:11:52Z)
- A study on joint modeling and data augmentation of multi-modalities for
audio-visual scene classification [64.59834310846516]
We propose two techniques, namely joint modeling and data augmentation, to improve system performance for audio-visual scene classification (AVSC).
Our final system can achieve the best accuracy of 94.2% among all single AVSC systems submitted to DCASE 2021 Task 1b.
arXiv Detail & Related papers (2022-03-07T07:29:55Z)
- Joint Learning of Visual-Audio Saliency Prediction and Sound Source
Localization on Multi-face Videos [101.83513408195692]
We propose a multitask learning method for visual-audio saliency prediction and sound source localization on multi-face videos.
The proposed method outperforms 12 state-of-the-art saliency prediction methods, and achieves competitive results in sound source localization.
arXiv Detail & Related papers (2021-11-05T14:35:08Z)
- Attention Bottlenecks for Multimodal Fusion [90.75885715478054]
Machine perception models are typically modality-specific and optimised for unimodal benchmarks.
We introduce a novel transformer-based architecture that uses 'fusion bottlenecks' for modality fusion at multiple layers.
We conduct thorough ablation studies, and achieve state-of-the-art results on multiple audio-visual classification benchmarks.
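As background on the bottleneck mechanism summarized above: cross-modal information is exchanged only through a small set of shared bottleneck tokens, which keeps cross-modal attention cheap while still letting the modalities condition on each other. The sketch below illustrates that idea with standard PyTorch transformer encoder layers; the token dimensions, the number of bottleneck tokens, and the averaging of the two bottleneck updates are illustrative assumptions, not the paper's exact design.

```python
# A minimal sketch of the bottleneck-token idea: each modality runs its own
# transformer layer, and cross-modal information flows only through a small set
# of shared bottleneck tokens. Sizes and the averaging step are assumptions.
import torch
import torch.nn as nn


class BottleneckFusionLayer(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.audio_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.video_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)

    def forward(self, audio_tokens, video_tokens, bottleneck):
        # Each modality attends over its own tokens plus the shared bottleneck tokens.
        a = self.audio_layer(torch.cat([audio_tokens, bottleneck], dim=1))
        v = self.video_layer(torch.cat([video_tokens, bottleneck], dim=1))
        n_a, n_v = audio_tokens.size(1), video_tokens.size(1)
        audio_tokens, bottleneck_a = a[:, :n_a], a[:, n_a:]
        video_tokens, bottleneck_v = v[:, :n_v], v[:, n_v:]
        # Merge the two modality-specific bottleneck updates (here: averaged).
        bottleneck = 0.5 * (bottleneck_a + bottleneck_v)
        return audio_tokens, video_tokens, bottleneck


# Usage: four shared bottleneck tokens mediate all cross-modal exchange.
layer = BottleneckFusionLayer()
audio = torch.randn(2, 100, 256)      # e.g. spectrogram patch tokens
video = torch.randn(2, 196, 256)      # e.g. frame patch tokens
bottleneck = torch.zeros(2, 4, 256)
audio, video, bottleneck = layer(audio, video, bottleneck)
```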
arXiv Detail & Related papers (2021-06-30T22:44:12Z)
- Speech Prediction in Silent Videos using Variational Autoencoders [29.423462898526605]
We present a model for generating speech in a silent video.
The proposed model combines recurrent neural networks and variational deep generative models to learn the conditional distribution of the auditory signal.
We demonstrate the performance of our model on the GRID dataset based on standard benchmarks.
arXiv Detail & Related papers (2020-11-14T17:09:03Z)
- Deep Variational Generative Models for Audio-visual Speech Separation [33.227204390773316]
We propose an unsupervised technique based on audio-visual generative modeling of clean speech.
To better utilize the visual information, the posteriors of the latent variables are inferred from the mixed speech together with the visual data.
Our experiments show that the proposed unsupervised VAE-based method yields better separation performance than NMF-based approaches.
arXiv Detail & Related papers (2020-08-17T10:12:33Z)
- Look and Listen: A Multi-modality Late Fusion Approach to Scene
Classification for Autonomous Machines [5.452798072984612]
The novelty of this study consists in a multi-modality approach to scene classification, where image and audio complement each other in a process of deep late fusion.
The approach is demonstrated on a difficult classification problem, consisting of two synchronised and balanced datasets of 16,000 data objects.
We show that situations where a single modality may be confused by anomalous data points are now corrected through an emerging higher-order integration.
arXiv Detail & Related papers (2020-07-11T16:47:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.