Squeeze-Excitation Convolutional Recurrent Neural Networks for
Audio-Visual Scene Classification
- URL: http://arxiv.org/abs/2107.13180v1
- Date: Wed, 28 Jul 2021 06:10:10 GMT
- Title: Squeeze-Excitation Convolutional Recurrent Neural Networks for
Audio-Visual Scene Classification
- Authors: Javier Naranjo-Alcazar, Sergi Perez-Castanos, Aaron Lopez-Garcia,
Pedro Zuccarello, Maximo Cobos, Francesc J. Ferri
- Abstract summary: This paper presents a multi-modal model for automatic scene classification.
It simultaneously exploits auditory and visual information.
It has been shown to provide an excellent trade-off between prediction performance and system complexity.
- Score: 4.191965713559235
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: The use of multiple and semantically correlated sources can provide
complementary information to each other that may not be evident when working
with individual modalities on their own. In this context, multi-modal models
can help produce more accurate and robust predictions in machine learning
tasks where audio-visual data is available. This paper presents a multi-modal
model for automatic scene classification that simultaneously exploits auditory
and visual information. The proposed approach makes use of two separate
networks which are respectively trained in isolation on audio and visual data,
so that each network specializes in a given modality. The visual subnetwork is
a pre-trained VGG16 model followed by a bidirectional recurrent layer, while the
residual audio subnetwork is based on stacked squeeze-excitation convolutional
blocks trained from scratch. After training each subnetwork, the fusion of
information from the audio and visual streams is performed at two different
stages. The early fusion stage combines features resulting from the last
convolutional block of the respective subnetworks at different time steps to
feed a bidirectional recurrent structure. The late fusion stage combines the
output of the early fusion stage with the independent predictions provided by
the two subnetworks, resulting in the final prediction. We evaluate the method
using the recently published TAU Audio-Visual Urban Scenes 2021 dataset, which contains
synchronized audio and video recordings from 12 European cities in 10 different
scene classes. The proposed model has been shown to provide an excellent
trade-off between prediction performance (86.5%) and system complexity (15M
parameters) in the evaluation results of the DCASE 2021 Challenge.
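For readers who want a concrete picture of the architecture described in the abstract, the sketch below wires together its components in PyTorch: a VGG16-based visual branch with a bidirectional recurrent layer, an audio branch built from stacked residual squeeze-excitation convolutional blocks, an early-fusion bidirectional recurrent stage over concatenated per-time-step features, and a late fusion of the three predictions. The layer widths, the choice of GRU cells, the nearest-neighbour alignment of audio frames to video frames, and the plain averaging used for late fusion are illustrative assumptions, not the configuration reported in the paper.

```python
# Minimal PyTorch sketch of the two-stage fusion idea described in the abstract.
# Layer sizes, the GRU cells, the time alignment, and the averaging used for
# late fusion are illustrative assumptions, not the authors' exact configuration.
import torch
import torch.nn as nn
import torchvision.models as models


class SEConvBlock(nn.Module):
    """Residual convolutional block with squeeze-and-excitation recalibration."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )
        # Squeeze: global average pooling; excitation: bottleneck MLP + sigmoid.
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        y = self.conv(x)
        y = y * self.se(y)          # channel-wise recalibration
        return torch.relu(x + y)    # residual connection


class AVSceneClassifier(nn.Module):
    def __init__(self, num_classes=10, audio_channels=64, hidden=128):
        super().__init__()
        # Visual subnetwork: pre-trained VGG16 features + bidirectional GRU.
        self.visual_cnn = models.vgg16(weights="IMAGENET1K_V1").features
        self.visual_rnn = nn.GRU(512, hidden, batch_first=True, bidirectional=True)
        # Audio subnetwork: stem + stacked SE blocks, trained from scratch.
        self.audio_stem = nn.Conv2d(1, audio_channels, 3, padding=1)
        self.audio_blocks = nn.Sequential(*[SEConvBlock(audio_channels) for _ in range(3)])
        # Early fusion: concatenated per-time-step features feed a bidirectional GRU.
        self.fusion_rnn = nn.GRU(512 + audio_channels, hidden,
                                 batch_first=True, bidirectional=True)
        # Per-branch and fusion classifiers; late fusion averages their logits.
        self.visual_head = nn.Linear(2 * hidden, num_classes)
        self.audio_head = nn.Linear(audio_channels, num_classes)
        self.fusion_head = nn.Linear(2 * hidden, num_classes)

    def forward(self, frames, spectrogram):
        # frames: (B, T, 3, 224, 224); spectrogram: (B, 1, mels, time)
        B, T = frames.shape[:2]
        v = self.visual_cnn(frames.flatten(0, 1))             # (B*T, 512, h, w)
        v = v.mean(dim=(2, 3)).view(B, T, 512)                # per-frame embeddings
        a = self.audio_blocks(self.audio_stem(spectrogram))   # (B, C, mels, time')
        a_seq = a.mean(dim=2).transpose(1, 2)                 # (B, time', C)
        a_seq = nn.functional.interpolate(
            a_seq.transpose(1, 2), size=T).transpose(1, 2)    # align to T video steps
        # Independent predictions from each modality.
        v_logits = self.visual_head(self.visual_rnn(v)[0][:, -1])
        a_logits = self.audio_head(a_seq.mean(dim=1))
        # Early fusion of per-time-step features, then late fusion of predictions.
        fused, _ = self.fusion_rnn(torch.cat([v, a_seq], dim=-1))
        f_logits = self.fusion_head(fused[:, -1])
        return (v_logits + a_logits + f_logits) / 3
```

Note that, as described in the abstract, each subnetwork is first trained in isolation on its own modality before the fusion stages are added; the sketch only shows the resulting forward pass.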
Related papers
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation
Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
- Cross-modal Audio-visual Co-learning for Text-independent Speaker
Verification [55.624946113550195]
This paper proposes a cross-modal speech co-learning paradigm.
Two cross-modal boosters are introduced based on an audio-visual pseudo-siamese structure to learn the modality-transformed correlation.
Experimental results on the LRSLip3, GridLip, LomGridLip, and VoxLip datasets demonstrate that our proposed method achieves 60% and 20% average relative performance improvement.
arXiv Detail & Related papers (2023-02-22T10:06:37Z)
- MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and
Video Generation [70.74377373885645]
We propose the first joint audio-video generation framework that brings engaging watching and listening experiences simultaneously.
MM-Diffusion consists of a sequential multi-modal U-Net for a joint denoising process by design.
Experiments show superior results in unconditional audio-video generation, and zero-shot conditional tasks.
arXiv Detail & Related papers (2022-12-19T14:11:52Z)
- A study on joint modeling and data augmentation of multi-modalities for
audio-visual scene classification [64.59834310846516]
We propose two techniques, namely joint modeling and data augmentation, to improve system performance for audio-visual scene classification (AVSC).
Our final system can achieve the best accuracy of 94.2% among all single AVSC systems submitted to DCASE 2021 Task 1b.
arXiv Detail & Related papers (2022-03-07T07:29:55Z)
- Joint Learning of Visual-Audio Saliency Prediction and Sound Source
Localization on Multi-face Videos [101.83513408195692]
We propose a multitask learning method for visual-audio saliency prediction and sound source localization on multi-face videos.
The proposed method outperforms 12 state-of-the-art saliency prediction methods, and achieves competitive results in sound source localization.
arXiv Detail & Related papers (2021-11-05T14:35:08Z)
- Attention Bottlenecks for Multimodal Fusion [90.75885715478054]
Machine perception models are typically modality-specific and optimised for unimodal benchmarks.
We introduce a novel transformer-based architecture that uses 'fusion bottlenecks' for modality fusion at multiple layers.
We conduct thorough ablation studies, and achieve state-of-the-art results on multiple audio-visual classification benchmarks.
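As background on the bottleneck mechanism summarized above: cross-modal information is exchanged only through a small set of shared bottleneck tokens, which keeps cross-modal attention cheap while still letting the modalities condition on each other. The sketch below illustrates that idea with standard PyTorch transformer encoder layers; the token dimensions, the number of bottleneck tokens, and the averaging of the two bottleneck updates are illustrative assumptions, not the paper's exact design.

```python
# A minimal sketch of the bottleneck-token idea: each modality runs its own
# transformer layer, and cross-modal information flows only through a small set
# of shared bottleneck tokens. Sizes and the averaging step are assumptions.
import torch
import torch.nn as nn


class BottleneckFusionLayer(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.audio_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.video_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)

    def forward(self, audio_tokens, video_tokens, bottleneck):
        # Each modality attends over its own tokens plus the shared bottleneck tokens.
        a = self.audio_layer(torch.cat([audio_tokens, bottleneck], dim=1))
        v = self.video_layer(torch.cat([video_tokens, bottleneck], dim=1))
        n_a, n_v = audio_tokens.size(1), video_tokens.size(1)
        audio_tokens, bottleneck_a = a[:, :n_a], a[:, n_a:]
        video_tokens, bottleneck_v = v[:, :n_v], v[:, n_v:]
        # Merge the two modality-specific bottleneck updates (here: averaged).
        bottleneck = 0.5 * (bottleneck_a + bottleneck_v)
        return audio_tokens, video_tokens, bottleneck


# Usage: four shared bottleneck tokens mediate all cross-modal exchange.
layer = BottleneckFusionLayer()
audio = torch.randn(2, 100, 256)      # e.g. spectrogram patch tokens
video = torch.randn(2, 196, 256)      # e.g. frame patch tokens
bottleneck = torch.zeros(2, 4, 256)
audio, video, bottleneck = layer(audio, video, bottleneck)
```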
arXiv Detail & Related papers (2021-06-30T22:44:12Z)
- Speech Prediction in Silent Videos using Variational Autoencoders [29.423462898526605]
We present a model for generating speech in a silent video.
The proposed model combines recurrent neural networks and variational deep generative models to learn the conditional distribution of the auditory signal.
We demonstrate the performance of our model on the GRID dataset based on standard benchmarks.
arXiv Detail & Related papers (2020-11-14T17:09:03Z)
- Deep Variational Generative Models for Audio-visual Speech Separation [33.227204390773316]
We propose an unsupervised technique based on audio-visual generative modeling of clean speech.
To better utilize the visual information, the posteriors of the latent variables are inferred from the mixed speech together with the visual data.
Our experiments show that the proposed unsupervised VAE-based method yields better separation performance than NMF-based approaches.
arXiv Detail & Related papers (2020-08-17T10:12:33Z)
- Look and Listen: A Multi-modality Late Fusion Approach to Scene
Classification for Autonomous Machines [5.452798072984612]
The novelty of this study consists in a multi-modality approach to scene classification, where image and audio complement each other in a process of deep late fusion.
The approach is demonstrated on a difficult classification problem, consisting of two synchronised and balanced datasets of 16,000 data objects.
We show that situations where a single modality may be confused by anomalous data points are now corrected through an emerging higher-order integration.
arXiv Detail & Related papers (2020-07-11T16:47:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.