Look and Listen: A Multi-modality Late Fusion Approach to Scene
Classification for Autonomous Machines
- URL: http://arxiv.org/abs/2007.10175v1
- Date: Sat, 11 Jul 2020 16:47:05 GMT
- Title: Look and Listen: A Multi-modality Late Fusion Approach to Scene
Classification for Autonomous Machines
- Authors: Jordan J. Bird, Diego R. Faria, Cristiano Premebida, Anikó Ekárt,
George Vogiatzis
- Abstract summary: The novelty of this study consists in a multi-modality approach to scene classification, where image and audio complement each other in a process of deep late fusion.
The approach is demonstrated on a difficult classification problem, consisting of two synchronised and balanced datasets of 16,000 data objects.
We show that situations where a single-modality may be confused by anomalous data points are now corrected through an emerging higher order integration.
- Score: 5.452798072984612
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The novelty of this study consists in a multi-modality approach to scene
classification, where image and audio complement each other in a process of
deep late fusion. The approach is demonstrated on a difficult classification
problem, consisting of two synchronised and balanced datasets of 16,000 data
objects, encompassing 4.4 hours of video of 8 environments with varying degrees
of similarity. We first extract video frames and accompanying audio at one
second intervals. The image and the audio datasets are first classified
independently, using a fine-tuned VGG16 and an evolutionary optimised deep
neural network, with accuracies of 89.27% and 93.72%, respectively. This is
followed by late fusion of the two neural networks to enable a higher order
function, leading to accuracy of 96.81% in this multi-modality classifier with
synchronised video frames and audio clips. The tertiary neural network
implemented for late fusion outperforms classical state-of-the-art classifiers
by around 3% when the two primary networks are considered as feature
generators. We show that situations where a single-modality may be confused by
anomalous data points are now corrected through an emerging higher order
integration. Prominent examples include a water feature in a city misclassified
as a river by the audio classifier alone and a densely crowded street
misclassified as a forest by the image classifier alone. Both are examples
which are correctly classified by our multi-modality approach.
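As a hedged illustration of the late-fusion step described in the abstract, the sketch below treats the two primary classifiers as frozen feature generators whose synchronised outputs are concatenated and passed to a small tertiary network. The branch names, layer sizes, and the use of 8-way probability vectors as fused features are assumptions made for illustration, not the authors' exact configuration.

```python
# Minimal late-fusion sketch (PyTorch). The image and audio branches stand in
# for the fine-tuned VGG16 and the evolutionary-optimised audio DNN described
# in the abstract; here they are treated as generic frozen feature generators.
import torch
import torch.nn as nn

NUM_CLASSES = 8  # eight environments, as in the paper


class FusionHead(nn.Module):
    """Tertiary network that fuses the outputs of the two primary classifiers."""

    def __init__(self, img_dim: int, aud_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(img_dim + aud_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, NUM_CLASSES),
        )

    def forward(self, img_feat: torch.Tensor, aud_feat: torch.Tensor) -> torch.Tensor:
        # Late fusion: concatenate the per-modality outputs and let the
        # tertiary network learn the higher-order decision.
        return self.net(torch.cat([img_feat, aud_feat], dim=1))


# Hypothetical usage with frozen primary networks `image_net` and `audio_net`,
# each emitting an 8-way probability vector for a synchronised frame/clip pair:
# with torch.no_grad():
#     img_feat = image_net(frame_batch)   # (B, 8)
#     aud_feat = audio_net(audio_batch)   # (B, 8)
# logits = FusionHead(img_dim=8, aud_dim=8)(img_feat, aud_feat)
```

Keeping the primary networks frozen mirrors the "feature generator" framing in the abstract; only the tertiary head is trained on the fused representation.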
Related papers
- Decoupled Mixup for Generalized Visual Recognition [71.13734761715472]
We propose a novel "Decoupled-Mixup" method to train CNN models for visual recognition.
Our method decouples each image into discriminative and noise-prone regions, and then heterogeneously combines these regions to train CNN models.
Experiment results show the high generalization performance of our method on testing data that are composed of unseen contexts.
arXiv Detail & Related papers (2022-10-26T15:21:39Z) - Real-time Speaker counting in a cocktail party scenario using
Attention-guided Convolutional Neural Network [60.99112031408449]
We propose a real-time, single-channel attention-guided Convolutional Neural Network (CNN) to estimate the number of active speakers in overlapping speech.
The proposed system extracts higher-level information from the speech spectral content using a CNN model.
Experiments on simulated overlapping speech using the WSJ corpus show that the attention solution improves performance by almost 3% absolute over conventional temporal average pooling (a minimal attention-pooling sketch appears after this list).
arXiv Detail & Related papers (2021-10-30T19:24:57Z) - Squeeze-Excitation Convolutional Recurrent Neural Networks for
Audio-Visual Scene Classification [4.191965713559235]
This paper presents a multi-modal model for automatic scene classification.
It exploits auditory and visual information simultaneously.
It has been shown to provide an excellent trade-off between prediction performance and system complexity.
arXiv Detail & Related papers (2021-07-28T06:10:10Z) - Attention Bottlenecks for Multimodal Fusion [90.75885715478054]
Machine perception models are typically modality-specific and optimised for unimodal benchmarks.
We introduce a novel transformer-based architecture that uses 'fusion bottlenecks' for modality fusion at multiple layers.
We conduct thorough ablation studies, and achieve state-of-the-art results on multiple audio-visual classification benchmarks.
arXiv Detail & Related papers (2021-06-30T22:44:12Z) - Correlated Input-Dependent Label Noise in Large-Scale Image
Classification [4.979361059762468]
We take a principled probabilistic approach to modelling input-dependent, also known as heteroscedastic, label noise in datasets.
We demonstrate that the learned covariance structure captures known sources of label noise between semantically similar and co-occurring classes.
We set a new state-of-the-art result on WebVision 1.0 with 76.6% top-1 accuracy.
arXiv Detail & Related papers (2021-05-19T17:30:59Z) - Looking into Your Speech: Learning Cross-modal Affinity for Audio-visual
Speech Separation [73.1652905564163]
We address the problem of separating individual speech signals from videos using audio-visual neural processing.
Most conventional approaches utilize frame-wise matching criteria to extract shared information between co-occurring audio and video.
We propose a cross-modal affinity network (CaffNet) that learns global correspondence as well as locally-varying affinities between audio and visual streams.
arXiv Detail & Related papers (2021-03-25T15:39:12Z) - Speech Prediction in Silent Videos using Variational Autoencoders [29.423462898526605]
We present a model for generating speech in a silent video.
The proposed model combines recurrent neural networks and variational deep generative models to learn the conditional distribution of the auditory signal.
We demonstrate the performance of our model on the GRID dataset based on standard benchmarks.
arXiv Detail & Related papers (2020-11-14T17:09:03Z) - A Two-Stage Approach to Device-Robust Acoustic Scene Classification [63.98724740606457]
A two-stage system based on fully convolutional neural networks (CNNs) is proposed to improve device robustness.
Our results show that the proposed ASC system attains a state-of-the-art accuracy on the development set.
Neural saliency analysis with class activation mapping gives new insights on the patterns learnt by our models.
arXiv Detail & Related papers (2020-11-03T03:27:18Z) - Device-Robust Acoustic Scene Classification Based on Two-Stage
Categorization and Data Augmentation [63.98724740606457]
We present a joint effort of four groups, namely GT, USTC, Tencent, and UKE, to tackle Task 1 - Acoustic Scene Classification (ASC) in the DCASE 2020 Challenge.
Task 1a focuses on ASC of audio signals recorded with multiple (real and simulated) devices into ten different fine-grained classes.
Task 1b concerns the classification of data into three higher-level classes using low-complexity solutions.
arXiv Detail & Related papers (2020-07-16T15:07:14Z) - Not only Look, but also Listen: Learning Multimodal Violence Detection
under Weak Supervision [10.859792341257931]
We first release a large-scale and multi-scene dataset named XD-Violence with a total duration of 217 hours.
We propose a neural network containing three parallel branches to capture different relations among video snippets and integrate features.
Our method outperforms other state-of-the-art methods on our released dataset and other existing benchmarks.
arXiv Detail & Related papers (2020-07-09T10:29:31Z) - End-to-End Lip Synchronisation Based on Pattern Classification [15.851638021923875]
We propose an end-to-end trained network that can directly predict the offset between an audio stream and the corresponding video stream.
We demonstrate that the proposed approach outperforms the previous work by a large margin on LRS2 and LRS3 datasets.
arXiv Detail & Related papers (2020-05-18T11:42:32Z)
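The speaker-counting entry above contrasts attention-guided pooling with conventional temporal average pooling. The sketch below is a minimal, hypothetical version of that comparison, assuming per-frame CNN features of shape (batch, time, feat_dim) and a single linear attention scorer; it is not the exact architecture of any paper listed here.

```python
# Hedged sketch: attention-weighted temporal pooling vs. average pooling
# over per-frame features from a CNN front end (PyTorch).
import torch
import torch.nn as nn


class AttentionPool(nn.Module):
    """Weights each time step by a learned score before summing."""

    def __init__(self, feat_dim: int):
        super().__init__()
        self.scorer = nn.Linear(feat_dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, feat_dim) per-frame features
        weights = torch.softmax(self.scorer(x), dim=1)  # (batch, time, 1)
        return (weights * x).sum(dim=1)                 # (batch, feat_dim)


def average_pool(x: torch.Tensor) -> torch.Tensor:
    # Conventional temporal average pooling baseline.
    return x.mean(dim=1)
```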