A Two-Stage Approach to Device-Robust Acoustic Scene Classification
- URL: http://arxiv.org/abs/2011.01447v1
- Date: Tue, 3 Nov 2020 03:27:18 GMT
- Title: A Two-Stage Approach to Device-Robust Acoustic Scene Classification
- Authors: Hu Hu, Chao-Han Huck Yang, Xianjun Xia, Xue Bai, Xin Tang, Yajian
Wang, Shutong Niu, Li Chai, Juanjuan Li, Hongning Zhu, Feng Bao, Yuanjun
Zhao, Sabato Marco Siniscalchi, Yannan Wang, Jun Du, Chin-Hui Lee
- Abstract summary: A two-stage system based on fully convolutional neural networks (CNNs) is proposed to improve device robustness.
Our results show that the proposed ASC system attains state-of-the-art accuracy on the development set.
Neural saliency analysis with class activation mapping gives new insights into the patterns learnt by our models.
- Score: 63.98724740606457
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: To improve device robustness, a highly desirable key feature of a competitive
data-driven acoustic scene classification (ASC) system, a novel two-stage
system based on fully convolutional neural networks (CNNs) is proposed. Our
two-stage system leverages an ad-hoc score combination of two CNN
classifiers: (i) the first CNN classifies acoustic inputs into one of three
broad classes, and (ii) the second CNN classifies the same inputs into one of
ten finer-grained classes. Three different CNN architectures are explored to
implement the two-stage classifiers, and a frequency sub-sampling scheme is
investigated. Moreover, novel data augmentation schemes for ASC are also
investigated. Evaluated on DCASE 2020 Task 1a, our results show that the
proposed ASC system attains state-of-the-art accuracy on the development set,
where our best system, a two-stage fusion of CNN ensembles, delivers an 81.9%
average accuracy on multi-device test data and obtains a significant
improvement on unseen devices. Finally, neural saliency analysis with class
activation mapping (CAM) gives new insights into the patterns learnt by our
models.
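Below is a minimal sketch, in Python/NumPy, of the two-stage score combination described in the abstract. The coarse-to-fine class grouping, the log-linear fusion rule, and the weight alpha are illustrative assumptions for this sketch rather than details taken from the paper; in the actual system each stage is a trained CNN producing the class scores that are fused here.

```python
import numpy as np

# DCASE 2020 Task 1a fine-grained scene classes.
FINE_CLASSES = [
    "airport", "bus", "metro", "metro_station", "park",
    "public_square", "shopping_mall", "street_pedestrian",
    "street_traffic", "tram",
]

# Illustrative grouping into three broad classes; the exact grouping used
# by the two-stage system is an assumption here.
FINE_TO_COARSE = {
    "airport": "indoor", "metro_station": "indoor", "shopping_mall": "indoor",
    "park": "outdoor", "public_square": "outdoor",
    "street_pedestrian": "outdoor", "street_traffic": "outdoor",
    "bus": "transportation", "metro": "transportation", "tram": "transportation",
}
COARSE_CLASSES = ["indoor", "outdoor", "transportation"]


def two_stage_fusion(coarse_probs, fine_probs, alpha=0.5):
    """Fuse a 3-class CNN and a 10-class CNN by combining their scores.

    coarse_probs: shape (3,), softmax output of the first-stage CNN.
    fine_probs:   shape (10,), softmax output of the second-stage CNN.
    alpha:        fusion weight (assumed hyper-parameter, tuned on dev data).
    Returns the index of the predicted fine-grained class.
    """
    # Map each fine class to the score of the broad class that contains it.
    coarse_on_fine = np.array([
        coarse_probs[COARSE_CLASSES.index(FINE_TO_COARSE[name])]
        for name in FINE_CLASSES
    ])
    # Log-linear combination of the two classifiers' scores.
    fused = alpha * np.log(fine_probs + 1e-12) \
        + (1.0 - alpha) * np.log(coarse_on_fine + 1e-12)
    return int(np.argmax(fused))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    coarse = rng.dirichlet(np.ones(3))   # stand-in for a real first-stage CNN output
    fine = rng.dirichlet(np.ones(10))    # stand-in for a real second-stage CNN output
    print(FINE_CLASSES[two_stage_fusion(coarse, fine)])
```

In the paper's best configuration each stage is itself an ensemble of several CNN architectures; the same combination applies unchanged if coarse_probs and fine_probs are ensemble-averaged scores.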
Related papers
- AFEN: Respiratory Disease Classification using Ensemble Learning [2.524195881002773]
We present AFEN (Audio Feature Ensemble Learning), a model that leverages Convolutional Neural Networks (CNNs) and XGBoost.
We use a meticulously selected mix of audio features which provide the salient attributes of the data and allow for accurate classification.
We empirically verify that AFEN sets a new state-of-the-art using Precision and Recall as metrics, while decreasing training time by 60%.
arXiv Detail & Related papers (2024-05-08T23:50:54Z)
- MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that promotes representation of each modality by fusing them at different levels of audio/visual encoders.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
arXiv Detail & Related papers (2024-01-07T08:59:32Z)
- Automatic Machine Learning for Multi-Receiver CNN Technology Classifiers [16.244541005112747]
Convolutional Neural Networks (CNNs) are one of the most studied families of deep learning models for signal classification.
We focus on technology classification based on raw I/Q samples collected from multiple synchronized receivers.
arXiv Detail & Related papers (2022-04-28T23:41:38Z)
- Wider or Deeper Neural Network Architecture for Acoustic Scene Classification with Mismatched Recording Devices [59.86658316440461]
We present a robust and low-complexity system for Acoustic Scene Classification (ASC).
We first construct an ASC baseline system in which a novel inception-residual-based network architecture is proposed to deal with the mismatched recording device issue.
To further improve performance while keeping model complexity low, we apply two techniques: an ensemble of multiple spectrograms and channel reduction (a minimal sketch of the spectrogram-ensemble idea appears after this list).
arXiv Detail & Related papers (2022-03-23T10:27:41Z)
- Device-Robust Acoustic Scene Classification Based on Two-Stage Categorization and Data Augmentation [63.98724740606457]
We present a joint effort of four groups, namely GT, USTC, Tencent, and UKE, to tackle Task 1 - Acoustic Scene Classification (ASC) in the DCASE 2020 Challenge.
Task 1a focuses on ASC of audio signals recorded with multiple (real and simulated) devices into ten different fine-grained classes.
Task 1b concerns the classification of data into three higher-level classes using low-complexity solutions.
arXiv Detail & Related papers (2020-07-16T15:07:14Z)
- One-Shot Object Detection without Fine-Tuning [62.39210447209698]
We introduce a two-stage model consisting of a first stage Matching-FCOS network and a second stage Structure-Aware Relation Module.
We also propose novel training strategies that effectively improve detection performance.
Our method exceeds the state-of-the-art one-shot performance consistently on multiple datasets.
arXiv Detail & Related papers (2020-05-08T01:59:23Z)
- AutoSpeech: Neural Architecture Search for Speaker Recognition [108.69505815793028]
We propose the first neural architecture search approach for speaker recognition tasks, named AutoSpeech.
Our algorithm first identifies the optimal operation combination in a neural cell and then derives a CNN model by stacking the neural cell multiple times.
Results demonstrate that the derived CNN architectures significantly outperform current speaker recognition systems based on VGG-M, ResNet-18, and ResNet-34 backbones, while enjoying lower model complexity.
arXiv Detail & Related papers (2020-05-07T02:53:47Z)
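As referenced in the "Wider or Deeper" entry above, the following is a minimal sketch of the spectrogram-ensemble idea, assuming it amounts to averaging class scores from classifiers fed with different time-frequency front-ends (e.g., log-mel and constant-Q spectrograms via librosa). The front-end choices and the placeholder classifier are assumptions for illustration, not the models from that paper.

```python
import numpy as np
import librosa


def log_mel(y, sr):
    """Log-mel spectrogram front-end."""
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128, hop_length=512)
    return librosa.power_to_db(S, ref=np.max)


def log_cqt(y, sr):
    """Constant-Q spectrogram front-end."""
    C = np.abs(librosa.cqt(y=y, sr=sr, hop_length=512))
    return librosa.amplitude_to_db(C, ref=np.max)


def ensemble_predict(y, sr, classifiers):
    """Average class probabilities across per-spectrogram classifiers.

    classifiers: list of (front_end, model) pairs, where model maps a
    spectrogram to a probability vector over the scene classes.
    """
    probs = [model(front_end(y, sr)) for front_end, model in classifiers]
    return np.mean(probs, axis=0)


if __name__ == "__main__":
    sr = 22050
    y = np.random.randn(sr * 10).astype(np.float32)  # stand-in for a 10-second recording

    def dummy_model(spec):
        # Placeholder classifier: a real system would be a trained CNN.
        return np.full(10, 0.1)

    scores = ensemble_predict(y, sr, [(log_mel, dummy_model), (log_cqt, dummy_model)])
    print(scores)
```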