Low-complexity deep learning frameworks for acoustic scene
classification
- URL: http://arxiv.org/abs/2206.06057v1
- Date: Mon, 13 Jun 2022 11:41:39 GMT
- Title: Low-complexity deep learning frameworks for acoustic scene
classification
- Authors: Lam Pham, Dat Ngo, Anahid Jalali, Alexander Schindler
- Abstract summary: We present low-complexity deep learning frameworks for acoustic scene classification (ASC).
The proposed frameworks can be separated into four main steps: front-end spectrogram extraction, online data augmentation, back-end classification, and late fusion of predicted probabilities.
Our experiments on the DCASE 2022 Task 1 Development dataset fulfill the low-complexity requirement and achieve a best classification accuracy of 60.1%.
- Score: 64.22762153453175
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this report, we present low-complexity deep learning frameworks for
acoustic scene classification (ASC). The proposed frameworks can be separated
into four main steps: front-end spectrogram extraction, online data
augmentation, back-end classification, and late fusion of predicted
probabilities. In particular, we first transform audio recordings into Mel,
Gammatone, and CQT spectrograms. Next, the data augmentation methods of Random
Cropping, SpecAugment, and Mixup are applied to generate augmented
spectrograms before they are fed into deep-learning-based classifiers.
Finally, to achieve the best performance, we fuse the probabilities obtained
from three individual classifiers, each independently trained on one of the
three spectrogram types. Our experiments, conducted on the DCASE 2022 Task 1
Development dataset, fulfill the low-complexity requirement and achieve a best
classification accuracy of 60.1%, improving on the DCASE baseline by 17.2%.
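The four steps above map naturally onto a few small components. Below is a minimal Python sketch using librosa for the front end; all hyperparameters (mel bands, CQT bins, mask sizes, the Mixup alpha) are illustrative assumptions rather than values from the paper, and the Gammatone branch is only noted, since librosa has no gammatone filterbank.

```python
import numpy as np
import librosa

def extract_spectrograms(path, sr=44100):
    """Step 1 (front end): Mel and CQT spectrograms via librosa.
    The paper also uses Gammatone spectrograms, which would require a
    separate toolbox (librosa provides no gammatone filterbank)."""
    y, sr = librosa.load(path, sr=sr, mono=True)
    mel = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128))
    cqt = librosa.amplitude_to_db(np.abs(librosa.cqt(y=y, sr=sr, n_bins=96)))
    return mel, cqt

def random_crop(spec, width=128):
    """Step 2a: Random Cropping along the time axis."""
    t0 = np.random.randint(0, max(1, spec.shape[1] - width))
    return spec[:, t0:t0 + width]

def spec_augment(spec, max_f=12, max_t=12):
    """Step 2b: simplified SpecAugment (one frequency and one time mask)."""
    out = spec.copy()
    f0 = np.random.randint(0, max(1, out.shape[0] - max_f))
    t0 = np.random.randint(0, max(1, out.shape[1] - max_t))
    out[f0:f0 + np.random.randint(1, max_f + 1), :] = 0.0
    out[:, t0:t0 + np.random.randint(1, max_t + 1)] = 0.0
    return out

def mixup(x1, y1, x2, y2, alpha=0.4):
    """Step 2c: Mixup, a convex combination of two samples and labels."""
    lam = np.random.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

def late_fusion(prob_list):
    """Step 4: average the predicted probabilities of the classifiers
    trained on the different spectrogram types."""
    return np.mean(np.stack(prob_list), axis=0)
```

Step 3 (the back-end classifiers) is whatever low-complexity networks are trained per spectrogram type; late_fusion then averages their per-class probabilities, e.g. late_fusion([p_mel, p_gam, p_cqt]).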
Related papers
- Machine Learning Framework for Audio-Based Content Evaluation using MFCC, Chroma, Spectral Contrast, and Temporal Feature Engineering [0.0]
We construct a dataset containing audio samples from music covers on YouTube along with the audio of the original song, and sentiment scores derived from user comments.
Our approach involves extensive pre-processing, segmenting audio signals into 30-second windows, and extracting high-dimensional feature representations.
We train regression models to predict sentiment scores on a 0-100 scale, achieving root mean square error (RMSE) values of 3.420, 5.482, 2.783, and 4.212 across the respective models.
arXiv Detail & Related papers (2024-10-31T20:26:26Z)
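As a hedged illustration of the feature engineering this summary describes, the sketch below extracts MFCC, chroma, spectral contrast, and two simple temporal features per 30-second window and pools them into a fixed-length vector for the regressors; the 30-second window comes from the summary, while the feature counts and mean/std pooling are assumptions.

```python
import numpy as np
import librosa

def window_features(y, sr):
    """MFCC, chroma, spectral contrast, plus zero-crossing rate and RMS
    energy as temporal features, pooled by mean and std over frames."""
    feats = np.vstack([
        librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20),
        librosa.feature.chroma_stft(y=y, sr=sr),
        librosa.feature.spectral_contrast(y=y, sr=sr),
        librosa.feature.zero_crossing_rate(y),
        librosa.feature.rms(y=y),
    ])
    return np.concatenate([feats.mean(axis=1), feats.std(axis=1)])

def segment_and_featurize(path, win_s=30.0):
    """Segment the recording into 30-second windows and featurize each;
    each row can then be fed to a regression model for a sentiment score."""
    y, sr = librosa.load(path, sr=None, mono=True)
    hop = int(win_s * sr)
    return np.array([window_features(y[i:i + hop], sr)
                     for i in range(0, len(y) - hop + 1, hop)])
```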
- MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that promotes the representation of each modality by fusing them at different levels of the audio/visual encoders.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
arXiv Detail & Related papers (2024-01-07T08:59:32Z)
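The fusion mechanism described here can be sketched as a block in which each modality attends to the other and keeps a residual path; the PyTorch sketch below is a single such block with invented dimensions, not the paper's exact architecture, which applies fusion at several encoder levels.

```python
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """One cross-attention fusion block: audio queries visual features and
    vice versa; the attended context is added back residually so each
    modality's representation is promoted by the other."""
    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        self.a_from_v = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.v_from_a = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, audio, visual):
        # audio: (batch, T_audio, dim); visual: (batch, T_visual, dim)
        a_ctx, _ = self.a_from_v(audio, visual, visual)
        v_ctx, _ = self.v_from_a(visual, audio, audio)
        return self.norm_a(audio + a_ctx), self.norm_v(visual + v_ctx)
```

Stacking one such block after each of several audio/visual encoder layers gives the "multi-layer" fusion the title refers to.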
- Exploring Meta Information for Audio-based Zero-shot Bird Classification [113.17261694996051]
This study investigates how meta-information can improve zero-shot audio classification.
We use bird species as an example case study due to the availability of rich and diverse meta-data.
arXiv Detail & Related papers (2023-09-15T13:50:16Z)
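Zero-shot audio classification of this kind typically matches an audio embedding against class embeddings built from the meta-information (species descriptions, attributes, taxonomy); the small sketch below shows only that matching step, with placeholder embeddings rather than the study's actual encoders.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def zero_shot_predict(audio_emb, class_embs):
    """Pick the class whose meta-information embedding is most similar to
    the audio embedding; unseen species need only meta-data, no audio."""
    return max(class_embs, key=lambda c: cosine(audio_emb, class_embs[c]))

# class_embs: {"species_name": embedding of its meta description, ...}
# audio_emb:  embedding of the test recording from some audio encoder
```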
- Improving Primate Sounds Classification using Binary Presorting for Deep Learning [6.044912425856236]
In this work, we introduce a generalized approach that first relabels subsegments of Mel spectrogram representations.
For both the binary pre-sorting and the classification, we make use of convolutional neural networks (CNN) and various data-augmentation techniques.
We showcase the results of this approach on the challenging ComParE 2021 dataset, with the task of classifying between different primate species sounds.
arXiv Detail & Related papers (2023-06-28T09:35:09Z)
- Improved Zero-Shot Audio Tagging & Classification with Patchout Spectrogram Transformers [7.817685358710508]
Zero-Shot (ZS) learning overcomes the restriction to classes seen during training by predicting classes based on adaptable class descriptions.
This study sets out to investigate the effectiveness of self-attention-based audio embedding architectures for ZS learning.
arXiv Detail & Related papers (2022-08-24T09:48:22Z)
- Improving Post-Processing of Audio Event Detectors Using Reinforcement Learning [5.758073912084364]
We employ reinforcement learning to jointly discover the optimal parameters for various stages of a post-processing stack.
We find that we can improve the audio event-based macro F1-score by 4-5% compared to using the same post-processing stack with manually tuned parameters.
arXiv Detail & Related papers (2022-08-19T08:00:26Z)
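As a toy stand-in for the reinforcement-learning search this paper describes, the sketch below runs an epsilon-greedy multi-armed bandit over a discrete grid of post-processing settings, using validation macro F1 as the reward; the real method and its parameter stack are more elaborate, and the grid here is invented.

```python
import numpy as np

def bandit_tune(settings, evaluate_f1, steps=200, eps=0.2, seed=0):
    """Epsilon-greedy bandit: each arm is one post-processing setting
    (e.g., a (threshold, median_filter_length) pair) and the reward is
    the macro F1 it yields on a (possibly subsampled) validation set."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(len(settings))
    values = np.zeros(len(settings))
    for _ in range(steps):
        arm = (rng.integers(len(settings)) if rng.random() < eps
               else int(np.argmax(values)))
        reward = evaluate_f1(settings[arm])
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # running mean
    return settings[int(np.argmax(values))]

# settings = [(thr, k) for thr in (0.3, 0.5, 0.7) for k in (1, 5, 11)]
# best = bandit_tune(settings, my_validation_f1)
```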
- Scenario Aware Speech Recognition: Advancements for Apollo Fearless Steps & CHiME-4 Corpora [70.46867541361982]
We consider a general non-semantic speech representation, called TRILL, which is trained with a self-supervised criterion based on triplet loss.
We observe +5.42% and +3.18% relative WER improvement for the development and evaluation sets of Fearless Steps.
arXiv Detail & Related papers (2021-09-23T00:43:32Z)
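TRILL's self-supervised criterion is a triplet loss that treats temporally nearby audio segments as positives and distant ones as negatives; a minimal version of that loss is sketched below (the margin value is an assumption).

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge triplet loss on embeddings: the anchor segment should be
    closer to a temporally nearby (positive) segment than to a distant
    (negative) one by at least `margin`."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)
```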
- Generalizing Face Forgery Detection with High-frequency Features [63.33397573649408]
Current CNN-based detectors tend to overfit to method-specific color textures and thus fail to generalize.
We propose to utilize the high-frequency noises for face forgery detection.
The first is the multi-scale high-frequency feature extraction module that extracts high-frequency noises at multiple scales.
The second is the residual-guided spatial attention module that guides the low-level RGB feature extractor to concentrate more on forgery traces from a new perspective.
arXiv Detail & Related papers (2021-03-23T08:19:21Z)
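One standard way to expose the high-frequency noise such a detector consumes is to subtract a low-pass (Gaussian-blurred) copy of the image from the original; the sketch below is that simplified, single-scale residual for a grayscale array, whereas the paper's module extracts noises at multiple scales and is more involved.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def high_freq_residual(img, sigma=1.0):
    """High-frequency residual of a 2-D grayscale image: original minus
    its Gaussian-blurred (low-pass) version. Stacking residuals for
    several sigmas approximates a multi-scale extraction."""
    img = img.astype(np.float32)
    return img - gaussian_filter(img, sigma=sigma)

# multi-scale: np.stack([high_freq_residual(img, s) for s in (0.5, 1.0, 2.0)])
```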
- SoundCLR: Contrastive Learning of Representations For Improved Environmental Sound Classification [0.6767885381740952]
SoundCLR is a supervised contrastive learning method for effective environmental sound classification with state-of-the-art performance.
Due to the comparatively small sizes of the available environmental sound datasets, we propose and exploit a transfer learning and strong data augmentation pipeline.
Our experiments show that our masking based augmentation technique on the log-mel spectrograms can significantly improve the recognition performance.
arXiv Detail & Related papers (2021-03-02T18:42:45Z)
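The supervised contrastive objective behind SoundCLR pulls together embeddings of clips that share a class label and pushes apart the rest; a compact PyTorch sketch of that loss is given below (the temperature is an assumed value).

```python
import torch
import torch.nn.functional as F

def sup_con_loss(embeddings, labels, tau=0.1):
    """Supervised contrastive loss: for each anchor, every other sample
    in the batch with the same label is a positive; the rest are negatives."""
    z = F.normalize(embeddings, dim=1)                    # (B, D)
    sim = z @ z.t() / tau                                 # (B, B) similarities
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float('-inf'))       # exclude self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos = (labels[:, None] == labels[None, :]) & ~self_mask
    per_anchor = -(log_prob.masked_fill(~pos, 0.0)).sum(1) / pos.sum(1).clamp(min=1)
    return per_anchor.mean()
```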
- Device-Robust Acoustic Scene Classification Based on Two-Stage Categorization and Data Augmentation [63.98724740606457]
We present a joint effort of four groups, namely GT, USTC, Tencent, and UKE, to tackle Task 1 - Acoustic Scene Classification (ASC) in the DCASE 2020 Challenge.
Task 1a focuses on ASC of audio signals recorded with multiple (real and simulated) devices into ten different fine-grained classes.
Task 1b concerns the classification of data into three higher-level classes using low-complexity solutions.
arXiv Detail & Related papers (2020-07-16T15:07:14Z)