Acoustic Scene Classification Using Bilinear Pooling on Time-liked and
Frequency-liked Convolution Neural Network
- URL: http://arxiv.org/abs/2002.07065v1
- Date: Fri, 14 Feb 2020 04:06:32 GMT
- Title: Acoustic Scene Classification Using Bilinear Pooling on Time-liked and
Frequency-liked Convolution Neural Network
- Authors: Xing Yong Kek, Cheng Siong Chin, Ye Li
- Abstract summary: This paper explores the use of harmonic and percussive source separation (HPSS) to split the audio into harmonic audio and percussive audio.
The deep features extracted from these two CNNs are then combined using bilinear pooling.
The model was evaluated on the DCASE 2019 Subtask 1a dataset and scored an average of 65% on the development dataset.
- Score: 4.131608702779222
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The prevailing approach to the Acoustic Scene Classification (ASC) task can be described in two steps: preprocessing the audio waveform into a log-mel spectrogram and then using it as the input representation for a Convolutional Neural Network (CNN). This paradigm shift followed DCASE 2016, where such a framework achieved state-of-the-art results in ASC: an accuracy of 64.5% on the ESC-50 dataset, a 20.5% improvement over the baseline model, and accuracies of 90.0% (development) and 86.2% (evaluation) on the DCASE 2016 dataset, improvements of 6.4% and 9% over the baseline system. In this paper, we explore the use of harmonic and percussive source separation (HPSS), a technique that has gained popularity in music information retrieval (MIR), to split the audio into harmonic and percussive components. Although HPSS has previously been used as an input representation for CNN models in ASC, this paper further investigates leveraging the separated harmonic and percussive components by curating two CNNs that model each component in its natural form: one specialized in extracting deep features in a time-biased domain and the other in a frequency-biased domain. The deep features extracted from these two CNNs are then combined using bilinear pooling, yielding a two-stream time and frequency CNN architecture for classifying acoustic scenes. The model was evaluated on the DCASE 2019 Subtask 1a dataset and scored an average of 65% on the development dataset and on the Kaggle private and public leaderboards.
Related papers
- Machine Learning Framework for Audio-Based Content Evaluation using MFCC, Chroma, Spectral Contrast, and Temporal Feature Engineering [0.0]
We construct a dataset containing audio samples from music covers on YouTube along with the audio of the original song, and sentiment scores derived from user comments.
Our approach involves extensive pre-processing, segmenting audio signals into 30-second windows, and extracting high-dimensional feature representations.
We train regression models to predict sentiment scores on a 0-100 scale, achieving root mean square error (RMSE) values of 3.420, 5.482, 2.783, and 4.212, respectively.
arXiv Detail & Related papers (2024-10-31T20:26:26Z) - Stuttering Detection Using Speaker Representations and Self-supervised
Contextual Embeddings [7.42741711946564]
We introduce the application of speech embeddings extracted from pre-trained deep learning models trained on large audio datasets for different tasks.
In comparison to the standard SD systems trained only on the limited SEP-28k dataset, we obtain a relative improvement of 12.08%, 28.71%, 37.9% in terms of unweighted average recall (UAR) over the baselines.
arXiv Detail & Related papers (2023-06-01T14:00:47Z) - Streaming Audio-Visual Speech Recognition with Alignment Regularization [69.30185151873707]
We propose a streaming AV-ASR system based on a hybrid connectionist temporal classification (CTC)/attention neural network architecture.
The proposed AV-ASR model achieves WERs of 2.0% and 2.6% on the Lip Reading Sentences 3 dataset in an offline and online setup.
arXiv Detail & Related papers (2022-11-03T20:20:47Z) - Low-complexity deep learning frameworks for acoustic scene
classification [64.22762153453175]
We present low-complexity deep learning frameworks for acoustic scene classification (ASC)
The proposed frameworks can be separated into four main steps: Front-end spectrogram extraction, online data augmentation, back-end classification, and late fusion of predicted probabilities.
Our experiments conducted on the DCASE 2022 Task 1 Development dataset have fulfilled the low-complexity requirement and achieved a best classification accuracy of 60.1%.
arXiv Detail & Related papers (2022-06-13T11:41:39Z) - AudioCLIP: Extending CLIP to Image, Text and Audio [6.585049648605185]
We present an extension of the CLIP model that handles audio in addition to text and images.
Our proposed model incorporates the ESResNeXt audio-model into the CLIP framework using the AudioSet dataset.
AudioCLIP achieves new state-of-the-art results in the Environmental Sound Classification (ESC) task.
arXiv Detail & Related papers (2021-06-24T14:16:38Z) - Voice2Series: Reprogramming Acoustic Models for Time Series
Classification [65.94154001167608]
Voice2Series is a novel end-to-end approach that reprograms acoustic models for time series classification.
We show that V2S either outperforms or is tied with state-of-the-art methods on 20 tasks, and improves their average accuracy by 1.84%.
arXiv Detail & Related papers (2021-06-17T07:59:15Z) - AST: Audio Spectrogram Transformer [21.46018186487818]
We introduce the Audio Spectrogram Transformer (AST), the first convolution-free, purely attention-based model for audio classification.
AST achieves new state-of-the-art results of 0.485 mAP on AudioSet, 95.6% accuracy on ESC-50, and 98.1% accuracy on Speech Commands V2.
arXiv Detail & Related papers (2021-04-05T05:26:29Z) - A Two-Stage Approach to Device-Robust Acoustic Scene Classification [63.98724740606457]
A two-stage system based on fully convolutional neural networks (CNNs) is proposed to improve device robustness.
Our results show that the proposed ASC system attains a state-of-the-art accuracy on the development set.
Neural saliency analysis with class activation mapping gives new insights on the patterns learnt by our models.
arXiv Detail & Related papers (2020-11-03T03:27:18Z) - Fast accuracy estimation of deep learning based multi-class musical
source separation [79.10962538141445]
We propose a method to evaluate the separability of instruments in any dataset without training and tuning a neural network.
Based on the oracle principle with an ideal ratio mask, our approach is an excellent proxy to estimate the separation performances of state-of-the-art deep learning approaches.
arXiv Detail & Related papers (2020-10-19T13:05:08Z) - Device-Robust Acoustic Scene Classification Based on Two-Stage
Categorization and Data Augmentation [63.98724740606457]
We present a joint effort of four groups, namely GT, USTC, Tencent, and UKE, to tackle Task 1 - Acoustic Scene Classification (ASC) in the DCASE 2020 Challenge.
Task 1a focuses on ASC of audio signals recorded with multiple (real and simulated) devices into ten different fine-grained classes.
Task 1b concerns the classification of data into three higher-level classes using low-complexity solutions.
arXiv Detail & Related papers (2020-07-16T15:07:14Z)