1-D CNN based Acoustic Scene Classification via Reducing Layer-wise
Dimensionality
- URL: http://arxiv.org/abs/2204.00555v1
- Date: Thu, 31 Mar 2022 02:00:31 GMT
- Title: 1-D CNN based Acoustic Scene Classification via Reducing Layer-wise
Dimensionality
- Authors: Arshdeep Singh
- Abstract summary: This paper presents an alternative representation framework to the commonly used time-frequency representation for acoustic scene classification (ASC).
A raw audio signal is represented through the various intermediate layers of a pre-trained convolutional neural network (CNN).
The proposed framework outperforms the time-frequency representation based methods.
- Score: 2.5382095320488665
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents an alternative representation framework to the
commonly used time-frequency representation for acoustic scene classification
(ASC). A raw audio signal is represented through the various intermediate
layers of a pre-trained convolutional neural network (CNN). The study assumes
that the representations obtained from the intermediate layers are
intrinsically low-dimensional. To obtain low-dimensional embeddings, principal
component analysis is performed, and the analysis shows that only a few
principal components are significant. However, the appropriate number of
significant components is not known. To address this, an automatic dictionary
learning framework is utilized that approximates the underlying subspace.
Further, the low-dimensional embeddings are aggregated in a late-fusion manner
in an ensemble framework to incorporate hierarchical information learned at
various intermediate layers. The experimental evaluation is performed on the
publicly available DCASE 2017 and 2018 ASC datasets using a pre-trained 1-D
CNN, SoundNet. Empirically, it is observed that deeper layers show a higher
compression ratio than the others. At a 70% compression ratio across different
datasets, the performance is
similar to that obtained without performing any dimensionality reduction. The
proposed framework outperforms the time-frequency representation based methods.
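The pipeline described in the abstract (reduce each layer's representation with principal component analysis, then combine per-layer predictions by late fusion) can be illustrated with a minimal sketch. This is not the author's implementation. The function names `pca_reduce` and `late_fusion`, the `variance_kept` threshold, and the definition of compression ratio as `1 - k / d` are assumptions made for illustration. In particular, a cumulative explained-variance threshold stands in for the paper's dictionary-learning step for choosing the number of components, and averaging class probabilities is only one common late-fusion rule. The data is synthetic.

```python
import numpy as np


def pca_reduce(features, variance_kept=0.95):
    """Project rows of `features` onto their top principal components.

    Keeps the smallest number of components whose cumulative explained
    variance reaches `variance_kept`. This threshold is a hypothetical
    stand-in for the dictionary-learning step named in the abstract.
    """
    centered = features - features.mean(axis=0)
    # SVD of the centered data: rows of vt are the principal directions.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    explained = singular_values ** 2
    cumulative = np.cumsum(explained) / explained.sum()
    # First index whose cumulative variance reaches the threshold, plus one.
    k = int(np.searchsorted(cumulative, variance_kept)) + 1
    k = min(k, vt.shape[0])
    embeddings = centered @ vt[:k].T
    # Assumed definition: fraction of the original dimensions discarded.
    compression = 1.0 - k / features.shape[1]
    return embeddings, k, compression


def late_fusion(prob_per_layer):
    """Average per-layer class-probability matrices and pick the top class."""
    fused = np.mean(np.stack(prob_per_layer, axis=0), axis=0)
    return fused.argmax(axis=1)


# Synthetic "layer output": 200 clips, 64 dimensions, 5 intrinsic dimensions.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 5))
mixing = rng.normal(size=(5, 64))
layer_features = latent @ mixing + 0.01 * rng.normal(size=(200, 64))

embeddings, k, compression = pca_reduce(layer_features)
print(f"kept {k} of 64 dimensions, compression ratio {compression:.2f}")

# Two hypothetical per-layer classifiers voting on two clips and two classes.
probs_layer_a = np.array([[0.6, 0.4], [0.2, 0.8]])
probs_layer_b = np.array([[0.3, 0.7], [0.1, 0.9]])
print(late_fusion([probs_layer_a, probs_layer_b]))
```

In a real setting, `layer_features` would be replaced by activations extracted from one intermediate layer of the pre-trained network, with one reduction and one classifier per layer before fusion.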
Related papers
- On Layer-wise Representation Similarity: Application for Multi-Exit Models with a Single Classifier [20.17288970927518]
We study the similarity of representations between the hidden layers of individual transformers.
We propose an aligned training approach to enhance the similarity between internal representations.
arXiv Detail & Related papers (2024-06-20T16:41:09Z)
- Large-Margin Representation Learning for Texture Classification [67.94823375350433]
This paper presents a novel approach combining convolutional layers (CLs) and large-margin metric learning for training supervised models on small datasets for texture classification.
The experimental results on texture and histopathologic image datasets have shown that the proposed approach achieves competitive accuracy with lower computational cost and faster convergence when compared to equivalent CNNs.
arXiv Detail & Related papers (2022-06-17T04:07:45Z)
- Insights on Neural Representations for End-to-End Speech Recognition [28.833851817220616]
End-to-end automatic speech recognition (ASR) models aim to learn a generalised speech representation.
Correlation analysis techniques for investigating network similarities have not previously been explored for End-to-End ASR models.
This paper analyses and explores the internal dynamics between layers during training with CNN, LSTM and Transformer based approaches.
arXiv Detail & Related papers (2022-05-19T10:19:32Z)
- Deep Neural Decision Forest for Acoustic Scene Classification [45.886356124352226]
Acoustic scene classification (ASC) aims to classify an audio clip based on the characteristic of the recording environment.
We propose a novel approach for ASC using deep neural decision forest (DNDF).
arXiv Detail & Related papers (2022-03-07T14:39:42Z)
- No Fear of Heterogeneity: Classifier Calibration for Federated Learning with Non-IID Data [78.69828864672978]
A central challenge in training classification models in the real-world federated system is learning with non-IID data.
We propose a novel and simple algorithm called Classifier Calibration with Virtual Representations (CCVR), which adjusts the classifier using virtual representations sampled from an approximated Gaussian mixture model.
Experimental results demonstrate that CCVR achieves state-of-the-art performance on popular federated learning benchmarks including CIFAR-10, CIFAR-100, and CINIC-10.
arXiv Detail & Related papers (2021-06-09T12:02:29Z)
- PredRNN: A Recurrent Neural Network for Spatiotemporal Predictive Learning [109.84770951839289]
We present PredRNN, a new recurrent network for learning visual dynamics from historical context.
We show that our approach obtains highly competitive results on three standard datasets.
arXiv Detail & Related papers (2021-03-17T08:28:30Z)
- Temporal Bilinear Encoding Network of Audio-Visual Features at Low Sampling Rates [7.1273332508471725]
This paper aims to exploit audio-visual information in video classification with a 1 frame per second sampling rate.
We propose Temporal Bilinear Encoding Networks (TBEN) for encoding both audio and visual long-range temporal information.
arXiv Detail & Related papers (2020-12-18T14:59:34Z)
- Fast accuracy estimation of deep learning based multi-class musical source separation [79.10962538141445]
We propose a method to evaluate the separability of instruments in any dataset without training and tuning a neural network.
Based on the oracle principle with an ideal ratio mask, our approach is an excellent proxy to estimate the separation performances of state-of-the-art deep learning approaches.
arXiv Detail & Related papers (2020-10-19T13:05:08Z)
- Capturing scattered discriminative information using a deep architecture in acoustic scene classification [49.86640645460706]
In this study, we investigate various methods to capture discriminative information and simultaneously mitigate the overfitting problem.
We adopt a max feature map method to replace conventional non-linear activations in a deep neural network.
Two data augmentation methods and two deep architecture modules are further explored to reduce overfitting and sustain the system's discriminative power.
arXiv Detail & Related papers (2020-07-09T08:32:06Z)
- MetricUNet: Synergistic Image- and Voxel-Level Learning for Precise CT Prostate Segmentation via Online Sampling [66.01558025094333]
We propose a two-stage framework, with the first stage to quickly localize the prostate region and the second stage to precisely segment the prostate.
We introduce a novel online metric learning module through voxel-wise sampling in the multi-task network.
Our method can effectively learn more representative voxel-level features compared with the conventional learning methods with cross-entropy or Dice loss.
arXiv Detail & Related papers (2020-05-15T10:37:02Z)
- On the Texture Bias for Few-Shot CNN Segmentation [21.349705243254423]
Despite the initial belief that Convolutional Neural Networks (CNNs) are driven by shapes to perform visual recognition tasks, recent evidence suggests that texture bias in CNNs provides higher performing models when learning on large labeled training datasets.
We propose a novel architecture that integrates a set of Difference of Gaussians (DoG) to attenuate high-frequency local components in the feature space.
arXiv Detail & Related papers (2020-03-09T11:55:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.