Related papers: Spectral and Rhythm Feature Performance Evaluation for Category and Class Level Audio Classification with Deep Convolutional Neural Networks

Spectral and Rhythm Feature Performance Evaluation for Category and Class Level Audio Classification with Deep Convolutional Neural Networks

URL: http://arxiv.org/abs/2509.07756v2
Date: Fri, 12 Sep 2025 18:16:11 GMT
Title: Spectral and Rhythm Feature Performance Evaluation for Category and Class Level Audio Classification with Deep Convolutional Neural Networks
Authors: Friedrich Wolf-Monheim,
Abstract summary: Deep convolutional neural networks (CNNs) are widely used to classify audio data in many domains like music, speech or environmental sounds.<n>To train a specific CNN various spectral and rhythm features like mel-scaled spectrograms, mel-frequency cepstral coefficients (MFCC) are investigated.<n>The evaluated metrics accuracy, precision, recall and F1 score for multiclass classification clearly show that the mel-scaled spectrograms and the mel-frequency cepstral coefficients perform significantly better.
Score: 0.0
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Next to decision tree and k-nearest neighbours algorithms deep convolutional neural networks (CNNs) are widely used to classify audio data in many domains like music, speech or environmental sounds. To train a specific CNN various spectral and rhythm features like mel-scaled spectrograms, mel-frequency cepstral coefficients (MFCC), cyclic tempograms, short-time Fourier transform (STFT) chromagrams, constant-Q transform (CQT) chromagrams and chroma energy normalized statistics (CENS) chromagrams can be used as digital image input data for the neural network. The performance of these spectral and rhythm features for audio category level as well as audio class level classification is investigated in detail with a deep CNN and the ESC-50 dataset with 2,000 labeled environmental audio recordings using an end-to-end deep learning pipeline. The evaluated metrics accuracy, precision, recall and F1 score for multiclass classification clearly show that the mel-scaled spectrograms and the mel-frequency cepstral coefficients (MFCC) perform significantly better then the other spectral and rhythm features investigated in this research for audio classification tasks using deep CNNs.

Related papers

Neuromorphic Wireless Split Computing with Resonate-and-Fire Neurons [69.73249913506042]
This paper investigates a wireless split computing architecture that employs resonate-and-fire (RF) neurons to process time-domain signals directly.<n>By resonating at tunable frequencies, RF neurons extract time-localized spectral features while maintaining low spiking activity.<n> Experimental results show that the proposed RF-SNN architecture achieves comparable accuracy to conventional LIF-SNNs and ANNs.
arXiv Detail & Related papers (2025-06-24T21:14:59Z)
Spectral and Rhythm Features for Audio Classification with Deep Convolutional Neural Networks [0.0]
Convolutional neural networks (CNNs) are widely used in computer vision. They can be used to represent spectral and rhythm features extracted from digital imagery for the acoustic classification of sounds. Different spectral and rhythm feature representations like mel-scaled spectrograms, mel-frequency cepstral coefficients (MFCCs) are investigated.
arXiv Detail & Related papers (2024-10-09T14:21:59Z)
Transformer-based Sequence Labeling for Audio Classification based on MFCCs [0.0]
This paper proposes a Transformer-encoder-based model for audio classification using MFCCs. The model was benchmarked against the ESC-50, Speech Commands v0.02 and UrbanSound8k datasets and has shown strong performance. The model consisted of a mere 127,544 total parameters, making it light-weight yet highly efficient at the audio classification task.
arXiv Detail & Related papers (2023-04-30T07:25:43Z)
Spectral Cross-Domain Neural Network with Soft-adaptive Threshold Spectral Enhancement [12.837935554250409]
We propose a novel deep learning model named Spectral Cross-domain neural network (SCDNN) It simultaneously reveal the key information embedded in spectral and time domains inside the neural network. The proposed SCDNN is tested with several classification tasks implemented on the public ECG databases textitPTB-XL and textitMIT-BIH.
arXiv Detail & Related papers (2023-01-10T14:23:43Z)
Learning Temporal Resolution in Spectrogram for Audio Classification [40.80903296278466]
This paper proposes a novel method, DiffRes, that enables differentiable temporal resolution modeling for audio classification. Given a spectrogram calculated with a fixed hop size, DiffRes merges non-essential time frames while preserving important frames. Compared with previous methods using the fixed temporal resolution, the DiffRes-based method can achieve the equivalent or better classification accuracy with at least 25% computational cost reduction.
arXiv Detail & Related papers (2022-10-04T16:18:50Z)
Audio Deepfake Detection Based on a Combination of F0 Information and Real Plus Imaginary Spectrogram Features [51.924340387119415]
Experimental results on the ASVspoof 2019 LA dataset show that our proposed system is very effective for the audio deepfake detection task. Our proposed system is very effective for the audio deepfake detection task, achieving an equivalent error rate (EER) of 0.43%, which surpasses almost all systems.
arXiv Detail & Related papers (2022-08-02T02:46:16Z)
A Comparative Study on Approaches to Acoustic Scene Classification using CNNs [0.0]
Different kinds of representations have dramatic effects on the accuracy of the classification. We investigated the spectrograms, MFCCs, and embeddings representations using different CNN networks and autoencoders. We found that the spectrogram representation has the highest classification accuracy while MFCC has the lowest classification accuracy.
arXiv Detail & Related papers (2022-04-26T09:23:29Z)
SpecGrad: Diffusion Probabilistic Model based Neural Vocoder with Adaptive Noise Spectral Shaping [51.698273019061645]
SpecGrad adapts the diffusion noise so that its time-varying spectral envelope becomes close to the conditioning log-mel spectrogram. It is processed in the time-frequency domain to keep the computational cost almost the same as the conventional DDPM-based neural vocoders.
arXiv Detail & Related papers (2022-03-31T02:08:27Z)
Robust Feature Learning on Long-Duration Sounds for Acoustic Scene Classification [54.57150493905063]
Acoustic scene classification (ASC) aims to identify the type of scene (environment) in which a given audio signal is recorded. We propose a robust feature learning (RFL) framework to train the CNN.
arXiv Detail & Related papers (2021-08-11T03:33:05Z)
SpectralFormer: Rethinking Hyperspectral Image Classification with Transformers [91.09957836250209]
Hyperspectral (HS) images are characterized by approximately contiguous spectral information. CNNs have been proven to be a powerful feature extractor in HS image classification. We propose a novel backbone network called ulSpectralFormer for HS image classification.
arXiv Detail & Related papers (2021-07-07T02:59:21Z)
Deep Convolutional and Recurrent Networks for Polyphonic Instrument Classification from Monophonic Raw Audio Waveforms [30.3491261167433]
Sound Event Detection and Audio Classification tasks are traditionally addressed through time-frequency representations of audio signals such as spectrograms. Deep neural networks as efficient feature extractors has enabled the direct use of audio signals for classification purposes. We attempt to recognize musical instruments in polyphonic audio by only feeding their raw waveforms into deep learning models.
arXiv Detail & Related papers (2021-02-13T13:44:46Z)
Multi-Tones' Phase Coding (MTPC) of Interaural Time Difference by Spiking Neural Network [68.43026108936029]
We propose a pure spiking neural network (SNN) based computational model for precise sound localization in the noisy real-world environment. We implement this algorithm in a real-time robotic system with a microphone array. The experiment results show a mean error azimuth of 13 degrees, which surpasses the accuracy of the other biologically plausible neuromorphic approach for sound source localization.
arXiv Detail & Related papers (2020-07-07T08:22:56Z)

This list is automatically generated from the titles and abstracts of the papers in this site.