Learning Temporal Resolution in Spectrogram for Audio Classification
- URL: http://arxiv.org/abs/2210.01719v3
- Date: Fri, 12 Jan 2024 18:35:17 GMT
- Title: Learning Temporal Resolution in Spectrogram for Audio Classification
- Authors: Haohe Liu, Xubo Liu, Qiuqiang Kong, Wenwu Wang, Mark D. Plumbley
- Abstract summary: This paper proposes a novel method, DiffRes, that enables differentiable temporal resolution modeling for audio classification.
Given a spectrogram calculated with a fixed hop size, DiffRes merges non-essential time frames while preserving important frames.
Compared with previous methods using the fixed temporal resolution, the DiffRes-based method can achieve the equivalent or better classification accuracy with at least 25% computational cost reduction.
- Score: 40.80903296278466
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The audio spectrogram is a time-frequency representation that has been widely
used for audio classification. One of the key attributes of the audio
spectrogram is the temporal resolution, which depends on the hop size used in
the Short-Time Fourier Transform (STFT). Previous works generally assume the
hop size should be a constant value (e.g., 10 ms). However, a fixed temporal
resolution is not always optimal for different types of sound. The temporal
resolution affects not only classification accuracy but also computational
cost. This paper proposes a novel method, DiffRes, that enables differentiable
temporal resolution modeling for audio classification. Given a spectrogram
calculated with a fixed hop size, DiffRes merges non-essential time frames
while preserving important frames. DiffRes acts as a "drop-in" module between
an audio spectrogram and a classifier and can be jointly optimized with the
classification task. We evaluate DiffRes on five audio classification tasks,
using mel-spectrograms as the acoustic features, followed by off-the-shelf
classifier backbones. Compared with previous methods using the fixed temporal
resolution, the DiffRes-based method can achieve the equivalent or better
classification accuracy with at least 25% computational cost reduction. We
further show that DiffRes can improve classification accuracy by increasing the
temporal resolution of input acoustic features, without adding to the
computational cost.
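As a rough illustration of the two ideas in the abstract (hop size determines the number of time frames, and frames can be merged to shorten the input), here is a minimal NumPy sketch. It is not the DiffRes algorithm: DiffRes learns which frames matter jointly with the classifier, whereas this toy uses a fixed, hand-picked importance score (frame-to-frame spectral change). The function names `num_frames`, `spectral_flux`, and `merge_frames` are hypothetical and chosen only for this example.

```python
import numpy as np


def num_frames(duration_s: float, hop_ms: float) -> int:
    """Approximate number of STFT frames for a clip (padding ignored)."""
    return int(round(duration_s * 1000 / hop_ms))


def spectral_flux(spec: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Hand-crafted importance score: absolute change from the previous frame.

    Assumption for illustration only; DiffRes learns its own frame importance.
    """
    diff = np.diff(spec, axis=0, prepend=spec[:1])
    return np.abs(diff).sum(axis=1) + eps  # eps keeps every weight positive


def merge_frames(spec: np.ndarray, target_len: int) -> np.ndarray:
    """Shorten a (time, freq) spectrogram to target_len frames.

    Neighbouring frames are grouped into near-equal chunks, and each chunk is
    replaced by its importance-weighted average, so frames with large spectral
    change dominate the merged frame.
    """
    if not 0 < target_len <= spec.shape[0]:
        raise ValueError("target_len must be between 1 and the number of frames")
    weights = spectral_flux(spec)
    groups = np.array_split(np.arange(spec.shape[0]), target_len)
    merged = [np.average(spec[idx], axis=0, weights=weights[idx]) for idx in groups]
    return np.stack(merged)


rng = np.random.default_rng(0)
frames = num_frames(duration_s=10.0, hop_ms=10.0)  # 10 s clip, 10 ms hop
spec = rng.random((frames, 64))                    # stand-in for a mel-spectrogram
reduced = merge_frames(spec, target_len=int(frames * 0.75))
print(spec.shape, reduced.shape)
```

With a 10 ms hop, a 10-second clip gives about 1000 frames, and the sketch shortens it to 750 frames. How much classifier compute this saves depends on the backbone, so the numbers here should not be read as reproducing the paper's reported reduction.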
Related papers
- Representation-Regularized Convolutional Audio Transformer for Audio Understanding [53.092757178419355]
Bootstrapping representations from scratch is computationally expensive, often requiring extensive training to converge.
We propose the Convolutional Audio Transformer (CAT), a unified framework designed to address these challenges.
arXiv Detail & Related papers (2026-01-29T12:16:19Z) - Machine Learning Framework for Audio-Based Content Evaluation using MFCC, Chroma, Spectral Contrast, and Temporal Feature Engineering [0.0]
We construct a dataset containing audio samples from music covers on YouTube along with the audio of the original song, and sentiment scores derived from user comments.
Our approach involves extensive pre-processing, segmenting audio signals into 30-second windows, and extracting high-dimensional feature representations.
We train regression models to predict sentiment scores on a 0-100 scale, achieving root mean square error (RMSE) values of 3.420, 5.482, 2.783, and 4.212, respectively.
arXiv Detail & Related papers (2024-10-31T20:26:26Z) - Multi-View Frequency-Attention Alternative to CNN Frontends for
Automatic Speech Recognition [12.980843126905203]
We show that global attention over frequencies is beneficial over local convolution.
We obtain a 2.4% relative word error rate reduction on a production scale replacing its convolutional neural network transducer.
arXiv Detail & Related papers (2023-06-12T08:37:36Z) - High Fidelity Neural Audio Compression [92.4812002532009]
We introduce a state-of-the-art real-time, high-fidelity audio codec leveraging neural networks.
It consists of a streaming encoder-decoder architecture with a quantized latent space trained in an end-to-end fashion.
We simplify and speed up the training by using a single multiscale spectrogram adversary.
arXiv Detail & Related papers (2022-10-24T17:52:02Z) - Play It Back: Iterative Attention for Audio Recognition [104.628661890361]
A key function of auditory cognition is the association of characteristic sounds with their corresponding semantics over time.
We propose an end-to-end attention-based architecture that through selective repetition attends over the most discriminative sounds.
We show that our method can consistently achieve state-of-the-art performance across three audio-classification benchmarks.
arXiv Detail & Related papers (2022-10-20T15:03:22Z) - Simple Pooling Front-ends For Efficient Audio Classification [56.59107110017436]
We show that eliminating the temporal redundancy in the input audio features could be an effective approach for efficient audio classification.
We propose a family of simple pooling front-ends (SimPFs) which use simple non-parametric pooling operations to reduce the redundant information.
SimPFs can reduce the number of floating point operations by more than half for off-the-shelf audio neural networks.
arXiv Detail & Related papers (2022-10-03T14:00:41Z) - Robust Feature Learning on Long-Duration Sounds for Acoustic Scene Classification [54.57150493905063]
Acoustic scene classification (ASC) aims to identify the type of scene (environment) in which a given audio signal is recorded.
We propose a robust feature learning (RFL) framework to train the CNN.
arXiv Detail & Related papers (2021-08-11T03:33:05Z) - Switching Variational Auto-Encoders for Noise-Agnostic Audio-visual Speech Enhancement [26.596930749375474]
We introduce the use of a latent sequential variable with Markovian dependencies to switch between different VAE architectures through time.
We derive the corresponding variational expectation-maximization algorithm to estimate the parameters of the model and enhance the speech signal.
arXiv Detail & Related papers (2021-02-08T11:45:02Z) - Real Time Speech Enhancement in the Waveform Domain [99.02180506016721]
We present a causal speech enhancement model working on the raw waveform that runs in real-time on a laptop CPU.
The proposed model is based on an encoder-decoder architecture with skip-connections.
It is capable of removing various kinds of background noise including stationary and non-stationary noises.
arXiv Detail & Related papers (2020-06-23T09:19:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.