Related papers: Learning Temporal Resolution in Spectrogram for Audio Classification

Learning Temporal Resolution in Spectrogram for Audio Classification

URL: http://arxiv.org/abs/2210.01719v3
Date: Fri, 12 Jan 2024 18:35:17 GMT
Title: Learning Temporal Resolution in Spectrogram for Audio Classification
Authors: Haohe Liu, Xubo Liu, Qiuqiang Kong, Wenwu Wang, Mark D. Plumbley
Abstract summary: This paper proposes a novel method, DiffRes, that enables differentiable temporal resolution modeling for audio classification. Given a spectrogram calculated with a fixed hop size, DiffRes merges non-essential time frames while preserving important frames. Compared with previous methods using the fixed temporal resolution, the DiffRes-based method can achieve the equivalent or better classification accuracy with at least 25% computational cost reduction.
Score: 40.80903296278466
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The audio spectrogram is a time-frequency representation that has been widely used for audio classification. One of the key attributes of the audio spectrogram is the temporal resolution, which depends on the hop size used in the Short-Time Fourier Transform (STFT). Previous works generally assume the hop size should be a constant value (e.g., 10 ms). However, a fixed temporal resolution is not always optimal for different types of sound. The temporal resolution affects not only classification accuracy but also computational cost. This paper proposes a novel method, DiffRes, that enables differentiable temporal resolution modeling for audio classification. Given a spectrogram calculated with a fixed hop size, DiffRes merges non-essential time frames while preserving important frames. DiffRes acts as a "drop-in" module between an audio spectrogram and a classifier and can be jointly optimized with the classification task. We evaluate DiffRes on five audio classification tasks, using mel-spectrograms as the acoustic features, followed by off-the-shelf classifier backbones. Compared with previous methods using the fixed temporal resolution, the DiffRes-based method can achieve the equivalent or better classification accuracy with at least 25% computational cost reduction. We further show that DiffRes can improve classification accuracy by increasing the temporal resolution of input acoustic features, without adding to the computational cost.

Related papers

Representation-Regularized Convolutional Audio Transformer for Audio Understanding [53.092757178419355]
bootstrapping representations from scratch is computationally expensive, often requiring extensive training to converge.<n>We propose the Convolutional Audio Transformer (CAT), a unified framework designed to address these challenges.
arXiv Detail & Related papers (2026-01-29T12:16:19Z)
Machine Learning Framework for Audio-Based Content Evaluation using MFCC, Chroma, Spectral Contrast, and Temporal Feature Engineering [0.0]
We construct a dataset containing audio samples from music covers on YouTube along with the audio of the original song, and sentiment scores derived from user comments. Our approach involves extensive pre-processing, segmenting audio signals into 30-second windows, and extracting high-dimensional feature representations. We train regression models to predict sentiment scores on a 0-100 scale, achieving root mean square error (RMSE) values of 3.420, 5.482, 2.783, and 4.212, respectively.
arXiv Detail & Related papers (2024-10-31T20:26:26Z)
Multi-View Frequency-Attention Alternative to CNN Frontends for Automatic Speech Recognition [12.980843126905203]
We show that global attention over frequencies is beneficial over local convolution. We obtain 2.4 % relative word error rate reduction on a production scale replacing its convolutional neural network transducer.
arXiv Detail & Related papers (2023-06-12T08:37:36Z)
High Fidelity Neural Audio Compression [92.4812002532009]
We introduce a state-of-the-art real-time, high-fidelity, audio leveraging neural networks. It consists in a streaming encoder-decoder architecture with quantized latent space trained in an end-to-end fashion. We simplify and speed-up the training by using a single multiscale spectrogram adversary.
arXiv Detail & Related papers (2022-10-24T17:52:02Z)
Play It Back: Iterative Attention for Audio Recognition [104.628661890361]
A key function of auditory cognition is the association of characteristic sounds with their corresponding semantics over time. We propose an end-to-end attention-based architecture that through selective repetition attends over the most discriminative sounds. We show that our method can consistently achieve state-of-the-art performance across three audio-classification benchmarks.
arXiv Detail & Related papers (2022-10-20T15:03:22Z)
Simple Pooling Front-ends For Efficient Audio Classification [56.59107110017436]
We show that eliminating the temporal redundancy in the input audio features could be an effective approach for efficient audio classification. We propose a family of simple pooling front-ends (SimPFs) which use simple non-parametric pooling operations to reduce the redundant information. SimPFs can achieve a reduction in more than half of the number of floating point operations for off-the-shelf audio neural networks.
arXiv Detail & Related papers (2022-10-03T14:00:41Z)
Robust Feature Learning on Long-Duration Sounds for Acoustic Scene Classification [54.57150493905063]
Acoustic scene classification (ASC) aims to identify the type of scene (environment) in which a given audio signal is recorded. We propose a robust feature learning (RFL) framework to train the CNN.
arXiv Detail & Related papers (2021-08-11T03:33:05Z)
Switching Variational Auto-Encoders for Noise-Agnostic Audio-visual Speech Enhancement [26.596930749375474]
We introduce the use of a latent sequential variable with Markovian dependencies to switch between different VAE architectures through time. We derive the corresponding variational expectation-maximization algorithm to estimate the parameters of the model and enhance the speech signal.
arXiv Detail & Related papers (2021-02-08T11:45:02Z)
Real Time Speech Enhancement in the Waveform Domain [99.02180506016721]
We present a causal speech enhancement model working on the raw waveform that runs in real-time on a laptop CPU. The proposed model is based on an encoder-decoder architecture with skip-connections. It is capable of removing various kinds of background noise including stationary and non-stationary noises.
arXiv Detail & Related papers (2020-06-23T09:19:13Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.