A Modulation Front-End for Music Audio Tagging
- URL: http://arxiv.org/abs/2105.11836v1
- Date: Tue, 25 May 2021 11:05:24 GMT
- Title: A Modulation Front-End for Music Audio Tagging
- Authors: Cyrus Vahidi, Charalampos Saitis, György Fazekas
- Abstract summary: Modulation filter bank representations have the potential to facilitate the extraction of perceptually salient features.
We explore end-to-end learned front-ends for audio representation learning, ModNet and SincModNet, that incorporate a temporal modulation processing block.
We evaluate the performance of our model against the state-of-the-art of music tagging on the MagnaTagATune dataset.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Convolutional Neural Networks have been extensively explored in the task of
automatic music tagging. The problem can be approached by using either
engineered time-frequency features or raw audio as input. Modulation filter
bank representations, which have been actively researched as a basis for timbre
perception, have the potential to facilitate the extraction of perceptually
salient features. We explore end-to-end learned front-ends for audio
representation learning, ModNet and SincModNet, that incorporate a temporal
modulation processing block. The structure is effectively analogous to a
modulation filter bank, where the FIR filter center frequencies are learned in
a data-driven manner. The expectation is that a perceptually motivated filter
bank can provide a useful representation for identifying music features. Our
experimental results provide a fully visualisable and interpretable front-end
temporal modulation decomposition of raw audio. We evaluate the performance of
our model against the state-of-the-art of music tagging on the MagnaTagATune
dataset. We analyse the impact on performance for particular tags when
time-frequency bands are subsampled by the modulation filters at a
progressively reduced rate. We demonstrate that modulation filtering provides
promising results for music tagging and feature representation, without using
extensive musical domain knowledge in the design of this front-end.
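To make the idea concrete, below is a minimal sketch of a sinc-parameterised temporal modulation block applied to sub-band amplitude envelopes, assuming the envelopes have already been extracted by a first-stage time-frequency front-end. The class name, filter count, envelope sampling rate and windowing are illustrative assumptions, not the exact ModNet/SincModNet architecture.

```python
# A minimal, hypothetical sketch of a sinc-parameterised temporal modulation
# filter bank in the spirit of ModNet / SincModNet. All sizes and names are
# assumptions, not the authors' exact design.
import torch
import torch.nn as nn


class SincModulationBank(nn.Module):
    """FIR band-pass filters whose centre frequencies are learned from data."""

    def __init__(self, n_filters=8, kernel_size=129, env_rate=100.0):
        super().__init__()
        self.kernel_size = kernel_size
        self.env_rate = env_rate  # sampling rate of the amplitude envelopes (Hz)
        # Modulation centre frequencies and bandwidths (Hz), learned end to end.
        self.center_hz = nn.Parameter(torch.linspace(0.5, 20.0, n_filters))
        self.band_hz = nn.Parameter(torch.full((n_filters,), 2.0))

    def forward(self, envelopes):
        # envelopes: (batch, bands, time) amplitude envelopes of T-F sub-bands.
        t = (torch.arange(self.kernel_size) - self.kernel_size // 2) / self.env_rate
        t = t.to(envelopes.dtype)
        low = (self.center_hz.abs() - self.band_hz.abs() / 2).clamp(min=0.0)
        high = self.center_hz.abs() + self.band_hz.abs() / 2

        # Band-pass kernel = difference of two windowed sinc low-pass kernels.
        def lowpass(fc):
            return 2 * fc.unsqueeze(1) * torch.sinc(2 * fc.unsqueeze(1) * t)

        kernels = lowpass(high) - lowpass(low)
        kernels = kernels * torch.hamming_window(self.kernel_size, dtype=envelopes.dtype)
        kernels = kernels.unsqueeze(1)                      # (n_filters, 1, K)
        b, bands, time = envelopes.shape
        x = envelopes.reshape(b * bands, 1, time)
        y = nn.functional.conv1d(x, kernels, padding=self.kernel_size // 2)
        return y.reshape(b, bands, -1, time)                # (batch, bands, n_filters, time)


# Toy usage: 4 spectral sub-bands, envelopes sampled at 100 Hz.
envelopes = torch.rand(2, 4, 500)
modulations = SincModulationBank()(envelopes)
print(modulations.shape)                                    # torch.Size([2, 4, 8, 500])
```

Because each kernel is determined by just a centre frequency and a bandwidth, the resulting decomposition stays visualisable and interpretable, in line with the abstract's claim.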
Related papers
- The Concatenator: A Bayesian Approach To Real Time Concatenative Musaicing [0.0]
We present "The Concatenator," a real-time system for audio-guided concatenative synthesis.
We use a particle filter to infer the best corpus states in real-time.
Our system scales to corpora that are hours long, which is an important feature in the age of vast audio data collections.
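As a rough illustration of the inference machinery, the following bootstrap particle filter sketch tracks discrete corpus-grain indices from streaming feature observations; the transition model, similarity-based likelihood and all sizes are assumptions for illustration, not the system described in the paper.

```python
# A minimal bootstrap particle filter over discrete corpus-grain indices.
# Transition and observation models are illustrative assumptions only.
import numpy as np


def particle_filter_step(particles, weights, observation, corpus, stay_prob=0.7, rng=None):
    """One filtering step: propagate, reweight, and resample if needed."""
    if rng is None:
        rng = np.random.default_rng()
    n_corpus = len(corpus)
    # Transition: stay on the current grain or jump to its successor.
    jump = rng.random(len(particles)) > stay_prob
    particles = np.where(jump, (particles + 1) % n_corpus, particles)
    # Observation: weight by similarity between grain features and the input.
    dists = np.linalg.norm(corpus[particles] - observation, axis=1)
    weights = weights * np.exp(-dists)
    weights = weights / weights.sum()
    # Resample when the effective sample size collapses.
    ess = 1.0 / np.sum(weights ** 2)
    if ess < 0.5 * len(particles):
        idx = rng.choice(len(particles), size=len(particles), p=weights)
        particles = particles[idx]
        weights = np.full(len(particles), 1.0 / len(particles))
    return particles, weights


# Toy usage: a corpus of 100 grains with 12-dim features, 256 particles.
rng = np.random.default_rng(0)
corpus = rng.normal(size=(100, 12))
particles = rng.integers(0, 100, size=256)
weights = np.full(256, 1.0 / 256)
obs = corpus[42] + 0.1 * rng.normal(size=12)
particles, weights = particle_filter_step(particles, weights, obs, corpus, rng=rng)
print(np.bincount(particles).argmax())   # most likely corpus grain under the posterior
```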
arXiv Detail & Related papers (2024-11-07T01:52:46Z) - FilterNet: Harnessing Frequency Filters for Time Series Forecasting [34.83702192033196]
FilterNet is built upon our proposed learnable frequency filters to extract key informative temporal patterns by selectively passing or attenuating certain components of time series signals.
Equipped with the two filters, FilterNet can approximately surrogate the linear and attention mappings widely adopted in the time series literature.
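A hypothetical sketch of the core operation, a learnable frequency-domain filter that passes or attenuates rFFT components of a series; the pass-through initialisation and sizes are assumptions, and this is not the exact FilterNet block design.

```python
# A learnable frequency-domain filter: FFT, multiply by learnable per-bin
# gains, inverse FFT. Sizes and initialisation are illustrative assumptions.
import torch
import torch.nn as nn


class LearnableFrequencyFilter(nn.Module):
    def __init__(self, seq_len):
        super().__init__()
        n_freq = seq_len // 2 + 1
        # One learnable complex gain per rFFT bin, initialised to pass-through.
        self.gain_real = nn.Parameter(torch.ones(n_freq))
        self.gain_imag = nn.Parameter(torch.zeros(n_freq))

    def forward(self, x):
        # x: (batch, seq_len) real-valued series.
        spec = torch.fft.rfft(x, dim=-1)
        gain = torch.complex(self.gain_real, self.gain_imag)
        return torch.fft.irfft(spec * gain, n=x.shape[-1], dim=-1)


x = torch.randn(8, 96)                  # e.g. a 96-step forecasting window
y = LearnableFrequencyFilter(96)(x)
print(y.shape)                           # torch.Size([8, 96])
```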
arXiv Detail & Related papers (2024-11-03T16:20:41Z) - DITTO: Diffusion Inference-Time T-Optimization for Music Generation [49.90109850026932]
Diffusion Inference-Time T-Optimization (DITTO) is a framework for controlling pre-trained text-to-music diffusion models at inference time.
We demonstrate a surprisingly wide range of applications for music generation, including inpainting, outpainting, and looping, as well as intensity, melody, and musical structure control.
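A heavily simplified sketch of the inference-time optimisation pattern: gradient descent on the initial noise so that a differentiable generator matches a target feature. Both `generate` and `feature` below are placeholders, not the paper's diffusion sampler or control losses.

```python
# Inference-time optimisation of the initial noise of a (placeholder)
# differentiable generator; the functions below are stand-ins only.
import torch


def generate(z):                     # placeholder for a differentiable diffusion sampler
    return torch.tanh(z).cumsum(dim=-1)


def feature(audio):                  # placeholder for e.g. an intensity curve
    return audio.abs().mean(dim=-1)


target = torch.tensor([0.8])         # desired feature value
z = torch.randn(1, 256, requires_grad=True)
opt = torch.optim.Adam([z], lr=0.05)

for step in range(200):
    opt.zero_grad()
    loss = (feature(generate(z)) - target).pow(2).mean()
    loss.backward()
    opt.step()

print(float(loss))                   # the generated output's feature approaches the target
```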
arXiv Detail & Related papers (2024-01-22T18:10:10Z) - Perceptual Musical Features for Interpretable Audio Tagging [2.1730712607705485]
This study explores the relevance of interpretability in the context of automatic music tagging.
We constructed a workflow that incorporates three different information extraction techniques.
We conducted experiments on two datasets, namely the MTG-Jamendo dataset and the GTZAN dataset.
arXiv Detail & Related papers (2023-12-18T14:31:58Z) - Content Adaptive Front End For Audio Signal Processing [2.8935588665357077]
We propose a learnable content adaptive front end for audio signal processing.
We pass each audio signal through a bank of convolutional filters, each giving a fixed-dimensional vector.
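A minimal sketch of the general pattern, assuming a bank of 1-D convolutional filters followed by global pooling so that each filter contributes one entry of a fixed-dimensional vector regardless of signal length; the filter count and kernel size are illustrative, not the paper's configuration.

```python
# A bank of 1-D convolutional filters with global pooling: each filter yields
# one fixed-dimensional summary of the signal. Sizes are illustrative only.
import torch
import torch.nn as nn


class ConvFilterBankFrontEnd(nn.Module):
    def __init__(self, n_filters=64, kernel_size=401):
        super().__init__()
        self.bank = nn.Conv1d(1, n_filters, kernel_size, stride=kernel_size // 2)

    def forward(self, audio):
        # audio: (batch, samples) raw waveform.
        x = self.bank(audio.unsqueeze(1))           # (batch, n_filters, frames)
        return torch.log1p(x.abs()).mean(dim=-1)    # (batch, n_filters) fixed-size vector


front_end = ConvFilterBankFrontEnd()
vec = front_end(torch.randn(4, 16000))               # one second at 16 kHz
print(vec.shape)                                      # torch.Size([4, 64])
```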
arXiv Detail & Related papers (2023-03-18T16:09:10Z) - Audio-Visual Contrastive Learning with Temporal Self-Supervision [84.11385346896412]
We propose a self-supervised learning approach for videos that learns representations of both the RGB frames and the accompanying audio without human supervision.
To leverage the temporal and aural dimension inherent to videos, our method extends temporal self-supervision to the audio-visual setting.
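As a rough illustration of the cross-modal objective, a symmetric InfoNCE loss between per-clip video and audio embeddings is sketched below; the paper's temporal self-supervision tasks are not reproduced, and the embedding size and temperature are assumptions.

```python
# A symmetric cross-modal InfoNCE loss: same-clip video/audio pairs are
# positives, all other pairs in the batch are negatives. Illustrative only.
import torch
import torch.nn.functional as F


def audio_visual_nce(video_emb, audio_emb, temperature=0.07):
    v = F.normalize(video_emb, dim=-1)
    a = F.normalize(audio_emb, dim=-1)
    logits = v @ a.t() / temperature             # (batch, batch) similarity matrix
    targets = torch.arange(v.shape[0])           # positives sit on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


loss = audio_visual_nce(torch.randn(16, 128), torch.randn(16, 128))
print(float(loss))
```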
arXiv Detail & Related papers (2023-02-15T15:00:55Z) - Interpretable Acoustic Representation Learning on Breathing and Speech Signals for COVID-19 Detection [37.01066509527848]
We describe an approach for representation learning of audio signals for the task of COVID-19 detection.
The raw audio samples are processed with a bank of 1-D convolutional filters that are parameterized as cosine modulated Gaussian functions.
The filtered outputs are pooled, log-compressed and used in a self-attention based relevance weighting mechanism.
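A sketch of the front-end described above, assuming 1-D kernels parameterised as cosine-modulated Gaussians followed by pooling and log compression; the filter count, kernel length, bandwidth parameterisation and hop size are assumptions, and the self-attention relevance weighting is omitted.

```python
# 1-D filters parameterised as cosine-modulated Gaussians, followed by average
# pooling and log compression. Parameter values are illustrative assumptions.
import torch
import torch.nn as nn


class CosineGaussianFilterBank(nn.Module):
    def __init__(self, n_filters=40, kernel_size=401, sample_rate=16000):
        super().__init__()
        self.kernel_size = kernel_size
        self.sample_rate = sample_rate
        self.center_hz = nn.Parameter(torch.linspace(100.0, 6000.0, n_filters))
        self.bandwidth_hz = nn.Parameter(torch.full((n_filters,), 200.0))

    def kernels(self):
        t = (torch.arange(self.kernel_size) - self.kernel_size // 2) / self.sample_rate
        mu = self.center_hz.unsqueeze(1)                                # (F, 1)
        sigma = 1.0 / (2 * torch.pi * self.bandwidth_hz.abs().unsqueeze(1) + 1e-6)
        gauss = torch.exp(-0.5 * (t / sigma) ** 2)
        return (torch.cos(2 * torch.pi * mu * t) * gauss).unsqueeze(1)  # (F, 1, K)

    def forward(self, audio):
        # audio: (batch, samples) raw waveform.
        x = nn.functional.conv1d(audio.unsqueeze(1), self.kernels(),
                                 stride=self.sample_rate // 100)        # ~10 ms hop
        x = nn.functional.avg_pool1d(x.abs(), kernel_size=5, stride=5)  # pooling
        return torch.log1p(x)                                           # log compression


feats = CosineGaussianFilterBank()(torch.randn(2, 16000))
print(feats.shape)                                                      # (2, 40, frames)
```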
arXiv Detail & Related papers (2022-06-27T15:20:51Z) - SpecGrad: Diffusion Probabilistic Model based Neural Vocoder with Adaptive Noise Spectral Shaping [51.698273019061645]
SpecGrad adapts the diffusion noise so that its time-varying spectral envelope becomes close to the conditioning log-mel spectrogram.
It is processed in the time-frequency domain to keep the computational cost almost the same as the conventional DDPM-based neural vocoders.
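A simplified sketch of the general idea of shaping noise in the time-frequency domain so that its spectral envelope follows a conditioning target; the random target envelope and the crude per-bin normalisation below are placeholders, not the SpecGrad filter design.

```python
# Shape white noise in the STFT domain so its magnitude envelope follows a
# target envelope, then resynthesise. Target and mapping are placeholders.
import torch

sample_len, n_fft, hop = 16000, 1024, 256
noise = torch.randn(sample_len)

# Pretend conditioning: a target magnitude envelope per STFT bin and frame.
frames = sample_len // hop + 1
target_env = torch.rand(n_fft // 2 + 1, frames) + 0.1

window = torch.hann_window(n_fft)
spec = torch.stft(noise, n_fft, hop_length=hop, window=window, return_complex=True)
env = spec.abs().clamp_min(1e-6)
shaped = spec / env * target_env[:, : spec.shape[1]]        # impose the target envelope
shaped_noise = torch.istft(shaped, n_fft, hop_length=hop, window=window, length=sample_len)
print(shaped_noise.shape)                                    # torch.Size([16000])
```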
arXiv Detail & Related papers (2022-03-31T02:08:27Z) - EEGminer: Discovering Interpretable Features of Brain Activity with
Learnable Filters [72.19032452642728]
We propose a novel differentiable EEG decoding pipeline consisting of learnable filters and a pre-determined feature extraction module.
We demonstrate the utility of our model towards emotion recognition from EEG signals on the SEED dataset and on a new EEG dataset of unprecedented size.
The discovered features align with previous neuroscience studies and offer new insights, such as marked differences in the functional connectivity profile between left and right temporal areas during music listening.
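A hypothetical sketch of the overall pattern of learnable filters followed by a pre-determined feature, here depthwise FIR filters per EEG channel and log band power; the filter shape and the feature choice are assumptions for illustration, not the EEGminer pipeline.

```python
# Learnable depthwise FIR filtering per EEG channel, followed by a fixed
# (pre-determined) feature: log band power. Illustrative assumptions only.
import torch
import torch.nn as nn


class LearnableBandpassFeatures(nn.Module):
    def __init__(self, n_channels=62, kernel_size=65):
        super().__init__()
        # One FIR kernel per EEG channel, learned end to end.
        self.filters = nn.Conv1d(n_channels, n_channels, kernel_size,
                                 padding=kernel_size // 2, groups=n_channels, bias=False)

    def forward(self, eeg):
        # eeg: (batch, channels, time)
        filtered = self.filters(eeg)
        return torch.log(filtered.var(dim=-1) + 1e-6)    # fixed feature: log band power


feats = LearnableBandpassFeatures()(torch.randn(8, 62, 1000))
print(feats.shape)                                        # torch.Size([8, 62])
```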
arXiv Detail & Related papers (2021-10-19T14:22:04Z) - Hierarchical Timbre-Painting and Articulation Generation [92.59388372914265]
We present a fast and high-fidelity method for music generation, based on specified f0 and loudness.
The synthesized audio mimics the timbre and articulation of a target instrument.
arXiv Detail & Related papers (2020-08-30T05:27:39Z) - iffDetector: Inference-aware Feature Filtering for Object Detection [70.8678270164057]
We introduce a generic Inference-aware Feature Filtering (IFF) module that can easily be combined with modern detectors.
IFF performs closed-loop optimization by leveraging high-level semantics to enhance the convolutional features.
IFF can be fused with CNN-based object detectors in a plug-and-play manner with negligible computational cost overhead.
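A generic sketch of gating convolutional features with high-level semantics, where pooled class logits produce per-channel weights that are multiplied back onto the feature map; this only illustrates the closed-loop idea and is not the exact IFF formulation.

```python
# Gate convolutional features with pooled high-level (class-logit) semantics.
# Module name, sizes and the gating form are illustrative assumptions.
import torch
import torch.nn as nn


class SemanticFeatureFilter(nn.Module):
    def __init__(self, n_channels=256, n_classes=80):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(n_classes, n_channels), nn.Sigmoid())

    def forward(self, features, class_logits):
        # features: (batch, C, H, W); class_logits: (batch, n_classes, H, W)
        semantics = class_logits.mean(dim=(2, 3))           # pool high-level predictions
        weights = self.gate(semantics)                       # per-channel gates in (0, 1)
        return features * weights.unsqueeze(-1).unsqueeze(-1)


filtered = SemanticFeatureFilter()(torch.randn(2, 256, 32, 32),
                                   torch.randn(2, 80, 32, 32))
print(filtered.shape)                                        # torch.Size([2, 256, 32, 32])
```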
arXiv Detail & Related papers (2020-06-23T02:57:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.