Leveraging Pre-Trained Autoencoders for Interpretable Prototype Learning of Music Audio
- URL: http://arxiv.org/abs/2402.09318v1
- Date: Wed, 14 Feb 2024 17:13:36 GMT
- Title: Leveraging Pre-Trained Autoencoders for Interpretable Prototype Learning of Music Audio
- Authors: Pablo Alonso-Jiménez, Leonardo Pepino, Roser Batlle-Roca, Pablo Zinemanas, Dmitry Bogdanov, Xavier Serra, Martín Rocamora
- Abstract summary: We present PECMAE, an interpretable model for music audio classification based on prototype learning.
Our model is based on a previous method, APNet, which jointly learns an autoencoder and a prototypical network.
We find that the prototype-based models preserve most of the performance achieved with the autoencoder embeddings.
- Score: 10.946347283718923
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We present PECMAE, an interpretable model for music audio classification based on prototype learning. Our model builds on a previous method, APNet, which jointly learns an autoencoder and a prototypical network. In contrast, we propose to decouple the two training processes. This enables us to leverage existing self-supervised autoencoders pre-trained on much larger data (EnCodecMAE), providing representations with better generalization. APNet allows reconstructing the prototypes to waveforms for interpretability, but relies on the nearest training data samples to do so. Instead, we explore using a diffusion decoder that allows reconstruction without such dependency. We evaluate our method on datasets for music instrument classification (Medley-Solos-DB) and genre recognition (GTZAN and a larger in-house dataset), the latter being a more challenging task not previously addressed with prototypical networks. We find that the prototype-based models preserve most of the performance achieved with the autoencoder embeddings, while the sonification of prototypes aids in understanding the classifier's behavior.
Related papers
- Music Genre Classification using Large Language Models [50.750620612351284] (2024-10-10)
This paper exploits the zero-shot capabilities of pre-trained large language models (LLMs) for music genre classification.
The proposed approach splits audio signals into 20 ms chunks and processes them through convolutional feature encoders.
During inference, predictions on individual chunks are aggregated for a final genre classification.
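The chunk-and-aggregate inference pattern from the entry above, as a sketch; the 20 ms chunk size comes from the summary, while the mean-pooling of chunk probabilities and the `predict_chunk` callable are assumptions.

```python
import numpy as np

def classify_by_chunks(audio: np.ndarray, sr: int, predict_chunk,
                       chunk_ms: float = 20.0) -> int:
    """Split a signal into fixed-size chunks, predict each, aggregate."""
    hop = int(sr * chunk_ms / 1000)                       # samples per 20 ms chunk
    chunks = [audio[i:i + hop] for i in range(0, len(audio) - hop + 1, hop)]
    probs = np.stack([predict_chunk(c) for c in chunks])  # (n_chunks, n_classes)
    return int(probs.mean(axis=0).argmax())               # aggregated genre decision
```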
- Transfer Learning for Passive Sonar Classification using Pre-trained Audio and ImageNet Models [39.85805843651649] (2024-09-20)
This study compares pre-trained Audio Neural Networks (PANNs) and ImageNet pre-trained models.
It was observed that the ImageNet pre-trained models slightly outperform pre-trained audio models in passive sonar classification.
- With a Little Help from your own Past: Prototypical Memory Networks for Image Captioning [47.96387857237473] (2023-08-23)
We devise a network which can perform attention over activations obtained while processing other training samples.
Our memory models the distribution of past keys and values through the definition of prototype vectors.
We demonstrate that our proposal can increase the performance of an encoder-decoder Transformer by 3.7 CIDEr points both when training in cross-entropy only and when fine-tuning with self-critical sequence training.
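A sketch of the core operation the summary describes: cross-attention from the current sample's activations onto prototype vectors that summarize keys and values gathered from other training samples. Shapes and the scaling factor are assumptions, not the paper's exact layer.

```python
import torch

def prototype_memory_attention(q: torch.Tensor,
                               proto_keys: torch.Tensor,
                               proto_values: torch.Tensor) -> torch.Tensor:
    """q: (batch, n_queries, d); proto_keys/proto_values: (n_protos, d)."""
    scale = proto_keys.size(-1) ** 0.5
    attn = torch.softmax(q @ proto_keys.T / scale, dim=-1)  # attend over prototypes
    return attn @ proto_values                               # (batch, n_queries, d)
```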
- BYOL-S: Learning Self-supervised Speech Representations by Bootstrapping [19.071463356974387] (2022-06-24)
This work extends existing methods based on self-supervised learning by bootstrapping, proposes various encoder architectures, and explores the effects of using different pre-training datasets.
We present a novel training framework to derive a hybrid audio representation that combines handcrafted and data-driven learned audio features.
All the proposed representations were evaluated within the HEAR NeurIPS 2021 challenge for auditory scene classification and timestamp detection tasks.
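One way to read "hybrid audio representation" above, sketched with illustrative choices: concatenate a handcrafted feature (MFCC statistics here, an assumption) with the embedding from a bootstrapped learned encoder (`learned_encoder` is a stand-in for a BYOL-trained network).

```python
import numpy as np
import librosa

def hybrid_embedding(audio: np.ndarray, sr: int, learned_encoder) -> np.ndarray:
    """Concatenate handcrafted and data-driven learned audio features."""
    handcrafted = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=20).mean(axis=1)
    learned = learned_encoder(audio)   # assumed to return a 1-D embedding vector
    return np.concatenate([handcrafted, learned])
```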
- Self-supervised Audiovisual Representation Learning for Remote Sensing Data [96.23611272637943] (2021-08-02)
We propose a self-supervised approach for pre-training deep neural networks in remote sensing.
This is done in a completely label-free manner by exploiting the correspondence between geo-tagged audio recordings and remote sensing imagery.
We show that our approach outperforms existing pre-training strategies for remote sensing imagery.
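The label-free correspondence objective can be sketched as a symmetric InfoNCE loss that pairs each geo-tagged audio clip with its co-located image; the temperature and exact loss form are assumptions, not the paper's stated objective.

```python
import torch
import torch.nn.functional as F

def correspondence_loss(img_emb: torch.Tensor, aud_emb: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
    """Pull co-located image/audio embeddings together, push others apart."""
    img = F.normalize(img_emb, dim=-1)
    aud = F.normalize(aud_emb, dim=-1)
    logits = img @ aud.T / temperature                    # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.T, targets)) / 2
```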
- Few-Shot Named Entity Recognition: A Comprehensive Study [92.40991050806544] (2020-12-29)
We investigate three schemes to improve the model generalization ability for few-shot settings.
We perform empirical comparisons on 10 public NER datasets with various proportions of labeled data.
We create new state-of-the-art results on both few-shot and training-free settings.
- Autoencoding Variational Autoencoder [56.05008520271406] (2020-12-07)
We study whether a VAE consistently encodes typical samples generated from its own decoder, the implications of this behaviour for the learned representations, and the consequences of fixing it by introducing a notion of self-consistency.
We show that encoders trained with our self-consistency approach lead to representations that are robust (insensitive) to perturbations in the input introduced by adversarial attacks.
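A toy rendering of the self-consistency idea: a sample decoded from a latent should be encoded back near that latent. The paper works with the full variational posterior; the deterministic mean-matching penalty below is a simplification with assumed encoder/decoder interfaces.

```python
import torch

def self_consistency_penalty(encoder, decoder, z: torch.Tensor) -> torch.Tensor:
    """Penalize the encoder for not recovering the latent of its own samples."""
    x = decoder(z)             # generate from a latent sample
    mu, _logvar = encoder(x)   # re-encode the generated sample (assumed interface)
    return ((mu - z) ** 2).mean()
```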
- Ensemble Wrapper Subsampling for Deep Modulation Classification [70.91089216571035] (2020-05-10)
Subsampling of received wireless signals is important for relaxing hardware requirements as well as the computational cost of signal processing algorithms.
We propose a subsampling technique to facilitate the use of deep learning for automatic modulation classification in wireless communication systems.
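A sketch of the wrapper-style subsampling idea: keep only the sample positions that a classifier-driven usefulness score ranks highest, shortening the received signal before classification. The scoring interface below is an assumption; in the paper's wrapper approach such scores would come from evaluation with the deep classifier itself.

```python
import numpy as np

def wrapper_subsample(signal: np.ndarray, scores: np.ndarray,
                      keep: int) -> np.ndarray:
    """Keep the `keep` highest-scoring sample positions, in time order."""
    idx = np.sort(np.argsort(scores)[-keep:])  # top-scoring indices, chronological
    return signal[idx]
```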