A Novel Audio Representation for Music Genre Identification in MIR
- URL: http://arxiv.org/abs/2404.01058v1
- Date: Mon, 1 Apr 2024 11:40:09 GMT
- Title: A Novel Audio Representation for Music Genre Identification in MIR
- Authors: Navin Kamuni, Mayank Jindal, Arpita Soni, Sukender Reddy Mallreddy, Sharath Chandra Macha
- Abstract summary: For Music Information Retrieval (MIR) downstream tasks, the most common audio representations are time-frequency-based, such as Mel spectrograms.
This study explores a new form of audio representation for one of the most common MIR downstream tasks: music genre identification.
The representation is derived from Jukebox, a generative music model that encodes audio discretely using deep vector quantization.
The effectiveness of Jukebox's audio representation is compared to Mel spectrograms using a dataset comparable to state-of-the-art (SOTA) work and a nearly identical transformer design.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: For Music Information Retrieval (MIR) downstream tasks, the most common audio representations are time-frequency-based, such as Mel spectrograms. To identify musical genres, one of the most common MIR downstream tasks, this study explores the possibilities of a new form of audio representation: music encoded discretely through deep vector quantization, the novel representation created for the generative music model Jukebox. The effectiveness of Jukebox's audio representation is compared to Mel spectrograms using a dataset comparable to state-of-the-art (SOTA) work and a nearly identical transformer design. The results imply that, at least when the transformers are pretrained on a very modest dataset of 20k tracks, Jukebox's audio representation is not superior to Mel spectrograms. This could be explained by the fact that Jukebox's audio representation does not sufficiently take into account the peculiarities of human auditory perception, whereas Mel spectrograms are created specifically with the human auditory sense in mind.
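To ground the comparison, here is a minimal sketch of the two representation families the abstract contrasts: a Mel spectrogram (computed here with librosa) and a toy nearest-codeword quantization standing in for Jukebox's deep vector quantization, which in reality quantizes learned latents rather than spectrogram frames. The file name, codebook, and all parameters are illustrative assumptions, not the authors' configuration.

```python
import librosa
import numpy as np

# Mel-spectrogram baseline: the Mel filter bank warps the frequency axis to
# approximate human pitch perception, the property the abstract credits for
# this representation's strength.
y, sr = librosa.load("track.wav", sr=22050, mono=True)    # hypothetical input file
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                     hop_length=512, n_mels=128)
log_mel = librosa.power_to_db(mel, ref=np.max)            # (n_mels, n_frames)

# Toy stand-in for deep vector quantization: map each frame to the index of
# its nearest codebook vector, yielding a discrete token sequence.
rng = np.random.default_rng(0)
codebook = rng.standard_normal((512, 128))                # 512 hypothetical code vectors
frames = log_mel.T                                        # (n_frames, n_mels)
dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
tokens = dists.argmin(axis=1)                             # one integer code per frame
```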
Related papers
- The Rarity of Musical Audio Signals Within the Space of Possible Audio Generation
A white noise signal can access any possible configuration of values, though statistically, over many samples, it tends toward a uniform spectral distribution.
The probability that white noise generates a music-like signal over different durations is analyzed.
The applicability of this study is not just to show that music has a precious rarity value; examining the size of the space of music relative to the overall space of audio signals can inform new generations of algorithmic music systems (see the sketch after this entry).
arXiv Detail & Related papers (2024-05-23T23:25:46Z)
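A quick numerical check of the premise above: averaged over many draws, Gaussian white noise shows a flat magnitude spectrum, even though any fixed waveform, musical or not, remains a possible (if vanishingly unlikely) draw. Signal length and trial count below are arbitrary choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 2048, 2000
noise = rng.standard_normal((trials, n))           # many independent white-noise signals
mean_spectrum = np.abs(np.fft.rfft(noise, axis=1)).mean(axis=0)

# Interior frequency bins are statistically flat: max/min ratio close to 1.
interior = mean_spectrum[1:-1]                     # drop the DC and Nyquist bins
print(interior.max() / interior.min())
```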
- AudioFormer: Audio Transformer learns audio feature representations from discrete acoustic codes
We propose a method named AudioFormer, which learns audio feature representations through the acquisition of discrete acoustic codes.
Our research outcomes demonstrate that AudioFormer attains significantly improved performance compared to prevailing monomodal audio classification models.
arXiv Detail & Related papers (2023-08-14T15:47:25Z)
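A hedged sketch of the discrete-code idea from the AudioFormer entry above: once audio is a sequence of integer codes, it can be embedded and modeled much like text tokens. Vocabulary size, model width, and depth are illustrative, not AudioFormer's actual architecture.

```python
import torch
import torch.nn as nn

vocab, d_model, n_classes = 1024, 256, 10
codes = torch.randint(0, vocab, (1, 500))       # one clip as 500 discrete acoustic codes

embed = nn.Embedding(vocab, d_model)            # code indices -> vectors, as with text
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
    num_layers=2,
)
features = encoder(embed(codes))                # (1, 500, d_model) contextual features
logits = nn.Linear(d_model, n_classes)(features.mean(dim=1))  # pooled clip-level logits
```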
- MARBLE: Music Audio Representation Benchmark for Universal Evaluation
We introduce the Music Audio Representation Benchmark for universaL Evaluation, termed MARBLE.
It aims to provide a benchmark for various Music Information Retrieval (MIR) tasks by defining a comprehensive taxonomy with four hierarchy levels, including acoustic, performance, score, and high-level description.
We then establish a unified protocol based on 14 tasks on 8 publicly available datasets, providing a fair and standard assessment of representations from all open-sourced pre-trained models developed on music recordings as baselines (a sketch of this probing pattern follows this entry).
arXiv Detail & Related papers (2023-06-18T12:56:46Z)
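Benchmark protocols of this kind usually freeze each pre-trained model and train only a shallow head per task; a minimal sketch of that pattern follows, with `pretrained_encoder` standing in for any open-sourced music model and the encoder assumed to emit one `feat_dim`-dimensional vector per clip (MARBLE's actual harness differs in detail).

```python
import torch.nn as nn

def make_probe(pretrained_encoder: nn.Module, feat_dim: int, n_classes: int) -> nn.Module:
    """Freeze the encoder; only the linear head is trained for each benchmark task."""
    for p in pretrained_encoder.parameters():
        p.requires_grad = False                 # the representation stays fixed
    return nn.Sequential(pretrained_encoder, nn.Linear(feat_dim, n_classes))
```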
- Simple and Controllable Music Generation
MusicGen is a single Language Model (LM) that operates over several streams of compressed discrete music representations, i.e., tokens.
Unlike prior work, MusicGen consists of a single-stage transformer LM together with efficient token-interleaving patterns.
arXiv Detail & Related papers (2023-06-08T15:31:05Z)
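The "token-interleaving patterns" above can be illustrated with a simple delay pattern over K parallel codebook streams: stream k is shifted right by k steps so a single-stage LM can predict all codebooks jointly. The offsets and padding token are illustrative, not necessarily MusicGen's exact scheme.

```python
import numpy as np

K, T, PAD = 4, 6, -1
streams = np.arange(K * T).reshape(K, T)     # toy codes from K parallel codebooks

delayed = np.full((K, T + K - 1), PAD)
for k in range(K):
    delayed[k, k:k + T] = streams[k]         # codebook k is delayed by k steps

print(delayed)  # staircase layout: at step t the model sees codebook k's code t-k
```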
- Learning music audio representations via weak language supervision
We design a multimodal architecture for music and language pre-training (MuLaP) optimised via a set of proxy tasks.
Weak supervision is provided in the form of noisy natural language descriptions conveying the overall musical content of the track.
We demonstrate the usefulness of our approach by comparing the performance of audio representations produced by the same audio backbone with different training strategies.
arXiv Detail & Related papers (2021-12-08T10:30:52Z)
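One generic way to exploit noisy track descriptions is a contrastive audio-text objective, sketched below; this is an illustrative stand-in, not necessarily MuLaP's actual proxy tasks.

```python
import torch
import torch.nn.functional as F

# Placeholder embeddings: in practice these would come from an audio backbone
# and a text encoder applied to the noisy natural-language descriptions.
audio_emb = F.normalize(torch.randn(8, 128), dim=1)
text_emb = F.normalize(torch.randn(8, 128), dim=1)

logits = audio_emb @ text_emb.T / 0.07   # pairwise similarities, temperature 0.07
targets = torch.arange(8)                # matching clip/description pairs on the diagonal
loss = F.cross_entropy(logits, targets)  # pull true pairs together, push mismatches apart
```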
- Codified audio language modeling learns useful representations for music information retrieval
We show that language models pre-trained on codified (discretely-encoded) music audio learn representations that are useful for downstream MIR tasks.
To determine if Jukebox's representations contain useful information for MIR, we use them as input features to train shallow models on several MIR tasks.
We observe that representations from Jukebox are considerably stronger than those from models pre-trained on tagging, suggesting that pre-training via codified audio language modeling may address blind spots in conventional approaches.
arXiv Detail & Related papers (2021-07-12T18:28:50Z)
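The probing setup described in this entry, frozen representations as input features for shallow models, might look like the sketch below; the arrays are random placeholders for features actually extracted from a codified-audio language model, and the feature dimension is an assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Placeholders for extracted features: (n_clips, feature_dim) plus genre labels.
X_train, y_train = rng.standard_normal((200, 4800)), rng.integers(0, 10, 200)
X_test, y_test = rng.standard_normal((50, 4800)), rng.integers(0, 10, 50)

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(probe.score(X_test, y_test))  # probe accuracy proxies representation quality
```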
- MusicBERT: Symbolic Music Understanding with Large-Scale Pre-Training
Symbolic music understanding refers to understanding music from symbolic data.
MusicBERT is a large-scale pre-trained model for music understanding.
arXiv Detail & Related papers (2021-06-10T10:13:05Z)
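Symbolic music understanding starts from note events rather than waveforms; below is a generic event-token encoding of the kind such models consume (an illustrative scheme, not MusicBERT's actual OctupleMIDI encoding).

```python
# Each note as (MIDI pitch, onset in beats, duration in beats); values are made up.
notes = [(60, 0.0, 0.5), (64, 0.5, 0.5), (67, 1.0, 1.0)]

tokens = []
for pitch, onset, dur in notes:
    # Quantize time to 16th-note steps and emit one token per note attribute.
    tokens += [f"PITCH_{pitch}", f"POS_{int(onset * 4)}", f"DUR_{int(dur * 4)}"]

print(tokens)  # ['PITCH_60', 'POS_0', 'DUR_2', 'PITCH_64', 'POS_2', 'DUR_2', ...]
```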
- Audio ALBERT: A Lite BERT for Self-supervised Learning of Audio Representation
We propose Audio ALBERT, a lite version of the self-supervised speech representation model.
We show that Audio ALBERT achieves performance competitive with much larger models on downstream tasks.
In probing experiments, we find that intermediate latent representations encode richer phoneme and speaker information than the last layer.
arXiv Detail & Related papers (2020-05-18T10:42:44Z)
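ALBERT-style "lite" models cut parameters chiefly by reusing one transformer layer at every depth; a toy version of that weight-sharing idea follows, with all sizes illustrative.

```python
import torch
import torch.nn as nn

# One set of layer weights, applied repeatedly in place of a stack of distinct layers.
layer = nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True)

def shared_layer_encoder(x: torch.Tensor, n_passes: int = 6) -> torch.Tensor:
    for _ in range(n_passes):        # the same parameters act as every "layer"
        x = layer(x)
    return x

out = shared_layer_encoder(torch.randn(1, 100, 256))   # (batch, frames, dim)
```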
- Multi-Modal Music Information Retrieval: Augmenting Audio-Analysis with Visual Computing for Improved Music Video Analysis
This thesis combines audio-analysis with computer vision to approach Music Information Retrieval (MIR) tasks from a multi-modal perspective.
The main hypothesis of this work is based on the observation that certain expressive categories such as genre or theme can be recognized on the basis of the visual content alone.
The experiments are conducted for three MIR tasks: Artist Identification, Music Genre Classification, and Cross-Genre Classification.
arXiv Detail & Related papers (2020-02-01T17:57:14Z)