Training chord recognition models on artificially generated audio
- URL: http://arxiv.org/abs/2508.05878v1
- Date: Thu, 07 Aug 2025 22:01:58 GMT
- Title: Training chord recognition models on artificially generated audio
- Authors: Martyna Majchrzak, Jacek Mańdziuk
- Abstract summary: This study compares two Transformer-based neural network models for chord sequence recognition in audio recordings. Experiments prove that even though there are differences in complexity and structure between artificially generated and human-composed music, the former can be useful in certain scenarios.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: One of the challenging problems in Music Information Retrieval is the acquisition of enough non-copyrighted audio recordings for model training and evaluation. This study compares two Transformer-based neural network models for chord sequence recognition in audio recordings and examines the effectiveness of using an artificially generated dataset for this purpose. The models are trained on various combinations of Artificial Audio Multitracks (AAM), Schubert's Winterreise Dataset, and the McGill Billboard Dataset and evaluated with three metrics: Root, MajMin and Chord Content Metric (CCM). The experiments prove that even though there are certainly differences in complexity and structure between artificially generated and human-composed music, the former can be useful in certain scenarios. Specifically, AAM can enrich a smaller training dataset of music composed by a human or can even be used as a standalone training set for a model that predicts chord sequences in pop music, if no other data is available.
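As a rough illustration of two of the metrics named in the abstract, the sketch below computes simplified frame-wise Root and MajMin scores. It is not the authors' evaluation code: the label format (`C:maj`, `A:min`, `N` for no chord), the equal-length frame assumption, and the function names `parse`, `root_score` and `majmin_score` are all assumptions made for this example, and the Chord Content Metric (CCM) is not covered.

```python
def parse(label):
    """Split a 'root:quality' label into (root, quality); 'N' means no chord."""
    if label == "N":
        return ("N", "N")
    root, _, quality = label.partition(":")
    # Assumption: a bare root such as "C" is read as a major chord.
    return (root, quality or "maj")


def root_score(reference, estimate):
    """Fraction of frames whose estimated root equals the reference root.

    Simplification: roots are compared as strings, so enharmonic
    spellings such as C# and Db are treated as different.
    """
    pairs = list(zip(reference, estimate))
    if not pairs:
        return 0.0
    hits = sum(parse(r)[0] == parse(e)[0] for r, e in pairs)
    return hits / len(pairs)


def majmin_score(reference, estimate):
    """Fraction of frames where both root and maj/min quality match.

    Frames whose reference chord is not maj, min or N are skipped.
    """
    pairs = [(parse(r), parse(e)) for r, e in zip(reference, estimate)]
    pairs = [(r, e) for r, e in pairs if r[1] in ("maj", "min", "N")]
    if not pairs:
        return 0.0
    hits = sum(r == e for r, e in pairs)
    return hits / len(pairs)


# Hypothetical four-frame example.
reference = ["C:maj", "C:maj", "A:min", "N"]
estimate = ["C:maj", "C:min", "A:min", "G:maj"]
print(root_score(reference, estimate))    # 0.75
print(majmin_score(reference, estimate))  # 0.5
```

In this toy example the second frame has the right root but the wrong quality, so it counts toward Root but not MajMin, which is why the two scores differ.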
Related papers
- Muse: Towards Reproducible Long-Form Song Generation with Fine-Grained Style Control [66.46754271097555]
We release a fully open-source system for long-form song generation with fine-grained style conditioning. The dataset consists of 116k fully licensed synthetic songs with automatically generated lyrics and style descriptions. We train Muse via single-stage supervised finetuning of a Qwen-based language model extended with discrete audio tokens.
arXiv Detail & Related papers (2026-01-07T14:40:48Z) - Automatic Identification of Samples in Hip-Hop Music via Multi-Loss Training and an Artificial Dataset [0.29998889086656577]
We show that a convolutional neural network trained on an artificial dataset can identify real-world samples in commercial hip-hop music. We optimize the model using a joint classification and metric learning loss and show that it achieves 13% greater precision on real-world instances of sampling.
arXiv Detail & Related papers (2025-02-10T11:30:35Z) - Stem-JEPA: A Joint-Embedding Predictive Architecture for Musical Stem Compatibility Estimation [3.8570045844185237]
We present Stem-JEPA, a novel Joint-Embedding Predictive Architecture (JEPA) trained on a multi-track dataset.
Our model comprises two networks: an encoder and a predictor, which are jointly trained to predict the embeddings of compatible stems.
We evaluate our model's performance on a retrieval task on the MUSDB18 dataset, testing its ability to find the missing stem from a mix.
arXiv Detail & Related papers (2024-08-05T14:34:40Z) - Naturalistic Music Decoding from EEG Data via Latent Diffusion Models [14.882764251306094]
This study represents an initial foray into high-quality general music reconstruction using non-invasive EEG data. We train our models on the public NMED-T dataset and perform quantitative evaluation, proposing neural embedding-based metrics.
arXiv Detail & Related papers (2024-05-15T03:26:01Z) - Self-Supervised Contrastive Learning for Robust Audio-Sheet Music Retrieval Systems [3.997809845676912]
We show that self-supervised contrastive learning can mitigate the scarcity of annotated data from real music content.
We employ the snippet embeddings in the higher-level task of cross-modal piece identification.
In this work, we observe that the retrieval quality improves from 30% up to 100% when real music data is present.
arXiv Detail & Related papers (2023-09-21T14:54:48Z) - Self-Supervised Visual Acoustic Matching [63.492168778869726]
Acoustic matching aims to re-synthesize an audio clip to sound as if it were recorded in a target acoustic environment.
We propose a self-supervised approach to visual acoustic matching where training samples include only the target scene image and audio.
Our approach jointly learns to disentangle room acoustics and re-synthesize audio into the target environment, via a conditional GAN framework and a novel metric.
arXiv Detail & Related papers (2023-07-27T17:59:59Z) - MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training [74.32603591331718]
We propose an acoustic Music undERstanding model with large-scale self-supervised Training (MERT), which incorporates teacher models to provide pseudo labels in the masked language modelling (MLM) style acoustic pre-training. Experimental results indicate that our model can generalise and perform well on 14 music understanding tasks and attain state-of-the-art (SOTA) overall scores.
arXiv Detail & Related papers (2023-04-07T11:08:31Z) - Anomalous Sound Detection using Audio Representation with Machine ID based Contrastive Learning Pretraining [52.191658157204856]
This paper uses contrastive learning to refine audio representations for each machine ID, rather than for each audio sample.
The proposed two-stage method uses contrastive learning to pretrain the audio representation model.
Experiments show that our method outperforms the state-of-the-art methods using contrastive learning or self-supervised classification.
arXiv Detail & Related papers (2023-04-07T11:08:31Z) - Exploring the Efficacy of Pre-trained Checkpoints in Text-to-Music
Generation Task [86.72661027591394]
We generate complete and semantically consistent symbolic music scores from text descriptions.
We explore the efficacy of using publicly available checkpoints for natural language processing in the task of text-to-music generation.
Our experimental results show that the improvement from using pre-trained checkpoints is statistically significant in terms of BLEU score and edit distance similarity.
arXiv Detail & Related papers (2022-11-21T07:19:17Z) - Contrastive Audio-Visual Masked Autoencoder [85.53776628515561]
Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE)
Our fully self-supervised pretrained CAV-MAE achieves a new SOTA accuracy of 65.9% on VGGSound.
arXiv Detail & Related papers (2022-10-02T07:29:57Z) - Multi-label Sound Event Retrieval Using a Deep Learning-based Siamese
Structure with a Pairwise Presence Matrix [11.54047475139282]
State of the art sound event retrieval models have focused on single-label audio recordings.
We propose different Deep Learning architectures with a Siamese-structure and a Pairwise Presence Matrix.
The networks are trained and evaluated using the SONYC-UST dataset containing both single- and multi-label soundscape recordings.
arXiv Detail & Related papers (2020-02-20T21:33:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.