DISCO-10M: A Large-Scale Music Dataset
- URL: http://arxiv.org/abs/2306.13512v2
- Date: Thu, 5 Oct 2023 09:45:00 GMT
- Title: DISCO-10M: A Large-Scale Music Dataset
- Authors: Luca A. Lanzendörfer, Florian Grötschla, Emil Funke, Roger Wattenhofer
- Abstract summary: We present DISCO-10M, a novel and extensive music dataset.
It surpasses the largest previously available music dataset by an order of magnitude.
We aim to democratize and facilitate new research to help advance the development of novel machine learning models for music.
- Score: 20.706469085872516
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Music datasets play a crucial role in advancing research in machine learning
for music. However, existing music datasets suffer from limited size,
accessibility, and lack of audio resources. To address these shortcomings, we
present DISCO-10M, a novel and extensive music dataset that surpasses the
largest previously available music dataset by an order of magnitude. To ensure
high-quality data, we implement a multi-stage filtering process. This process
incorporates similarities based on textual descriptions and audio embeddings.
Moreover, we provide precomputed CLAP embeddings alongside DISCO-10M,
facilitating direct application on various downstream tasks. These embeddings
enable efficient exploration of machine learning applications on the provided
data. With DISCO-10M, we aim to democratize and facilitate new research to help
advance the development of novel machine learning models for music.
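The abstract mentions filtering based on similarities between textual descriptions and audio embeddings, but gives no details. As a rough illustration only, the sketch below keeps tracks whose audio embedding has a high enough cosine similarity to a text embedding. The function names, the toy 3-dimensional vectors, and the threshold value are all assumptions made for this example; they are not the authors' pipeline, and real CLAP embeddings have many more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Treat a zero vector as dissimilar to everything.
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_tracks(tracks, text_embedding, threshold=0.3):
    """Return ids of tracks whose audio embedding is close to a text embedding.

    `tracks` maps a (hypothetical) track id to its audio embedding.
    `text_embedding` and `threshold` are illustrative assumptions.
    """
    kept = []
    for track_id, audio_embedding in tracks.items():
        if cosine_similarity(audio_embedding, text_embedding) >= threshold:
            kept.append(track_id)
    return kept


# Toy vectors standing in for precomputed embeddings.
toy_tracks = {
    "track_a": [0.9, 0.1, 0.0],
    "track_b": [0.0, 0.2, 0.9],
    "track_c": [0.7, 0.6, 0.1],
}
toy_text_embedding = [1.0, 0.0, 0.0]

kept_ids = filter_tracks(toy_tracks, toy_text_embedding, threshold=0.5)
print(kept_ids)
```

With precomputed embeddings, a filter of this kind needs only vector arithmetic, which is one way such embeddings can make exploration of the data cheaper.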
Related papers
- Toward a More Complete OMR Solution [49.74172035862698]
Optical music recognition aims to convert music notation into digital formats.
One approach to tackle OMR is through a multi-stage pipeline, where the system first detects visual music notation elements in the image.
We introduce a music object detector based on YOLOv8, which improves detection performance.
Second, we introduce a supervised training pipeline that completes the notation assembly stage based on detection output.
arXiv Detail & Related papers (2024-08-31T01:09:12Z)
- Development of Large Annotated Music Datasets using HMM-based Forced Viterbi Alignment [0.0]
We propose a well-streamlined and efficient method for generating datasets for any instrument.
The onsets of the transcriptions are manually verified and the labels are accurate up to 10ms, averaging at 5ms.
This method will aid as a preliminary step towards building concrete datasets for building AMT systems for different instruments.
arXiv Detail & Related papers (2024-08-27T09:06:29Z)
- MuPT: A Generative Symbolic Music Pretrained Transformer [56.09299510129221]
We explore the application of Large Language Models (LLMs) to the pre-training of music.
To address the challenges associated with misaligned measures from different tracks during generation, we propose a Synchronized Multi-Track ABC Notation (SMT-ABC Notation).
Our contributions include a series of models capable of handling up to 8192 tokens, covering 90% of the symbolic music data in our training set.
arXiv Detail & Related papers (2024-04-09T15:35:52Z)
- WikiMuTe: A web-sourced dataset of semantic descriptions for music audio [7.4327407361824935]
We present WikiMuTe, a new and open dataset containing rich semantic descriptions of music.
The data is sourced from Wikipedia's rich catalogue of articles covering musical works.
We train a model that jointly learns text and audio representations and performs cross-modal retrieval.
arXiv Detail & Related papers (2023-12-14T18:38:02Z)
- Proceedings of the 5th International Workshop on Reading Music Systems [57.35718206110128]
The 5th International Workshop on Reading Music Systems was held in Milan, Italy, on Nov. 4th, 2023.
The workshop tries to connect researchers who develop systems for reading music with other researchers and practitioners who could benefit from such systems.
arXiv Detail & Related papers (2023-11-07T16:00:42Z)
- Related Rhythms: Recommendation System To Discover Music You May Like [2.7152798636894193]
In this paper, a distributed Machine Learning pipeline is delineated, which is capable of taking a subset of songs as input and producing a new subset of songs identified as being similar to the inputted subset.
The publicly accessible Million Songs dataset (MSD) enables researchers to develop and explore reasonably efficient systems for audio track analysis and recommendations.
The objective of the proposed application is to leverage an ML system trained to optimally recommend songs that a user might like.
arXiv Detail & Related papers (2023-09-24T04:18:40Z)
- Self-Supervised Contrastive Learning for Robust Audio-Sheet Music Retrieval Systems [3.997809845676912]
We show that self-supervised contrastive learning can mitigate the scarcity of annotated data from real music content.
We employ the snippet embeddings in the higher-level task of cross-modal piece identification.
In this work, we observe that the retrieval quality improves from 30% up to 100% when real music data is present.
arXiv Detail & Related papers (2023-09-21T14:54:48Z)
- MusiLingo: Bridging Music and Text with Pre-trained Language Models for Music Captioning and Query Response [42.73982391253872]
MusiLingo is a novel system for music caption generation and music-related query responses.
We train it on an extensive music caption dataset and fine-tune it with instructional data.
Empirical evaluations demonstrate its competitive performance in generating music captions and composing music-related Q&A pairs.
arXiv Detail & Related papers (2023-09-15T19:31:40Z)
- MARBLE: Music Audio Representation Benchmark for Universal Evaluation [79.25065218663458]
We introduce the Music Audio Representation Benchmark for universaL Evaluation, termed MARBLE.
It aims to provide a benchmark for various Music Information Retrieval (MIR) tasks by defining a comprehensive taxonomy with four hierarchy levels, including acoustic, performance, score, and high-level description.
We then establish a unified protocol based on 14 tasks on 8 publicly available datasets, providing a fair and standard assessment of representations of all open-sourced pre-trained models developed on music recordings as baselines.
arXiv Detail & Related papers (2023-06-18T12:56:46Z)
- A Dataset for Greek Traditional and Folk Music: Lyra [69.07390994897443]
This paper presents a dataset for Greek Traditional and Folk music that includes 1570 pieces, totaling around 80 hours of data.
The dataset incorporates timestamped YouTube links for retrieving audio and video, along with rich metadata on instrumentation, geography, and genre.
arXiv Detail & Related papers (2022-11-21T14:15:43Z)
- dMelodies: A Music Dataset for Disentanglement Learning [70.90415511736089]
We present a new symbolic music dataset that will help researchers demonstrate the efficacy of their algorithms on diverse domains.
This will also provide a means for evaluating algorithms specifically designed for music.
The dataset is large enough (approx. 1.3 million data points) to train and test deep networks for disentanglement learning.
arXiv Detail & Related papers (2020-07-29T19:20:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.