Related papers: PIAST: A Multimodal Piano Dataset with Audio, Symbolic and Text

URL: http://arxiv.org/abs/2411.02551v2
Date: Thu, 07 Nov 2024 07:18:51 GMT
Title: PIAST: A Multimodal Piano Dataset with Audio, Symbolic and Text
Authors: Hayeon Bang, Eunjin Choi, Megan Finch, Seungheon Doh, Seolhee Lee, Gyeong-Hoon Lee, Juhan Nam
Abstract summary: PIAST (PIano dataset with Audio, Symbolic, and Text) is a piano music dataset. We collected 9,673 tracks from YouTube and added human annotations for 2,023 tracks by music experts. Both include audio, text, tag annotations, and transcribed MIDI utilizing state-of-the-art piano transcription and beat tracking models.
Score: 8.382511298208003
License: http://creativecommons.org/licenses/by/4.0/
Abstract: While piano music has become a significant area of study in Music Information Retrieval (MIR), there is a notable lack of datasets for piano solo music with text labels. To address this gap, we present PIAST (PIano dataset with Audio, Symbolic, and Text), a piano music dataset. Utilizing a piano-specific taxonomy of semantic tags, we collected 9,673 tracks from YouTube and added human annotations for 2,023 tracks by music experts, resulting in two subsets: PIAST-YT and PIAST-AT. Both include audio, text, tag annotations, and transcribed MIDI utilizing state-of-the-art piano transcription and beat tracking models. Among many possible tasks with the multi-modal dataset, we conduct music tagging and retrieval using both audio and MIDI data and report baseline performances to demonstrate its potential as a valuable resource for MIR research.

Related papers

Discovering "Words" in Music: Unsupervised Learning of Compositional Sparse Code for Symbolic Music [50.87225308217594]
This paper presents an unsupervised machine learning algorithm that identifies recurring patterns -- referred to as "music-words" -- from symbolic music data. We formulate the task of music-word discovery as a statistical optimization problem and propose a two-stage Expectation-Maximization (EM)-based learning framework.
arXiv Detail & Related papers (2025-09-29T11:10:57Z)
PianoVAM: A Multimodal Piano Performance Dataset [56.318475235705954]
PianoVAM is a comprehensive piano performance dataset that includes videos, audio, MIDI, hand landmarks, fingering labels, and rich metadata. The dataset was recorded using a Disklavier piano, capturing audio and MIDI from amateur pianists during their daily practice sessions. Hand landmarks and fingering labels were extracted using a pretrained hand pose estimation model and a semi-automated fingering annotation algorithm.
arXiv Detail & Related papers (2025-09-10T17:35:58Z)
Towards an AI Musician: Synthesizing Sheet Music Problems for Musical Reasoning [69.78158549955384]
We introduce a novel approach that treats core music theory rules, such as those governing beats and intervals, as programmatic functions. This approach generates verifiable sheet music questions in both textual and visual modalities. Evaluation results on SSMR-Bench highlight the key role reasoning plays in interpreting sheet music.
arXiv Detail & Related papers (2025-09-04T09:42:17Z)
Aria-MIDI: A Dataset of Piano MIDI Files for Symbolic Music Modeling [1.3812010983144802]
We introduce an extensive new dataset of MIDI files, created by transcribing audio recordings of piano performances into their constituent notes. The data pipeline we use is multi-stage, employing a language model to autonomously crawl and score audio recordings from the internet. The resulting dataset contains over one million distinct MIDI files, comprising roughly 100,000 hours of transcribed audio.
arXiv Detail & Related papers (2025-04-21T12:59:40Z)
JamendoMaxCaps: A Large Scale Music-caption Dataset with Imputed Metadata [6.230204066837519]
We introduce JamendoMaxCaps, a large-scale music-caption dataset featuring over 200,000 freely licensed instrumental tracks from the renowned Jamendo platform. The dataset includes captions generated by a state-of-the-art captioning model, enhanced with imputed metadata.
arXiv Detail & Related papers (2025-02-11T11:12:19Z)
Enriching Music Descriptions with a Finetuned-LLM and Metadata for Text-to-Music Retrieval [7.7464988473650935]
Text-to-Music Retrieval plays a pivotal role in content discovery within extensive music databases. This paper proposes an improved Text-to-Music Retrieval model, denoted as TTMR++.
arXiv Detail & Related papers (2024-10-04T09:33:34Z)
Cluster and Separate: a GNN Approach to Voice and Staff Prediction for Score Engraving [5.572472212662453]
This paper approaches the problem of separating the notes from a quantized symbolic music piece (e.g., a MIDI file) into multiple voices and staves. We propose an end-to-end system based on graph neural networks that clusters notes that belong to the same chord and connects them with edges if they are part of a voice.
arXiv Detail & Related papers (2024-07-15T14:36:13Z)
MOSA: Music Motion with Semantic Annotation Dataset for Cross-Modal Music Processing [3.3162176082220975]
We present the MOSA (Music mOtion with Semantic Annotation) dataset, which contains high-quality 3-D motion capture data, aligned audio recordings, and note-by-note semantic annotations of pitch, beat, phrase, dynamic, articulation, and harmony for 742 professional music performances by 23 professional musicians. To our knowledge, this is the largest cross-modal music dataset with note-level annotations to date.
arXiv Detail & Related papers (2024-06-10T15:37:46Z)
MidiCaps: A large-scale MIDI dataset with text captions [6.806050368211496]
This work aims to enable research that combines LLMs with symbolic music by presenting MidiCaps, the first openly available large-scale MIDI dataset with text captions. Inspired by recent advancements in captioning techniques, we present a curated dataset of over 168k MIDI files with textual descriptions.
arXiv Detail & Related papers (2024-06-04T12:21:55Z)
MuPT: A Generative Symbolic Music Pretrained Transformer [56.09299510129221]
We explore the application of Large Language Models (LLMs) to the pre-training of music. To address the challenges associated with misaligned measures from different tracks during generation, we propose a Synchronized Multi-Track ABC Notation (SMT-ABC Notation). Our contributions include a series of models capable of handling up to 8192 tokens, covering 90% of the symbolic music data in our training set.
arXiv Detail & Related papers (2024-04-09T15:35:52Z)
WikiMuTe: A web-sourced dataset of semantic descriptions for music audio [7.4327407361824935]
We present WikiMuTe, a new and open dataset containing rich semantic descriptions of music. The data is sourced from Wikipedia's rich catalogue of articles covering musical works. We train a model that jointly learns text and audio representations and performs cross-modal retrieval.
arXiv Detail & Related papers (2023-12-14T18:38:02Z)
MARBLE: Music Audio Representation Benchmark for Universal Evaluation [79.25065218663458]
We introduce the Music Audio Representation Benchmark for universaL Evaluation, termed MARBLE. It aims to provide a benchmark for various Music Information Retrieval (MIR) tasks by defining a comprehensive taxonomy with four hierarchy levels, including acoustic, performance, score, and high-level description. We then establish a unified protocol based on 14 tasks on 8 public-available datasets, providing a fair and standard assessment of representations of all open-sourced pre-trained models developed on music recordings as baselines.
arXiv Detail & Related papers (2023-06-18T12:56:46Z)
Melody transcription via generative pre-training [86.08508957229348]
A key challenge in melody transcription is building methods which can handle broad audio containing any number of instrument ensembles and musical styles. To confront this challenge, we leverage representations from Jukebox (Dhariwal et al. 2020), a generative model of broad music audio. We derive a new dataset containing 50 hours of melody transcriptions from crowdsourced annotations of broad music.
arXiv Detail & Related papers (2022-12-04T18:09:23Z)
A Dataset for Greek Traditional and Folk Music: Lyra [69.07390994897443]
This paper presents a dataset for Greek Traditional and Folk music that includes 1570 pieces, totaling around 80 hours of data. The dataset incorporates timestamped YouTube links for retrieving audio and video, along with rich metadata regarding instrumentation, geography and genre.
arXiv Detail & Related papers (2022-11-21T14:15:43Z)
A Novel Multi-Task Learning Method for Symbolic Music Emotion Recognition [76.65908232134203]
Symbolic Music Emotion Recognition (SMER) aims to predict music emotion from symbolic data, such as MIDI and MusicXML. In this paper, we present a simple multi-task framework for SMER, which incorporates the emotion recognition task with other emotion-related auxiliary tasks.
arXiv Detail & Related papers (2022-01-15T07:45:10Z)
dMelodies: A Music Dataset for Disentanglement Learning [70.90415511736089]
We present a new symbolic music dataset that will help researchers demonstrate the efficacy of their algorithms on diverse domains. This will also provide a means for evaluating algorithms specifically designed for music. The dataset is large enough (approx. 1.3 million data points) to train and test deep networks for disentanglement learning.
arXiv Detail & Related papers (2020-07-29T19:20:07Z)

This list is automatically generated from the titles and abstracts of the papers on this site.