PianoVAM: A Multimodal Piano Performance Dataset
- URL: http://arxiv.org/abs/2509.08800v1
- Date: Wed, 10 Sep 2025 17:35:58 GMT
- Title: PianoVAM: A Multimodal Piano Performance Dataset
- Authors: Yonghyun Kim, Junhyung Park, Joonhyung Bae, Kirak Kim, Taegyun Kwon, Alexander Lerch, Juhan Nam
- Abstract summary: PianoVAM is a comprehensive piano performance dataset that includes videos, audio, MIDI, hand landmarks, fingering labels, and rich metadata. The dataset was recorded using a Disklavier piano, capturing audio and MIDI from amateur pianists during their daily practice sessions. Hand landmarks and fingering labels were extracted using a pretrained hand pose estimation model and a semi-automated fingering annotation algorithm.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The multimodal nature of music performance has driven increasing interest in data beyond the audio domain within the music information retrieval (MIR) community. This paper introduces PianoVAM, a comprehensive piano performance dataset that includes videos, audio, MIDI, hand landmarks, fingering labels, and rich metadata. The dataset was recorded using a Disklavier piano, capturing audio and MIDI from amateur pianists during their daily practice sessions, alongside synchronized top-view videos in realistic and varied performance conditions. Hand landmarks and fingering labels were extracted using a pretrained hand pose estimation model and a semi-automated fingering annotation algorithm. We discuss the challenges encountered during data collection and the alignment process across different modalities. Additionally, we describe our fingering annotation method based on hand landmarks extracted from videos. Finally, we present benchmarking results for both audio-only and audio-visual piano transcription using the PianoVAM dataset and discuss additional potential applications.
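The abstract describes a semi-automated fingering annotation method that pairs each detected keypress with hand landmarks extracted from video. As a rough illustration only (this is not the paper's algorithm), a minimal nearest-fingertip assignment over an assumed, evenly spaced key layout in normalized image coordinates might look like:

```python
# Hypothetical sketch: assign a finger to a pressed key by nearest fingertip.
# Assumes fingertip x-positions in normalized top-view image coordinates and
# a linear, evenly spaced 88-key layout; the keyboard bounds below are made up.

KEYBOARD_LEFT = 0.05   # assumed normalized x of the leftmost key edge
KEYBOARD_RIGHT = 0.95  # assumed normalized x of the rightmost key edge
NUM_KEYS = 88

def key_center_x(midi_pitch: int) -> float:
    """Approximate x-position of a key center, treating keys as evenly spaced."""
    index = midi_pitch - 21  # MIDI 21 (A0) is the lowest piano key
    width = (KEYBOARD_RIGHT - KEYBOARD_LEFT) / NUM_KEYS
    return KEYBOARD_LEFT + (index + 0.5) * width

def assign_finger(midi_pitch: int, fingertips: dict[int, float]) -> int:
    """Return the finger id whose tip x-position is closest to the pressed key."""
    target = key_center_x(midi_pitch)
    return min(fingertips, key=lambda f: abs(fingertips[f] - target))

# Example: right-hand thumb (1) through pinky (5) at increasing x-positions.
tips = {1: 0.40, 2: 0.44, 3: 0.48, 4: 0.52, 5: 0.56}
print(assign_finger(60, tips))  # middle C (MIDI 60) → 2
```

A real pipeline would also have to handle black-key geometry, perspective distortion, two overlapping hands, and frames where landmarks are missing, which is presumably why the paper's annotation is only semi-automated.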
Related papers
- Two Web Toolkits for Multimodal Piano Performance Dataset Acquisition and Fingering Annotation
We present an integrated web toolkit comprising two graphical user interfaces (GUIs). PiaRec supports the synchronized acquisition of audio, video, MIDI, and performance metadata. ASDF enables the efficient annotation of performer fingering from the visual data.
arXiv Detail & Related papers (2025-09-18T17:59:24Z)
- Towards Video to Piano Music Generation with Chain-of-Perform Support Benchmarks
The Chain-of-Perform (CoP) benchmark is a fully open-sourced, multimodal benchmark for video-guided piano music generation. It offers detailed multimodal annotations, enabling precise semantic and temporal alignment between video content and piano audio. The dataset is publicly available at https://github.com/acappemin/Video-to-Audio-and-Piano.
arXiv Detail & Related papers (2025-05-26T14:24:19Z)
- PIAST: A Multimodal Piano Dataset with Audio, Symbolic and Text
PIAST (PIano dataset with Audio, Symbolic, and Text) is a piano music dataset.
We collected 9,673 tracks from YouTube and added human annotations for 2,023 tracks by music experts.
Both subsets include audio, text, and tag annotations, as well as MIDI transcribed using state-of-the-art piano transcription and beat-tracking models.
arXiv Detail & Related papers (2024-11-04T19:34:13Z)
- FürElise: Capturing and Physically Synthesizing Hand Motions of Piano Performance
Hand motion models with the sophistication to accurately recreate piano playing have a wide range of applications in character animation, embodied AI, biomechanics, and VR/AR.
In this paper, we construct a first-of-its-kind large-scale dataset that contains approximately 10 hours of 3D hand motion and audio from 15 elite-level pianists playing 153 pieces of classical music.
arXiv Detail & Related papers (2024-10-08T08:21:05Z)
- PianoMotion10M: Dataset and Benchmark for Hand Motion Generation in Piano Performance
We construct a piano-hand motion generation benchmark to guide hand movements and fingerings for piano playing. To this end, we collect an annotated dataset, PianoMotion10M, consisting of 116 hours of piano playing videos from a bird's-eye view with 10 million annotated hand poses.
arXiv Detail & Related papers (2024-06-13T17:05:23Z)
- MARBLE: Music Audio Representation Benchmark for Universal Evaluation
We introduce the Music Audio Representation Benchmark for universaL Evaluation, termed MARBLE.
It aims to provide a benchmark for various Music Information Retrieval (MIR) tasks by defining a comprehensive taxonomy with four hierarchy levels, including acoustic, performance, score, and high-level description.
We then establish a unified protocol based on 14 tasks across 8 publicly available datasets, providing a fair and standard assessment of representations from all open-sourced pretrained models developed on music recordings as baselines.
arXiv Detail & Related papers (2023-06-18T12:56:46Z)
- RMSSinger: Realistic-Music-Score based Singing Voice Synthesis
RMS-SVS aims to generate high-quality singing voices given realistic music scores with different note types.
We propose RMSSinger, the first RMS-SVS method, which takes realistic music scores as input.
In RMSSinger, we introduce word-level modeling to avoid the time-consuming phoneme duration annotation and the complicated phoneme-level mel-note alignment.
arXiv Detail & Related papers (2023-05-18T03:57:51Z)
- At Your Fingertips: Extracting Piano Fingering Instructions from Videos
We consider the AI task of automating the extraction of fingering information from videos.
We show how to perform this task with high accuracy using a combination of deep-learning modules.
We run the resulting system on 90 videos, yielding high-quality piano fingering information for 150K notes.
arXiv Detail & Related papers (2023-03-07T09:09:13Z)
- Let's Play Music: Audio-driven Performance Video Generation
We propose a new task named Audio-driven Performance Video Generation (APVG).
APVG aims to synthesize the video of a person playing a certain instrument guided by a given music audio clip.
arXiv Detail & Related papers (2020-11-05T03:13:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.