Two Web Toolkits for Multimodal Piano Performance Dataset Acquisition and Fingering Annotation
- URL: http://arxiv.org/abs/2509.15222v1
- Date: Thu, 18 Sep 2025 17:59:24 GMT
- Title: Two Web Toolkits for Multimodal Piano Performance Dataset Acquisition and Fingering Annotation
- Authors: Junhyung Park, Yonghyun Kim, Joonhyung Bae, Kirak Kim, Taegyun Kwon, Alexander Lerch, Juhan Nam
- Abstract summary: We present an integrated web toolkit comprising two graphical user interfaces (GUIs): PiaRec, which supports the synchronized acquisition of audio, video, MIDI, and performance metadata, and ASDF, which enables the efficient annotation of performer fingering from the visual data.
- Score: 56.318475235705954
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Piano performance is a multimodal activity that intrinsically combines physical actions with the acoustic rendition. Despite growing research interest in analyzing the multimodal nature of piano performance, the laborious process of acquiring large-scale multimodal data remains a significant bottleneck, hindering further progress in this field. To overcome this barrier, we present an integrated web toolkit comprising two graphical user interfaces (GUIs): (i) PiaRec, which supports the synchronized acquisition of audio, video, MIDI, and performance metadata. (ii) ASDF, which enables the efficient annotation of performer fingering from the visual data. Collectively, this system can streamline the acquisition of multimodal piano performance datasets.
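A recorder like PiaRec has to keep four streams on one timeline. Below is a minimal sketch of that idea, assuming a shared monotonic clock; the stream names and data structures are invented for illustration, since the toolkit's internal design is not described at this level of detail.

```python
# Sketch: stamping every modality against one shared monotonic clock so that
# cross-stream offsets stay consistent across a recording session.
import time
from dataclasses import dataclass, field

@dataclass
class StreamEvent:
    stream: str      # e.g. "audio", "video", "midi" (assumed names)
    t_mono: float    # timestamp relative to the session's shared clock
    payload: object

@dataclass
class SessionRecorder:
    events: list = field(default_factory=list)
    t0: float = field(default_factory=time.monotonic)

    def stamp(self, stream, payload):
        # All modalities use the same clock origin, so they can be
        # interleaved later without per-stream offset correction.
        self.events.append(StreamEvent(stream, time.monotonic() - self.t0, payload))

    def aligned(self):
        # One synchronized timeline across all captured streams.
        return sorted(self.events, key=lambda e: e.t_mono)

rec = SessionRecorder()
rec.stamp("midi", {"note": 60, "velocity": 80})
rec.stamp("audio", b"\x00\x01")  # dummy audio buffer
for e in rec.aligned():
    print(f"{e.t_mono:.6f}s  {e.stream}")
```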
Related papers
- PianoVAM: A Multimodal Piano Performance Dataset [56.318475235705954]
PianoVAM is a comprehensive piano performance dataset that includes videos, audio, MIDI, hand landmarks, fingering labels, and rich metadata. The dataset was recorded using a Disklavier piano, capturing audio and MIDI from amateur pianists during their daily practice sessions. Hand landmarks and fingering labels were extracted using a pretrained hand pose estimation model and a semi-automated fingering annotation algorithm.
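The abstract does not spell out the semi-automated fingering algorithm; one plausible core step, sketched below with invented landmark values, is to assign each MIDI note-on to the finger whose detected fingertip lies closest to the pressed key.

```python
# Hypothetical heuristic (not PianoVAM's documented method): pick the finger
# whose fingertip landmark is horizontally nearest to the pressed key.
FINGER_NAMES = {1: "thumb", 2: "index", 3: "middle", 4: "ring", 5: "pinky"}

def assign_finger(key_x: float, fingertips: dict[int, float]) -> int:
    """fingertips maps finger number (1-5) to that fingertip's x pixel."""
    return min(fingertips, key=lambda f: abs(fingertips[f] - key_x))

# Fingertip x-positions as a hand pose estimator might report them (pixels).
tips = {1: 412.0, 2: 448.5, 3: 480.2, 4: 510.9, 5: 538.3}
finger = assign_finger(key_x=452.0, fingertips=tips)
print(FINGER_NAMES[finger])  # -> "index"
```

In practice such a heuristic would only propose labels; the "semi-automated" framing suggests a human annotator still verifies ambiguous cases.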
arXiv Detail & Related papers (2025-09-10T17:35:58Z) - DEL: Dense Event Localization for Multi-modal Audio-Visual Understanding [13.256830504062332]
We introduce DEL, a framework for dense semantic action localization. DEL aims to accurately detect and classify multiple actions at fine-grained temporal resolutions in long untrimmed videos.
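DEL's architecture is not detailed in the abstract, but dense temporal localization pipelines commonly end with a suppression pass over scored segment proposals. A generic temporal non-maximum suppression sketch, not DEL's actual method:

```python
# Generic temporal NMS: keep the highest-scoring segments, dropping any
# proposal that overlaps an already-kept segment too strongly.
def temporal_iou(a, b):
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def temporal_nms(proposals, iou_thresh=0.5):
    """proposals: list of (start_s, end_s, score); returns kept segments."""
    kept = []
    for seg in sorted(proposals, key=lambda p: p[2], reverse=True):
        if all(temporal_iou(seg, k) < iou_thresh for k in kept):
            kept.append(seg)
    return kept

print(temporal_nms([(0.0, 2.0, 0.9), (0.5, 2.2, 0.7), (3.0, 4.0, 0.8)]))
```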
arXiv Detail & Related papers (2025-06-29T11:50:19Z) - MAGMaR Shared Task System Description: Video Retrieval with OmniEmbed [55.526939500742]
We use OmniEmbed, a powerful multimodal embedding model from the Tevatron 2.0 toolkit, to generate unified embeddings for text, images, audio, and video. Our submission achieved the highest score on the MAGMaR shared task leaderboard among public submissions as of May 20th, 2025.
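Once every modality maps into one embedding space, retrieval reduces to similarity ranking. A toy sketch with random stand-in vectors; OmniEmbed's real API is not shown here.

```python
# Rank candidate videos by cosine similarity to a text-query embedding.
import numpy as np

rng = np.random.default_rng(0)
query = rng.normal(size=768)        # stand-in embedding of the text query
videos = rng.normal(size=(5, 768))  # one unified embedding per video

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

scores = [cosine(query, v) for v in videos]
ranking = np.argsort(scores)[::-1]
print("best match: video", ranking[0])
```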
arXiv Detail & Related papers (2025-06-11T05:40:26Z) - Towards Video to Piano Music Generation with Chain-of-Perform Support Benchmarks [6.895278984923356]
The Chain-of-Perform (CoP) benchmark is a fully open-sourced, multimodal benchmark for video-guided piano music generation. It offers detailed multimodal annotations, enabling precise semantic and temporal alignment between video content and piano audio. The dataset is publicly available at https://github.com/acappemin/Video-to-Audio-and-Piano.
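Precise temporal alignment between video and audio ultimately comes down to clock conversion. A sketch with a purely hypothetical annotation record and assumed frame and sample rates; CoP's actual schema is not reproduced in the abstract.

```python
# Convert an annotated video-frame index into the matching audio-sample
# offset. All constants and the record layout below are assumptions.
FPS = 30             # assumed video frame rate
SAMPLE_RATE = 44100  # assumed audio sample rate

annotation = {"event": "chord_onset", "frame": 450}  # hypothetical record

t_seconds = annotation["frame"] / FPS
sample_offset = round(t_seconds * SAMPLE_RATE)
print(f"{annotation['event']}: {t_seconds:.2f}s -> sample {sample_offset}")
```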
arXiv Detail & Related papers (2025-05-26T14:24:19Z) - VIMI: Grounding Video Generation through Multi-modal Instruction [89.90065445082442]
Existing text-to-video diffusion models rely solely on text-only encoders for their pretraining.
We construct a large-scale multimodal prompt dataset by employing retrieval methods to pair in-context examples with the given text prompts.
We finetune the model from the first stage on three video generation tasks, incorporating multi-modal instructions.
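The retrieval-based pairing can be illustrated with a toy nearest-neighbor search over random stand-in embeddings; VIMI's actual retrieval method is only named, not specified, in the abstract.

```python
# For each text-prompt embedding, retrieve its nearest neighbors from a
# pool of candidate in-context examples to form (prompt, examples) pairs.
import numpy as np

rng = np.random.default_rng(1)
prompts = rng.normal(size=(3, 64))  # stand-in prompt embeddings
pool = rng.normal(size=(100, 64))   # stand-in candidate-example embeddings

def top_k(query, pool, k=2):
    sims = pool @ query / (np.linalg.norm(pool, axis=1) * np.linalg.norm(query))
    return np.argsort(sims)[::-1][:k]

dataset = [(i, top_k(p, pool)) for i, p in enumerate(prompts)]
for prompt_id, example_ids in dataset:
    print(f"prompt {prompt_id} paired with examples {example_ids}")
```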
arXiv Detail & Related papers (2024-07-08T18:12:49Z) - Multi-view MidiVAE: Fusing Track- and Bar-view Representations for Long Multi-track Symbolic Music Generation [50.365392018302416]
We propose Multi-view MidiVAE, one of the first VAE methods to effectively model and generate long multi-track symbolic music.
It captures instrumental characteristics and harmony, as well as global and local information about the composition, through a hybrid variational encoding-decoding strategy.
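A toy numpy sketch of the general two-view idea, with stand-in random "encoders" (the paper's actual architecture is not reproduced here): each view yields its own latent Gaussian, the latents are fused, and sampling uses the standard reparameterization trick.

```python
# Two-view variational encoding with fused latents (illustrative only).
import numpy as np

rng = np.random.default_rng(2)

def encode(view, out_dim=8):
    # Stand-in "encoder": a fixed random projection to (mu, log_var).
    W = rng.normal(size=(view.size, 2 * out_dim)) / np.sqrt(view.size)
    h = view.ravel() @ W
    return h[:out_dim], h[out_dim:]  # mu, log_var

track_view = rng.normal(size=(4, 16))  # e.g. per-track features (assumed)
bar_view = rng.normal(size=(8, 16))    # e.g. per-bar features (assumed)

mu_t, lv_t = encode(track_view)
mu_b, lv_b = encode(bar_view)
mu, log_var = np.concatenate([mu_t, mu_b]), np.concatenate([lv_t, lv_b])

# Reparameterization: z = mu + sigma * eps, with eps ~ N(0, I)
z = mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)
print("fused latent shape:", z.shape)  # (16,)
```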
arXiv Detail & Related papers (2024-01-15T08:41:01Z) - Single-Shot and Multi-Shot Feature Learning for Multi-Object Tracking [55.13878429987136]
We propose a simple yet effective two-stage feature learning paradigm to jointly learn single-shot and multi-shot features for different targets.
Our method achieves significant improvements on the MOT17 and MOT20 datasets while reaching state-of-the-art performance on the DanceTrack dataset.
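The learned features feed a standard association step. Below is a sketch of that downstream matching with the Hungarian algorithm on a cosine cost matrix; the two-stage feature learning itself is not shown, and the feature dimensions are assumptions.

```python
# Match live-track appearance features to current-frame detection features
# via minimum-cost assignment over cosine distance.
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(3)
tracks = rng.normal(size=(3, 128))      # stand-in features of live tracks
detections = rng.normal(size=(4, 128))  # stand-in features of detections

def normed(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

cost = 1.0 - normed(tracks) @ normed(detections).T  # cosine distance
track_idx, det_idx = linear_sum_assignment(cost)
for t, d in zip(track_idx, det_idx):
    print(f"track {t} -> detection {d} (cost {cost[t, d]:.3f})")
```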
arXiv Detail & Related papers (2023-11-17T08:17:49Z)