Pianist Transformer: Towards Expressive Piano Performance Rendering via Scalable Self-Supervised Pre-Training
- URL: http://arxiv.org/abs/2512.02652v1
- Date: Tue, 02 Dec 2025 11:13:29 GMT
- Title: Pianist Transformer: Towards Expressive Piano Performance Rendering via Scalable Self-Supervised Pre-Training
- Authors: Hong-Jie You, Jie-Jing Shao, Xiao-Wen Yang, Lin-Han Jia, Lan-Zhe Guo, Yu-Feng Li
- Abstract summary: Pianist Transformer introduces a unified Musical Instrument Digital Interface (MIDI) data representation for learning the shared principles of musical structure and expression without explicit annotation. It achieves strong objective metrics and human-level subjective ratings. Overall, Pianist Transformer establishes a scalable path toward human-like performance synthesis in the music domain.
- Score: 26.885642751756695
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing methods for expressive music performance rendering rely on supervised learning over small labeled datasets, which limits scaling of both data volume and model size, despite the availability of vast unlabeled music, as in vision and language. To address this gap, we introduce Pianist Transformer, with four key contributions: 1) a unified Musical Instrument Digital Interface (MIDI) data representation for learning the shared principles of musical structure and expression without explicit annotation; 2) an efficient asymmetric architecture, enabling longer contexts and faster inference without sacrificing rendering quality; 3) a self-supervised pre-training pipeline with 10B tokens and a 135M-parameter model, unlocking data and model scaling advantages for expressive performance rendering; 4) a state-of-the-art performance model, which achieves strong objective metrics and human-level subjective ratings. Overall, Pianist Transformer establishes a scalable path toward human-like performance synthesis in the music domain.
Related papers
- SyMuPe: Affective and Controllable Symbolic Music Performance [0.00746020873338928]
We present SyMuPe, a novel framework for developing and training affective and controllable piano performance models. Our flagship model, PianoFlow, uses conditional flow matching trained to solve diverse multi-mask performance inpainting tasks. For emotion control, we present and analyze samples generated under different text conditioning scenarios.
arXiv Detail & Related papers (2025-11-05T12:42:08Z) - Two Web Toolkits for Multimodal Piano Performance Dataset Acquisition and Fingering Annotation [56.318475235705954]
We present an integrated web toolkit comprising two graphical user interfaces (GUIs). PiaRec supports the synchronized acquisition of audio, video, MIDI, and performance metadata. ASDF enables the efficient annotation of performer fingering from the visual data.
arXiv Detail & Related papers (2025-09-18T17:59:24Z) - Amadeus: Autoregressive Model with Bidirectional Attribute Modelling for Symbolic Music [47.95375326361059]
We introduce Amadeus, a novel symbolic music generation framework. Amadeus adopts an autoregressive model for note sequences and a bidirectional discrete diffusion model for attributes. We conduct extensive experiments on unconditional and text-conditioned generation tasks.
arXiv Detail & Related papers (2025-08-28T11:15:44Z) - Scaling Self-Supervised Representation Learning for Symbolic Piano Performance [52.661197827466886]
We study the capabilities of generative autoregressive transformer models trained on large amounts of symbolic solo-piano transcriptions. We use a comparatively smaller, high-quality subset to finetune models to produce musical continuations, perform symbolic classification tasks, and produce general-purpose contrastive MIDI embeddings.
arXiv Detail & Related papers (2025-06-30T14:00:14Z) - FürElise: Capturing and Physically Synthesizing Hand Motions of Piano Performance [15.909113091360206]
Hand motion models with the sophistication to accurately recreate piano playing have a wide range of applications in character animation, embodied AI, biomechanics, and VR/AR.
In this paper, we construct a first-of-its-kind large-scale dataset that contains approximately 10 hours of 3D hand motion and audio from 15 elite-level pianists playing 153 pieces of classical music.
arXiv Detail & Related papers (2024-10-08T08:21:05Z) - End-to-end Piano Performance-MIDI to Score Conversion with Transformers [26.900974153235456]
We present an end-to-end deep learning approach that constructs detailed musical scores directly from real-world piano performance-MIDI files.
We introduce a modern transformer-based architecture with a novel tokenized representation for symbolic music data.
Our method is also the first to directly predict notational details like trill marks or stem direction from performance data.
arXiv Detail & Related papers (2024-09-30T20:11:37Z) - Reconstructing Human Expressiveness in Piano Performances with a
Transformer Network [1.5883812630616518]
We propose a novel approach for reconstructing human expressiveness in piano performance with a multi-layer bi-directional Transformer encoder.
To address the needs for accurately captured and score-aligned performance data in training neural networks, we use transcribed scores obtained from an existing transcription model to train our model.
arXiv Detail & Related papers (2023-06-09T17:05:53Z) - Exploring the Efficacy of Pre-trained Checkpoints in Text-to-Music
Generation Task [86.72661027591394]
We generate complete and semantically consistent symbolic music scores from text descriptions.
We explore the efficacy of using publicly available checkpoints for natural language processing in the task of text-to-music generation.
Our experimental results show that the improvement from using pre-trained checkpoints is statistically significant in terms of BLEU score and edit distance similarity.
arXiv Detail & Related papers (2022-11-21T07:19:17Z) - BERT-like Pre-training for Symbolic Piano Music Classification Tasks [15.02723006489356]
This article presents a benchmark study of symbolic piano music classification using the Bidirectional Representations from Transformers (BERT) approach.
We pre-train two 12-layer Transformer models using the BERT approach and fine-tune them for four downstream classification tasks.
Our evaluation shows that the BERT approach leads to higher classification accuracy than recurrent neural network (RNN)-based baselines.
arXiv Detail & Related papers (2021-07-12T07:03:57Z) - End-to-end Audio-visual Speech Recognition with Conformers [65.30276363777514]
We present a hybrid CTC/Attention model based on a ResNet-18 and Convolution-augmented transformer (Conformer).
In particular, the audio and visual encoders learn to extract features directly from raw pixels and audio waveforms.
We show that our proposed models raise the state-of-the-art performance by a large margin in audio-only, visual-only, and audio-visual experiments.
arXiv Detail & Related papers (2021-02-12T18:00:08Z) - Music Gesture for Visual Sound Separation [121.36275456396075]
"Music Gesture" is a keypoint-based structured representation to explicitly model the body and finger movements of musicians when they perform music.
We first adopt a context-aware graph network to integrate visual semantic context with body dynamics, and then apply an audio-visual fusion model to associate body movements with the corresponding audio signals.
arXiv Detail & Related papers (2020-04-20T17:53:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.