Related papers: MIDI-VALLE: Improving Expressive Piano Performance Synthesis Through Neural Codec Language Modelling

URL: http://arxiv.org/abs/2507.08530v1
Date: Fri, 11 Jul 2025 12:28:20 GMT
Title: MIDI-VALLE: Improving Expressive Piano Performance Synthesis Through Neural Codec Language Modelling
Authors: Jingjing Tang, Xin Wang, Zhe Zhang, Junichi Yamagishi, Geraint Wiggins, George Fazekas
Abstract summary: We propose MIDI-VALLE, a neural codec language model adapted from the VALLE framework, which was originally designed for zero-shot personalised text-to-speech synthesis. MIDI-VALLE encodes both MIDI and audio as discrete tokens, facilitating a more consistent and robust modelling of piano performances. Evaluation results show that MIDI-VALLE significantly outperforms a state-of-the-art baseline.
Score: 32.78044321881271
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Generating expressive audio performances from music scores requires models to capture both instrument acoustics and human interpretation. Traditional music performance synthesis pipelines follow a two-stage approach, first generating expressive performance MIDI from a score, then synthesising the MIDI into audio. However, the synthesis models often struggle to generalise across diverse MIDI sources, musical styles, and recording environments. To address these challenges, we propose MIDI-VALLE, a neural codec language model adapted from the VALLE framework, which was originally designed for zero-shot personalised text-to-speech (TTS) synthesis. For performance MIDI-to-audio synthesis, we improve the architecture to condition on a reference audio performance and its corresponding MIDI. Unlike previous TTS-based systems that rely on piano rolls, MIDI-VALLE encodes both MIDI and audio as discrete tokens, facilitating a more consistent and robust modelling of piano performances. Furthermore, the model's generalisation ability is enhanced by training on an extensive and diverse piano performance dataset. Evaluation results show that MIDI-VALLE significantly outperforms a state-of-the-art baseline, achieving over 75% lower Frechet Audio Distance on the ATEPP and Maestro datasets. In the listening test, MIDI-VALLE received 202 votes compared to 58 for the baseline, demonstrating improved synthesis quality and generalisation across diverse performance MIDI inputs.
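The two headline figures in the abstract are simple ratios, and a minimal sketch of the arithmetic may help when reading them. The function names `vote_share` and `relative_reduction` are illustrative and do not come from the paper. The vote counts (202 and 58) are from the abstract. The Fréchet Audio Distance values in the second call are hypothetical placeholders, since the abstract only states "over 75% lower" and gives no raw scores.

```python
def vote_share(votes_for: int, votes_against: int) -> float:
    """Fraction of all votes that went to the first system."""
    total = votes_for + votes_against
    if total == 0:
        raise ValueError("no votes recorded")
    return votes_for / total


def relative_reduction(baseline: float, model: float) -> float:
    """Relative drop of a lower-is-better metric with respect to the baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - model) / baseline


# Listening test counts taken from the abstract: 202 votes vs. 58 votes.
share = vote_share(202, 58)
print(f"MIDI-VALLE vote share: {share:.1%}")

# HYPOTHETICAL metric values, chosen only to show what a 75% reduction means;
# they are not results reported in the paper.
reduction = relative_reduction(baseline=4.0, model=1.0)
print(f"Relative reduction: {reduction:.0%}")
```

Read this way, "over 75% lower" means the model's distance is less than a quarter of the baseline's, and 202 of 260 votes is a little over three quarters of the listening-test preferences.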

Related papers

Scaling Self-Supervised Representation Learning for Symbolic Piano Performance [52.661197827466886]
We study the capabilities of generative autoregressive transformer models trained on large amounts of symbolic solo-piano transcriptions. We use a comparatively smaller, high-quality subset to finetune models to produce musical continuations, perform symbolic classification tasks, and produce general-purpose contrastive MIDI embeddings.
arXiv Detail & Related papers (2025-06-30T14:00:14Z)
Fine-Tuning MIDI-to-Audio Alignment using a Neural Network on Piano Roll and CQT Representations [2.3249139042158853]
We present a neural network approach for synchronizing audio recordings of human piano performances with their corresponding loosely aligned MIDI files. The proposed model achieves up to 20% higher alignment accuracy than the industry-standard Dynamic Time Warping (DTW) method.
arXiv Detail & Related papers (2025-06-27T13:59:50Z)
The GigaMIDI Dataset with Features for Expressive Music Performance Detection [5.585625844344932]
The GigaMIDI dataset contains over 1.4 million unique MIDI files, encompassing 1.8 billion MIDI note events and over 5.3 million MIDI tracks. This curated iteration of GigaMIDI encompasses expressively-performed instrument tracks detected by NOMML, constituting 31% of the GigaMIDI dataset.
arXiv Detail & Related papers (2025-02-24T23:39:40Z)
Annotation-Free MIDI-to-Audio Synthesis via Concatenative Synthesis and Generative Refinement [0.0]
CoSaRef is a MIDI-to-audio synthesis method that does not require MIDI-audio paired datasets. It generates a synthetic audio track using concatenative synthesis based on MIDI input, then refines it with a diffusion-based deep generative model trained on datasets without MIDI annotations. It allows detailed control over timbres and expression through audio sample selection and extra MIDI design, similar to traditional functions in digital audio workstations.
arXiv Detail & Related papers (2024-10-22T08:01:40Z)
Accompanied Singing Voice Synthesis with Fully Text-controlled Melody [61.147446955297625]
Text-to-song (TTSong) is a music generation task that synthesizes accompanied singing voices. We present MelodyLM, the first TTSong model that generates high-quality song pieces with fully text-controlled melodies.
arXiv Detail & Related papers (2024-07-02T08:23:38Z)
RMSSinger: Realistic-Music-Score based Singing Voice Synthesis [56.51475521778443]
RMS-SVS aims to generate high-quality singing voices given realistic music scores with different note types. We propose RMSSinger, the first RMS-SVS method, which takes realistic music scores as input. In RMSSinger, we introduce word-level modeling to avoid the time-consuming phoneme duration annotation and the complicated phoneme-level mel-note alignment.
arXiv Detail & Related papers (2023-05-18T03:57:51Z)
Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models [65.18102159618631]
Multimodal generative modeling has created milestones in text-to-image and text-to-video generation. Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data. We propose Make-An-Audio with a prompt-enhanced diffusion model that addresses these gaps.
arXiv Detail & Related papers (2023-01-30T04:44:34Z)
MIDI-DDSP: Detailed Control of Musical Performance via Hierarchical Modeling [6.256118777336895]
Musical expression requires control of both what notes are played, and how they are performed. We introduce MIDI-DDSP, a hierarchical model of musical instruments that enables both realistic neural audio synthesis and detailed user control. We demonstrate that this hierarchy can reconstruct high-fidelity audio, accurately predict performance attributes for a note sequence, independently manipulate the attributes of a given performance, and as a complete system, generate realistic audio from a novel note sequence.
arXiv Detail & Related papers (2021-12-17T04:15:42Z)
BERT-like Pre-training for Symbolic Piano Music Classification Tasks [15.02723006489356]
This article presents a benchmark study of symbolic piano music classification using the Bidirectional Encoder Representations from Transformers (BERT) approach. We pre-train two 12-layer Transformer models using the BERT approach and fine-tune them for four downstream classification tasks. Our evaluation shows that the BERT approach leads to higher classification accuracy than recurrent neural network (RNN)-based baselines.
arXiv Detail & Related papers (2021-07-12T07:03:57Z)
Foley Music: Learning to Generate Music from Videos [115.41099127291216]
Foley Music is a system that can synthesize plausible music for a silent video clip about people playing musical instruments. We first identify two key intermediate representations for a successful video to music generator: body keypoints from videos and MIDI events from audio recordings. We present a Graph-Transformer framework that can accurately predict MIDI event sequences in accordance with the body movements.
arXiv Detail & Related papers (2020-07-21T17:59:06Z)

This list is automatically generated from the titles and abstracts of the papers in this site.