Etude: Piano Cover Generation with a Three-Stage Approach -- Extract, strucTUralize, and DEcode
- URL: http://arxiv.org/abs/2509.16522v1
- Date: Sat, 20 Sep 2025 04:06:43 GMT
- Title: Etude: Piano Cover Generation with a Three-Stage Approach -- Extract, strucTUralize, and DEcode
- Authors: Tse-Yang Che, Yuh-Jzer Joung
- Abstract summary: Piano cover generation aims to automatically transform a pop song into a piano arrangement. Existing models often fail to maintain structural consistency with the original song. Rhythmic information is crucial, as it defines structural similarity. Our model produces covers that preserve proper song structure, enhance fluency and musical dynamics, and support highly controllable generation.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Piano cover generation aims to automatically transform a pop song into a piano arrangement. While numerous deep learning approaches have been proposed, existing models often fail to maintain structural consistency with the original song, likely due to the absence of beat-aware mechanisms or the difficulty of modeling complex rhythmic patterns. Rhythmic information is crucial, as it defines structural similarity (e.g., tempo, BPM) and directly impacts the overall quality of the generated music. In this paper, we introduce Etude, a three-stage architecture consisting of Extract, strucTUralize, and DEcode stages. By pre-extracting rhythmic information and applying a novel, simplified REMI-based tokenization, our model produces covers that preserve proper song structure, enhance fluency and musical dynamics, and support highly controllable generation through style injection. Subjective evaluations with human listeners show that Etude substantially outperforms prior models, achieving a quality level comparable to that of human composers.
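To make the tokenization idea concrete, here is a minimal, hypothetical sketch of a REMI-style event encoding: each note becomes Position/Pitch/Duration tokens, with Bar tokens marking bar boundaries. The function name, the 16-slots-per-bar grid, and the token vocabulary are illustrative assumptions, not the paper's actual scheme.

```python
# Hypothetical sketch of REMI-style tokenization (illustrative, not the
# paper's actual vocabulary): notes become Position/Pitch/Duration tokens,
# and a Bar token is emitted at each bar boundary.

def remi_tokenize(notes, slots_per_bar=16):
    """notes: list of (onset_slot, midi_pitch, duration_slots), sorted by onset."""
    tokens = []
    current_bar = -1
    for onset, pitch, dur in notes:
        bar = onset // slots_per_bar
        while current_bar < bar:  # emit one Bar token per new bar entered
            current_bar += 1
            tokens.append("Bar")
        tokens.append(f"Position_{onset % slots_per_bar}")
        tokens.append(f"Pitch_{pitch}")
        tokens.append(f"Duration_{dur}")
    return tokens

# Two notes: C4 on the downbeat of bar 0, E4 at slot 8 of bar 1.
print(remi_tokenize([(0, 60, 4), (24, 64, 2)]))
# → ['Bar', 'Position_0', 'Pitch_60', 'Duration_4',
#    'Bar', 'Position_8', 'Pitch_64', 'Duration_2']
```

Because bar and beat positions are explicit tokens, a decoder trained on such a sequence can be conditioned on pre-extracted rhythmic structure, which is the intuition behind the Extract and strucTUralize stages described above.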
Related papers
- Muse: Towards Reproducible Long-Form Song Generation with Fine-Grained Style Control [66.46754271097555]
We release a fully open-source system for long-form song generation with fine-grained style conditioning. The dataset consists of 116k fully licensed synthetic songs with automatically generated lyrics and style descriptions. We train Muse via single-stage supervised finetuning of a Qwen-based language model extended with discrete audio tokens.
arXiv Detail & Related papers (2026-01-07T14:40:48Z) - ProGress: Structured Music Generation via Graph Diffusion and Hierarchical Music Analysis [30.70586380345095]
We present a novel generative music framework that incorporates Schenkerian analysis (SchA) in concert with a diffusion modeling framework. Results from human experiments suggest superior performance to existing state-of-the-art methods.
arXiv Detail & Related papers (2025-10-11T15:06:56Z) - Scaling Self-Supervised Representation Learning for Symbolic Piano Performance [52.661197827466886]
We study the capabilities of generative autoregressive transformer models trained on large amounts of symbolic solo-piano transcriptions. We use a comparatively smaller, high-quality subset to finetune models to produce musical continuations, perform symbolic classification tasks, and produce general-purpose contrastive MIDI embeddings.
arXiv Detail & Related papers (2025-06-30T14:00:14Z) - From Generality to Mastery: Composer-Style Symbolic Music Generation via Large-Scale Pre-training [4.7205815347741185]
We investigate how general music knowledge learned from a broad corpus can enhance the mastery of specific composer styles. First, we pre-train a REMI-based music generation model on a large corpus of pop, folk, and classical music. Then, we fine-tune it on a small, human-verified dataset from four renowned composers, namely Bach, Mozart, Beethoven, and Chopin.
arXiv Detail & Related papers (2025-06-20T22:20:59Z) - Synthesizing Composite Hierarchical Structure from Symbolic Music Corpora [32.18458296933001]
We propose a unified, hierarchical meta-representation of musical structure called the structural temporal graph (STG). For a single piece, the STG is a data structure that defines a hierarchy of progressively finer structural musical features and the temporal relationships between them.
arXiv Detail & Related papers (2025-02-21T02:32:29Z) - MuseBarControl: Enhancing Fine-Grained Control in Symbolic Music Generation through Pre-Training and Counterfactual Loss [51.85076222868963]
We introduce a pre-training task designed to link control signals directly with corresponding musical tokens.
We then implement a novel counterfactual loss that promotes better alignment between the generated music and the control prompts.
arXiv Detail & Related papers (2024-07-05T08:08:22Z) - Motifs, Phrases, and Beyond: The Modelling of Structure in Symbolic Music Generation [2.8062498505437055]
Modelling musical structure is vital yet challenging for artificial intelligence systems that generate symbolic music compositions.
This literature review dissects the evolution of techniques for incorporating coherent structure.
We outline several key future directions to realize the synergistic benefits of combining approaches from all eras examined.
arXiv Detail & Related papers (2024-03-12T18:03:08Z) - Simple and Controllable Music Generation [94.61958781346176]
MusicGen is a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens.
Unlike prior work, MusicGen is comprised of a single-stage transformer LM together with efficient token interleaving patterns.
arXiv Detail & Related papers (2023-06-08T15:31:05Z) - MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training [74.32603591331718]
We propose an acoustic Music undERstanding model with large-scale self-supervised Training (MERT), which incorporates teacher models to provide pseudo labels in masked language modelling (MLM) style acoustic pre-training. Experimental results indicate that our model can generalise and perform well on 14 music understanding tasks and attain state-of-the-art (SOTA) overall scores.
arXiv Detail & Related papers (2023-05-31T18:27:43Z) - A framework to compare music generative models using automatic evaluation metrics extended to rhythm [69.2737664640826]
This paper builds on a framework from earlier research that did not consider rhythm, makes a series of design decisions, and then adds rhythm support in order to evaluate the performance of two RNN memory cells in the creation of monophonic music.
The model handles music transposition, and the framework evaluates the quality of the generated pieces using automatic, geometry-based quantitative metrics that also incorporate rhythm support.
arXiv Detail & Related papers (2021-01-19T15:04:46Z) - Sequence Generation using Deep Recurrent Networks and Embeddings: A study case in music [69.2737664640826]
This paper evaluates different types of memory mechanisms (memory cells) and analyses their performance in the field of music composition.
A set of quantitative metrics is presented to evaluate the performance of the proposed architecture automatically.
arXiv Detail & Related papers (2020-12-02T14:19:19Z) - Learning Interpretable Representation for Controllable Polyphonic Music Generation [5.01266258109807]
We design a novel architecture that effectively learns two interpretable latent factors of polyphonic music: chord and texture.
We show that such chord-texture disentanglement provides a controllable generation pathway leading to a wide spectrum of applications.
arXiv Detail & Related papers (2020-08-17T07:11:16Z) - Continuous Melody Generation via Disentangled Short-Term Representations and Structural Conditions [14.786601824794369]
We present a model for composing melodies given a user specified symbolic scenario combined with a previous music context.
Our model is capable of generating long melodies by treating 8-beat note sequences as basic units, and it maintains a rhythm pattern structure consistent with a specified reference song.
Results show that the music generated by our model tends to have salient repetition structures, rich motives, and stable rhythm patterns.
arXiv Detail & Related papers (2020-02-05T06:23:44Z) - Modeling Musical Structure with Artificial Neural Networks [0.0]
I explore the application of artificial neural networks to different aspects of musical structure modeling.
I show how a connectionist model, the Gated Autoencoder (GAE), can be employed to learn transformations between musical fragments.
I propose a special predictive training of the GAE, which yields a representation of polyphonic music as a sequence of intervals.
arXiv Detail & Related papers (2020-01-06T18:35:57Z)