Pay (Cross) Attention to the Melody: Curriculum Masking for Single-Encoder Melodic Harmonization
- URL: http://arxiv.org/abs/2601.16150v1
- Date: Thu, 22 Jan 2026 17:46:31 GMT
- Title: Pay (Cross) Attention to the Melody: Curriculum Masking for Single-Encoder Melodic Harmonization
- Authors: Maximos Kaliakatsos-Papakostas, Dimos Makris, Konstantinos Soiledis, Konstantinos-Theodoros Tsamis, Vassilis Katsouros, Emilios Cambouropoulos,
- Abstract summary: We introduce a training curriculum, FF (full-to-full), which keeps all harmony tokens masked for several training steps. We systematically evaluate this approach against prior curricula across multiple experimental axes. Results demonstrate that the proposed FF curriculum consistently outperforms baselines in nearly all metrics.
- Score: 2.087792589220897
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Melodic harmonization, the task of generating harmonic accompaniments for a given melody, remains a central challenge in computational music generation. Recent single-encoder transformer approaches have framed harmonization as a masked sequence modeling problem, but existing training curricula inspired by discrete diffusion often result in weak (cross) attention between melody and harmony. This leads to limited exploitation of melodic cues, particularly in out-of-domain contexts. In this work, we introduce a training curriculum, FF (full-to-full), which keeps all harmony tokens masked for several training steps before progressively unmasking entire sequences during training to strengthen melody-harmony interactions. We systematically evaluate this approach against prior curricula across multiple experimental axes, including temporal quantization (quarter vs. sixteenth note), bar-level vs. time-signature conditioning, melody representation (full range vs. pitch class), and inference-time unmasking strategies. Models are trained on the HookTheory dataset and evaluated both in-domain and on a curated collection of jazz standards, using a comprehensive set of metrics that assess chord progression structure, harmony-melody alignment, and rhythmic coherence. Results demonstrate that the proposed FF curriculum consistently outperforms baselines in nearly all metrics, with particularly strong gains in out-of-domain evaluations where harmonic adaptability to novel melodic cues is crucial. We further find that quarter-note quantization, intertwining of bar tokens, and pitch-class melody representations are advantageous in the FF setting. Our findings highlight the importance of training curricula in enabling effective melody conditioning and suggest that full-to-full unmasking offers a robust strategy for single-encoder harmonization.
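The FF schedule described in the abstract (all harmony tokens masked for a warm-up phase, then progressive unmasking over training) can be sketched as a simple schedule function. This is an illustrative reading only: the function name, the warm-up/total step counts, and the linear decay are assumptions, not the authors' implementation.

```python
def ff_mask_ratio(step: int, warmup_steps: int = 10_000,
                  total_steps: int = 100_000) -> float:
    """Fraction of harmony tokens to mask at a given training step.

    Full-to-full curriculum (hypothetical sketch): harmony stays fully
    masked during warm-up, forcing the encoder to attend to the melody,
    then the mask ratio decays linearly toward fully unmasked sequences.
    """
    if step < warmup_steps:
        return 1.0  # full masking: harmonize from melody alone
    remaining = total_steps - warmup_steps
    # linear decay from 1.0 to 0.0 over the remaining steps
    return max(0.0, 1.0 - (step - warmup_steps) / remaining)
```

Under this reading, the key design choice is the warm-up plateau: unlike diffusion-style curricula that sample mask ratios from the start, the model first sees only fully masked harmony, which the paper argues strengthens melody-harmony cross attention.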
Related papers
- From Discord to Harmony: Decomposed Consonance-based Training for Improved Audio Chord Estimation [9.584152437544974]
This paper presents an evaluation of inter-annotator agreement in chord annotations, using metrics that extend beyond traditional binary measures. We introduce a novel conformer-based ACE model that integrates consonance concepts through consonance-based label smoothing.
arXiv Detail & Related papers (2025-09-01T16:20:47Z)
- Scaling Self-Supervised Representation Learning for Symbolic Piano Performance [52.661197827466886]
We study the capabilities of generative autoregressive transformer models trained on large amounts of symbolic solo-piano transcriptions. We use a comparatively smaller, high-quality subset to finetune models to produce musical continuations, perform symbolic classification tasks, and produce general-purpose contrastive MIDI embeddings.
arXiv Detail & Related papers (2025-06-30T14:00:14Z) - Adaptive Accompaniment with ReaLchords [60.690020661819055]
We propose ReaLchords, an online generative model for improvising chord accompaniment to user melody.<n>We start with an online model pretrained by maximum likelihood, and use reinforcement learning to finetune the model for online use.
arXiv Detail & Related papers (2025-06-17T16:59:05Z) - Bridging Weakly-Supervised Learning and VLM Distillation: Noisy Partial Label Learning for Efficient Downstream Adaptation [51.67328507400985]
In noisy partial label learning (NPLL), each training sample is associated with a set of candidate labels annotated by multiple noisy annotators.<n>This paper focuses on learning from partial labels annotated by pre-trained vision-language models.<n>It proposes an innovative collaborative consistency regularization (Co-Reg) method.
arXiv Detail & Related papers (2025-06-03T12:48:54Z) - Toward Fully Self-Supervised Multi-Pitch Estimation [21.000057864087164]
We present a suite of self-supervised learning objectives for multi-pitch estimation.
These objectives are sufficient to train an entirely convolutional autoencoder to produce multi-pitch salience-grams directly.
Our fully self-supervised framework generalizes to polyphonic music mixtures, and achieves performance comparable to supervised models trained on conventional multi-pitch datasets.
arXiv Detail & Related papers (2024-02-23T19:12:41Z) - MelodyGLM: Multi-task Pre-training for Symbolic Melody Generation [39.892059799407434]
MelodyGLM is a multi-task pre-training framework for generating melodies with long-term structure.
We have constructed a large-scale symbolic melody dataset, MelodyNet, containing more than 0.4 million melody pieces.
arXiv Detail & Related papers (2023-09-19T16:34:24Z) - SeCo: Separating Unknown Musical Visual Sounds with Consistency Guidance [88.0355290619761]
This work focuses on the separation of unknown musical instruments.
We propose the Separation-with-Consistency (SeCo) framework, which can accomplish the separation on unknown categories.
Our framework exhibits strong adaptation ability on the novel musical categories and outperforms the baseline methods by a significant margin.
arXiv Detail & Related papers (2022-03-25T09:42:11Z) - A-Muze-Net: Music Generation by Composing the Harmony based on the
Generated Melody [91.22679787578438]
We present a method for the generation of Midi files of piano music.
The method models the right and left hands using two networks, where the left hand is conditioned on the right hand.
The Midi is represented in a way that is invariant to the musical scale, and the melody is represented, for the purpose of conditioning the harmony.
arXiv Detail & Related papers (2021-11-25T09:45:53Z) - BacHMMachine: An Interpretable and Scalable Model for Algorithmic
Harmonization for Four-part Baroque Chorales [23.64897650817862]
BacHMMachine employs a "theory-driven" framework guided by music composition principles.
It provides a probabilistic framework for learning key modulations and chordal progressions from a given melodic line.
It results in vast decreases in computational burden and greater interpretability.
arXiv Detail & Related papers (2021-09-15T23:39:45Z) - Differential Music: Automated Music Generation Using LSTM Networks with
Representation Based on Melodic and Harmonic Intervals [0.0]
This paper presents a generative AI model for automated music composition with LSTM networks.
It takes a novel approach at encoding musical information which is based on movement in music rather than absolute pitch.
Experimental results show promise as they sound musical and tonal.
arXiv Detail & Related papers (2021-08-23T23:51:08Z) - SongMASS: Automatic Song Writing with Pre-training and Alignment
Constraint [54.012194728496155]
SongMASS is proposed to overcome the challenges of lyric-to-melody generation and melody-to-lyric generation.
It leverages masked sequence to sequence (MASS) pre-training and attention based alignment modeling.
We show that SongMASS generates lyric and melody with significantly better quality than the baseline method.
arXiv Detail & Related papers (2020-12-09T16:56:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.