Related papers: Amadeus: Autoregressive Model with Bidirectional Attribute Modelling for Symbolic Music

Amadeus: Autoregressive Model with Bidirectional Attribute Modelling for Symbolic Music

URL: http://arxiv.org/abs/2508.20665v1
Date: Thu, 28 Aug 2025 11:15:44 GMT
Title: Amadeus: Autoregressive Model with Bidirectional Attribute Modelling for Symbolic Music
Authors: Hongju Su, Ke Li, Lan Yang, Honggang Zhang, Yi-Zhe Song,
Abstract summary: We introduce Amadeus, a novel symbolic music generation framework.<n>Amadeus adopts an autoregressive model for note sequences and a bidirectional discrete diffusion model for attributes.<n>We conduct extensive experiments on unconditional and text-conditioned generation tasks.
Score: 47.95375326361059
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Existing state-of-the-art symbolic music generation models predominantly adopt autoregressive or hierarchical autoregressive architectures, modelling symbolic music as a sequence of attribute tokens with unidirectional temporal dependencies, under the assumption of a fixed, strict dependency structure among these attributes. However, we observe that using different attributes as the initial token in these models leads to comparable performance. This suggests that the attributes of a musical note are, in essence, a concurrent and unordered set, rather than a temporally dependent sequence. Based on this insight, we introduce Amadeus, a novel symbolic music generation framework. Amadeus adopts a two-level architecture: an autoregressive model for note sequences and a bidirectional discrete diffusion model for attributes. To enhance performance, we propose Music Latent Space Discriminability Enhancement Strategy(MLSDES), incorporating contrastive learning constraints that amplify discriminability of intermediate music representations. The Conditional Information Enhancement Module (CIEM) simultaneously strengthens note latent vector representation via attention mechanisms, enabling more precise note decoding. We conduct extensive experiments on unconditional and text-conditioned generation tasks. Amadeus significantly outperforms SOTA models across multiple metrics while achieving at least 4$\times$ speed-up. Furthermore, we demonstrate training-free, fine-grained note attribute control feasibility using our model. To explore the upper performance bound of the Amadeus architecture, we compile the largest open-source symbolic music dataset to date, AMD (Amadeus MIDI Dataset), supporting both pre-training and fine-tuning.

Related papers

Diff-V2M: A Hierarchical Conditional Diffusion Model with Explicit Rhythmic Modeling for Video-to-Music Generation [26.273309051211204]
Video-to-music (V2M) generation aims to create music that aligns with visual content.<n>We propose Diff-V2M, a general V2M framework based on a hierarchical conditional diffusion model.<n>For rhythm modeling, we begin by evaluating several rhythmic representations, including low-resolution mel-spectrograms, tempograms, and onset detection functions (ODF)
arXiv Detail & Related papers (2025-11-12T08:02:06Z)
IAR2: Improving Autoregressive Visual Generation with Semantic-Detail Associated Token Prediction [77.06211178777939]
IAR2 is an advanced autoregressive framework that enables a hierarchical semantic-detail synthesis process.<n>We show that IAR2 sets a new state-of-the-art for autoregressive image generation, achieving a FID of 1.50 on ImageNet.
arXiv Detail & Related papers (2025-10-08T12:08:21Z)
Semantic Item Graph Enhancement for Multimodal Recommendation [49.66272783945571]
Multimodal recommendation systems have attracted increasing attention for their improved performance by leveraging items' multimodal information.<n>Prior methods often build modality-specific item-item semantic graphs from raw modality features.<n>These semantic graphs suffer from semantic deficiencies, including insufficient modeling of collaborative signals among items.
arXiv Detail & Related papers (2025-08-08T09:20:50Z)
Scaling Self-Supervised Representation Learning for Symbolic Piano Performance [52.661197827466886]
We study the capabilities of generative autoregressive transformer models trained on large amounts of symbolic solo-piano transcriptions.<n>We use a comparatively smaller, high-quality subset to finetune models to produce musical continuations, perform symbolic classification tasks, and produce general-purpose contrastive MIDI embeddings.
arXiv Detail & Related papers (2025-06-30T14:00:14Z)
Enhancing Attributed Graph Networks with Alignment and Uniformity Constraints for Session-based Recommendation [18.318271141864297]
Session-based Recommendation (SBR) seeks to predict a user's next action based on an anonymous session. Most SBR models rely on the contextual transitions within a short session to learn item representations. We propose a model-agnostic framework, named AttrGAU, to bring the Modeling of Item Attributes's superiority into existing attribute-agnostic models.
arXiv Detail & Related papers (2024-10-14T08:49:11Z)
MARBLE: Music Audio Representation Benchmark for Universal Evaluation [79.25065218663458]
We introduce the Music Audio Representation Benchmark for universaL Evaluation, termed MARBLE. It aims to provide a benchmark for various Music Information Retrieval (MIR) tasks by defining a comprehensive taxonomy with four hierarchy levels, including acoustic, performance, score, and high-level description. We then establish a unified protocol based on 14 tasks on 8 public-available datasets, providing a fair and standard assessment of representations of all open-sourced pre-trained models developed on music recordings as baselines.
arXiv Detail & Related papers (2023-06-18T12:56:46Z)
MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training [74.32603591331718]
We propose an acoustic Music undERstanding model with large-scale self-supervised Training (MERT), which incorporates teacher models to provide pseudo labels in the masked language modelling (MLM) style acoustic pre-training.<n> Experimental results indicate that our model can generalise and perform well on 14 music understanding tasks and attain state-of-the-art (SOTA) overall scores.
arXiv Detail & Related papers (2023-05-31T18:27:43Z)
Multi-Stage Spatio-Temporal Aggregation Transformer for Video Person Re-identification [78.08536797239893]
We propose a novel Multi-Stage Spatial-Temporal Aggregation Transformer (MSTAT) with two novel designed proxy embedding modules. MSTAT consists of three stages to encode the attribute-associated, the identity-associated, and the attribute-identity-associated information from the video clips. We show that MSTAT can achieve state-of-the-art accuracies on various standard benchmarks.
arXiv Detail & Related papers (2023-01-02T05:17:31Z)
Music FaderNets: Controllable Music Generation Based On High-Level Features via Low-Level Feature Modelling [5.88864611435337]
We present a framework that can learn high-level feature representations with a limited amount of data. We refer to our proposed framework as Music FaderNets, which is inspired by the fact that low-level attributes can be continuously manipulated. We demonstrate that the model successfully learns the intrinsic relationship between arousal and its corresponding low-level attributes.
arXiv Detail & Related papers (2020-07-29T16:01:45Z)
Continuous Melody Generation via Disentangled Short-Term Representations and Structural Conditions [14.786601824794369]
We present a model for composing melodies given a user specified symbolic scenario combined with a previous music context. Our model is capable of generating long melodies by regarding 8-beat note sequences as basic units, and shares consistent rhythm pattern structure with another specific song. Results show that the music generated by our model tends to have salient repetition structures, rich motives, and stable rhythm patterns.
arXiv Detail & Related papers (2020-02-05T06:23:44Z)
Learning Style-Aware Symbolic Music Representations by Adversarial Autoencoders [9.923470453197657]
We focus on leveraging adversarial regularization as a flexible and natural mean to imbue variational autoencoders with context information. We introduce the first Music Adversarial Autoencoder (MusAE) Our model has a higher reconstruction accuracy than state-of-the-art models based on standard variational autoencoders.
arXiv Detail & Related papers (2020-01-15T18:07:20Z)

This list is automatically generated from the titles and abstracts of the papers in this site.