Museformer: Transformer with Fine- and Coarse-Grained Attention for
Music Generation
- URL: http://arxiv.org/abs/2210.10349v1
- Date: Wed, 19 Oct 2022 07:31:56 GMT
- Title: Museformer: Transformer with Fine- and Coarse-Grained Attention for
Music Generation
- Authors: Botao Yu, Peiling Lu, Rui Wang, Wei Hu, Xu Tan, Wei Ye, Shikun Zhang,
Tao Qin, Tie-Yan Liu
- Abstract summary: We propose Museformer, a Transformer with a novel fine- and coarse-grained attention for music generation.
Specifically, with the fine-grained attention, a token of a specific bar directly attends to all the tokens of the bars that are most relevant to music structures.
With the coarse-grained attention, a token attends only to a summarization of each of the other bars, rather than to every one of their tokens, so as to reduce the computational cost.
- Score: 138.74751744348274
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Symbolic music generation aims to generate music scores automatically. A
recent trend is to use Transformer or its variants in music generation, which
is, however, suboptimal, because the full attention cannot efficiently model
the typically long music sequences (e.g., over 10,000 tokens), and the existing
models have shortcomings in generating musical repetition structures. In this
paper, we propose Museformer, a Transformer with a novel fine- and
coarse-grained attention for music generation. Specifically, with the
fine-grained attention, a token of a specific bar directly attends to all the
tokens of the bars that are most relevant to music structures (e.g., the
previous 1st, 2nd, 4th and 8th bars, selected via similarity statistics); with
the coarse-grained attention, a token attends only to the summarizations of
the other bars, rather than to each of their tokens, so as to reduce the
computational cost. The advantages are two-fold. First, it can capture both music
structure-related correlations via the fine-grained attention, and other
contextual information via the coarse-grained attention. Second, it is
efficient and can model over 3X longer music sequences compared to its
full-attention counterpart. Both objective and subjective experimental results
demonstrate its ability to generate long music sequences with high quality and
better structures.
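
To make the attention pattern concrete, the following is a minimal Python sketch of how such a fine- and coarse-grained attention mask could be built. It is an illustration reconstructed from the abstract alone, not the authors' implementation: the function name, the one-summary-token-per-bar layout, and the fixed related-bar offsets (1, 2, 4, 8) are all assumptions.

import numpy as np

def museformer_mask(bar_ids, num_bars, related=(1, 2, 4, 8)):
    """bar_ids: length-T array mapping each content token to its bar index.
    Returns a (T + num_bars) x (T + num_bars) boolean mask, True = may attend.
    Layout: T content tokens first, then one summary token per bar."""
    bar_ids = np.asarray(bar_ids)
    T = len(bar_ids)
    mask = np.zeros((T + num_bars, T + num_bars), dtype=bool)
    sum_pos = T + np.arange(num_bars)  # positions of the per-bar summary tokens

    for i in range(T):
        b = bar_ids[i]
        # Fine-grained: full attention to the current bar and to the
        # structure-related previous bars (causal, past tokens only).
        fine_bars = {b} | {b - k for k in related if b - k >= 0}
        for j in range(i + 1):
            if bar_ids[j] in fine_bars:
                mask[i, j] = True
        # Coarse-grained: every other past bar is seen only through its
        # single summary token.
        for pb in range(b):
            if pb not in fine_bars:
                mask[i, sum_pos[pb]] = True

    # Each summary token aggregates exactly the tokens of its own bar.
    for pb in range(num_bars):
        mask[sum_pos[pb], :T] = (bar_ids == pb)
    return mask

# Usage: 10 bars, 3 tokens per bar -> a (40, 40) mask.
mask = museformer_mask(np.repeat(np.arange(10), 3), num_bars=10)
print(mask.shape)

In a real model, this boolean mask would be applied inside the attention layers, replacing most of the quadratic token-to-token interactions with cheap token-to-summary interactions.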
Related papers
- PerceiverS: A Multi-Scale Perceiver with Effective Segmentation for Long-Term Expressive Symbolic Music Generation [5.201151187019607]
PerceiverS (Segmentation and Scale) is a novel architecture designed to generate long-structured and expressive music.
Our approach enhances symbolic music generation by simultaneously learning long-term structural dependencies and short-term expressive details.
The proposed model, evaluated on datasets like Maestro, demonstrates improvements in generating coherent and diverse music.
arXiv Detail & Related papers (2024-11-13T03:14:10Z)
- MusicFlow: Cascaded Flow Matching for Text Guided Music Generation [53.63948108922333]
MusicFlow is a cascaded text-to-music generation model based on flow matching.
We leverage masked prediction as the training objective, enabling the model to generalize to other tasks such as music infilling and continuation.
arXiv Detail & Related papers (2024-10-27T15:35:41Z)
- Do we need more complex representations for structure? A comparison of note duration representation for Music Transformers [0.0]
In this work, we investigate whether off-the-shelf Music Transformer models perform just as well on structural similarity metrics using only unannotated MIDI information.
We show that a slight tweak to the most common representation yields small but significant improvements.
arXiv Detail & Related papers (2024-10-14T13:53:11Z)
- Generating High-quality Symbolic Music Using Fine-grained Discriminators [42.200747558496055]
We propose to decouple melody and rhythm from music and design corresponding fine-grained discriminators to address each aspect.
Specifically, equipped with a pitch augmentation strategy, the melody discriminator discerns the melody variations presented by the generated samples.
The rhythm discriminator, enhanced with bar-level relative positional encoding, focuses on the velocity of generated notes.
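As a rough illustration of the setup described above, the sketch below pairs a random-transposition pitch augmentation with a small melody discriminator. Every concrete choice here (the GRU scorer, the transposition range, the tensor shapes) is a hypothetical stand-in, not the paper's actual design.

import torch
import torch.nn as nn

def pitch_augment(pitches: torch.Tensor, max_shift: int = 6) -> torch.Tensor:
    """Randomly transpose a (batch, time) tensor of MIDI pitch tokens."""
    shift = torch.randint(-max_shift, max_shift + 1, (pitches.size(0), 1))
    return (pitches + shift).clamp(0, 127)

class MelodyDiscriminator(nn.Module):
    """Scores pitch sequences as real or generated (illustrative only)."""
    def __init__(self, d: int = 64):
        super().__init__()
        self.emb = nn.Embedding(128, d)
        self.rnn = nn.GRU(d, d, batch_first=True)
        self.out = nn.Linear(d, 1)

    def forward(self, pitches: torch.Tensor) -> torch.Tensor:
        h, _ = self.rnn(self.emb(pitches))
        return self.out(h[:, -1])  # one real/fake logit per sequence

disc = MelodyDiscriminator()
fake_pitches = torch.randint(48, 84, (2, 32))   # a batch of generated melodies
print(disc(pitch_augment(fake_pitches)).shape)  # torch.Size([2, 1])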
arXiv Detail & Related papers (2024-08-03T07:32:21Z)
- BandControlNet: Parallel Transformers-based Steerable Popular Music Generation with Fine-Grained Spatiotemporal Features [19.284531698181116]
BandControlNet is designed to tackle multiple music sequences and to generate high-quality music samples conditioned on the given spatiotemporal control features.
The proposed BandControlNet outperforms other conditional music generation models on most objective metrics in terms of fidelity and inference speed.
Subjective evaluations show that models trained on short datasets can generate music of quality comparable to state-of-the-art models, and that BandControlNet significantly outperforms them.
arXiv Detail & Related papers (2024-07-15T06:33:25Z)
- MuseBarControl: Enhancing Fine-Grained Control in Symbolic Music Generation through Pre-Training and Counterfactual Loss [51.85076222868963]
We introduce a pre-training task designed to link control signals directly with corresponding musical tokens.
We then implement a novel counterfactual loss that promotes better alignment between the generated music and the control prompts.
arXiv Detail & Related papers (2024-07-05T08:08:22Z)
- Simple and Controllable Music Generation [94.61958781346176]
MusicGen is a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens.
Unlike prior work, MusicGen consists of a single-stage transformer LM together with efficient token interleaving patterns.
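The "delay" pattern, one of the interleaving schemes studied in the MusicGen paper, shifts codebook stream k right by k steps so that a single-stage LM can model all K streams jointly. The toy helper below sketches that idea; the function name and the PAD placeholder are illustrative assumptions.

from typing import List

PAD = -1  # placeholder where a stream has no token yet

def delay_interleave(streams: List[List[int]]) -> List[List[int]]:
    """Shift codebook stream k right by k steps so the K streams can be
    predicted in parallel by a single-stage LM."""
    K, T = len(streams), len(streams[0])
    out = [[PAD] * (T + K - 1) for _ in range(K)]
    for k, stream in enumerate(streams):
        for t, token in enumerate(stream):
            out[k][t + k] = token
    return out

# Usage: 3 codebooks, 4 frames.
for row in delay_interleave([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]):
    print(row)
# [1, 2, 3, 4, -1, -1]
# [-1, 5, 6, 7, 8, -1]
# [-1, -1, 9, 10, 11, 12]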
arXiv Detail & Related papers (2023-06-08T15:31:05Z)
- GETMusic: Generating Any Music Tracks with a Unified Representation and Diffusion Framework [58.64512825534638]
Symbolic music generation aims to create musical notes, which can help users compose music.
We introduce a framework known as GETMusic, with "GET" standing for "GEnerate music Tracks".
GETScore represents musical notes as tokens and organizes tokens in a 2D structure, with tracks stacked vertically and progressing horizontally over time.
Our proposed representation, coupled with the non-autoregressive generative model, empowers GETMusic to generate music with arbitrary source-target track combinations.
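The snippet below sketches what such a GETScore-like 2D layout could look like: tracks stacked vertically, time advancing horizontally, and a boolean mask marking which tracks are generation targets. The token values and the EMPTY symbol are illustrative assumptions, not GETMusic's actual vocabulary.

import numpy as np

EMPTY = 0  # hypothetical "no note here" token

# 3 tracks x 8 time steps; nonzero integers stand in for note tokens.
score = np.full((3, 8), EMPTY, dtype=int)
score[0, [0, 2, 4, 6]] = [60, 62, 64, 65]  # melody track (MIDI-like pitches)
score[1, [0, 4]] = [48, 43]                # bass track
score[2, ::2] = 36                         # drum track

# A non-autoregressive model can take some tracks as source context and
# fill in the masked target tracks in parallel, which is what enables
# arbitrary source-target track combinations.
target_mask = np.zeros_like(score, dtype=bool)
target_mask[1] = True  # e.g., generate the bass track given the others
print(score)
print(target_mask)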
arXiv Detail & Related papers (2023-05-18T09:53:23Z)
- MusicBERT: Symbolic Music Understanding with Large-Scale Pre-Training [97.91071692716406]
Symbolic music understanding refers to understanding music from symbolic data.
MusicBERT is a large-scale pre-trained model for music understanding.
arXiv Detail & Related papers (2021-06-10T10:13:05Z)
- MuseMorphose: Full-Song and Fine-Grained Music Style Transfer with Just One Transformer VAE [36.9033909878202]
Transformers and variational autoencoders (VAEs) have been extensively employed for symbolic (e.g., MIDI) domain music generation.
In this paper, we are interested in bringing the two together to construct a single model that exhibits both strengths.
Experiments show that MuseMorphose outperforms recurrent neural network (RNN) based prior art on numerous widely-used metrics for style transfer tasks.
arXiv Detail & Related papers (2021-05-10T03:44:03Z)