MuseMorphose: Full-Song and Fine-Grained Music Style Transfer with Just
One Transformer VAE
- URL: http://arxiv.org/abs/2105.04090v1
- Date: Mon, 10 May 2021 03:44:03 GMT
- Title: MuseMorphose: Full-Song and Fine-Grained Music Style Transfer with Just
One Transformer VAE
- Authors: Shih-Lun Wu, Yi-Hsuan Yang
- Abstract summary: Transformers and variational autoencoders (VAEs) have been extensively employed for symbolic (e.g., MIDI) domain music generation.
In this paper, we are interested in bringing the two together to construct a single model that exhibits both strengths.
Experiments show that MuseMorphose outperforms recurrent neural network (RNN) based prior art on numerous widely-used metrics for style transfer tasks.
- Score: 36.9033909878202
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformers and variational autoencoders (VAEs) have been extensively
employed for symbolic (e.g., MIDI) domain music generation. While the former
boast an impressive capability in modeling long sequences, the latter allow
users to exert control at will over different parts (e.g., bars) of the music
to be generated. In this paper, we are interested in bringing the two together
to construct a single model that exhibits both strengths. The task is split
into two steps. First, we equip Transformer decoders with the ability to accept
segment-level, time-varying conditions during sequence generation.
Subsequently, we combine the developed and tested in-attention decoder with a
Transformer encoder, and train the resulting MuseMorphose model with the VAE
objective to achieve style transfer of long musical pieces, in which users can
specify musical attributes including rhythmic intensity and polyphony (i.e.,
harmonic fullness) they desire, down to the bar level. Experiments show that
MuseMorphose outperforms recurrent neural network (RNN) based prior art on
numerous widely-used metrics for style transfer tasks.
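The abstract outlines the core mechanism: an "in-attention" Transformer decoder that accepts a segment-level (bar-level), time-varying condition, paired with a Transformer encoder and trained with the VAE objective. The following is a minimal sketch, not the authors' released implementation, assuming one plausible reading of the idea: a per-bar condition embedding is re-injected into the decoder hidden states before every self-attention layer. All module names, dimensions, and the exact injection scheme are hypothetical.

```python
# Hypothetical sketch of bar-level, time-varying conditioning in a Transformer decoder.
import torch
import torch.nn as nn


class BarConditionedDecoder(nn.Module):
    """Decoder whose hidden states receive a bar-level condition vector
    before every self-attention layer (positional encodings omitted)."""

    def __init__(self, vocab_size, d_model=512, n_layers=6, n_heads=8, d_cond=128):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.cond_proj = nn.Linear(d_cond, d_model)  # map bar condition to model width
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)
        )
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, bar_ids, bar_conditions):
        # tokens:         (batch, seq_len)         token ids of the piece
        # bar_ids:        (batch, seq_len)         index of the bar each token belongs to
        # bar_conditions: (batch, n_bars, d_cond)  one condition vector per bar
        seq_len = tokens.size(1)
        causal_mask = torch.triu(
            torch.full((seq_len, seq_len), float("-inf")), diagonal=1
        )

        # Look up each token's bar-level condition, then project it.
        idx = bar_ids.unsqueeze(-1).expand(-1, -1, bar_conditions.size(-1))
        cond = self.cond_proj(torch.gather(bar_conditions, 1, idx))  # (batch, seq_len, d_model)

        h = self.token_emb(tokens)
        for layer in self.layers:
            # Re-inject the time-varying condition at every layer.
            h = layer(h + cond, src_mask=causal_mask)
        return self.out(h)  # next-token logits
```

Per the abstract, the per-bar condition in MuseMorphose would combine the encoder's latent code for that bar with user-specified attributes such as rhythmic intensity and polyphony, and training would add the usual KL regularization of the VAE objective on top of the token-level reconstruction loss.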
Related papers
- UniMuMo: Unified Text, Music and Motion Generation [57.72514622935806]
We introduce UniMuMo, a unified multimodal model capable of taking arbitrary text, music, and motion data as input conditions to generate outputs across all three modalities.
By converting music, motion, and text into token-based representation, our model bridges these modalities through a unified encoder-decoder transformer architecture.
arXiv Detail & Related papers (2024-10-06T16:04:05Z) - PerTok: Expressive Encoding and Modeling of Symbolic Musical Ideas and Variations [0.3683202928838613]
Cadenza is a new multi-stage generative framework for predicting expressive variations of symbolic musical ideas.
The proposed framework comprises two sequential stages: 1) Composer and 2) Performer.
Our framework is designed, researched and implemented with the objective of providing inspiration for musicians.
arXiv Detail & Related papers (2024-10-02T22:11:31Z) - BandControlNet: Parallel Transformers-based Steerable Popular Music Generation with Fine-Grained Spatiotemporal Features [19.284531698181116]
BandControlNet is designed to tackle multiple music sequences and generate high-quality music samples conditioned on the given spatiotemporal control features.
The proposed BandControlNet outperforms other conditional music generation models on most objective metrics in terms of fidelity and inference speed.
Subjective evaluations show that models trained on short datasets can generate music of quality comparable to the state of the art, while BandControlNet outperforms them significantly.
arXiv Detail & Related papers (2024-07-15T06:33:25Z) - Multi-view MidiVAE: Fusing Track- and Bar-view Representations for Long
Multi-track Symbolic Music Generation [50.365392018302416]
We propose Multi-view MidiVAE, one of the first VAE-based methods to effectively model and generate long multi-track symbolic music.
We focus on instrumental characteristics and harmony as well as global and local information about the musical composition by employing a hybrid variational encoding-decoding strategy.
arXiv Detail & Related papers (2024-01-15T08:41:01Z) - MAGMA: Music Aligned Generative Motion Autodecoder [15.825872274297735]
We introduce a 2-step approach for generating dance using a Vector Quantized-Variational Autoencoder (VQ-VAE).
We also evaluate the importance of music representations by comparing naive music feature extraction using Librosa to deep audio representations generated by state-of-the-art audio compression algorithms.
Our proposed approach achieves state-of-the-art results on music-to-motion generation benchmarks and enables real-time generation of considerably longer motion sequences.
arXiv Detail & Related papers (2023-09-03T15:21:47Z) - Simple and Controllable Music Generation [94.61958781346176]
MusicGen is a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens.
Unlike prior work, MusicGen is comprised of a single-stage transformer LM together with efficient token interleaving patterns.
arXiv Detail & Related papers (2023-06-08T15:31:05Z) - Museformer: Transformer with Fine- and Coarse-Grained Attention for
Music Generation [138.74751744348274]
We propose Museformer, a Transformer with a novel fine- and coarse-grained attention for music generation.
Specifically, with the fine-grained attention, a token of a specific bar directly attends to all the tokens of the bars that are most relevant to music structures.
With the coarse-grained attention, a token attends only to a summarization of the other bars rather than to each of their tokens, so as to reduce the computational cost (a toy mask sketch follows this list).
arXiv Detail & Related papers (2022-10-19T07:31:56Z) - The Power of Reuse: A Multi-Scale Transformer Model for Structural
Dynamic Segmentation in Symbolic Music Generation [6.0949335132843965]
Symbolic Music Generation relies on the contextual representation capabilities of the generative model.
We propose a multi-scale Transformer, which uses a coarse decoder and fine decoders to model contexts at the global and section levels.
Our model is evaluated on two open MIDI datasets, and experiments show that our model outperforms the best contemporary symbolic music generative models.
arXiv Detail & Related papers (2022-05-17T18:48:14Z) - Symphony Generation with Permutation Invariant Language Model [57.75739773758614]
We present a symbolic symphony music generation solution, SymphonyNet, based on a permutation invariant language model.
A novel transformer decoder architecture is introduced as backbone for modeling extra-long sequences of symphony tokens.
Our empirical results show that our proposed approach can generate coherent, novel, complex, and harmonious symphonies, comparable to human compositions.
arXiv Detail & Related papers (2022-05-10T13:08:49Z) - MMM : Exploring Conditional Multi-Track Music Generation with the
Transformer [9.569049935824227]
We propose a generative system based on the Transformer architecture that is capable of generating multi-track music.
We create a time-ordered sequence of musical events for each track and concatenate several tracks into a single sequence.
This takes advantage of the Transformer's attention-mechanism, which can adeptly handle long-term dependencies.
arXiv Detail & Related papers (2020-08-13T02:36:34Z)