Composer Style-specific Symbolic Music Generation Using Vector Quantized Discrete Diffusion Models
- URL: http://arxiv.org/abs/2310.14044v2
- Date: Tue, 3 Sep 2024 19:14:25 GMT
- Title: Composer Style-specific Symbolic Music Generation Using Vector Quantized Discrete Diffusion Models
- Authors: Jincheng Zhang, György Fazekas, Charalampos Saitis
- Abstract summary: We propose to combine a vector quantized variational autoencoder (VQ-VAE) and discrete diffusion models for the generation of symbolic music.
The trained VQ-VAE represents symbolic music as a sequence of indices that correspond to specific entries in a learned codebook.
The diffusion model is trained to generate latent sequences of codebook indices, which are then decoded to symbolic music by the VQ-VAE's decoder.
- Score: 5.083504224028769
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Emerging Denoising Diffusion Probabilistic Models (DDPMs) have become increasingly utilised because of the promising results they have achieved in diverse generative tasks with continuous data, such as image and sound synthesis. Nonetheless, this success has not yet been fully extended to discrete symbolic music. We propose to combine a vector quantized variational autoencoder (VQ-VAE) and discrete diffusion models to generate symbolic music in desired composer styles. The trained VQ-VAE represents symbolic music as a sequence of indices that correspond to specific entries in a learned codebook. A discrete diffusion model is then used to model the VQ-VAE's discrete latent space: it is trained to generate latent sequences of codebook indices, which are subsequently decoded to symbolic music by the VQ-VAE's decoder. The evaluation results demonstrate that our model can generate symbolic music in the target composer styles, meeting the given conditions with a high accuracy of 72.36%. Our code is available at https://github.com/jinchengzhanggg/VQVAE-Diffusion.
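To make the two-stage pipeline concrete, the sketch below samples codebook indices with a mask-based (absorbing-state) discrete diffusion process conditioned on a composer style, then decodes them with the VQ-VAE decoder. Every interface here (`denoiser`, `vqvae.codebook`, `vqvae.decoder`) and the unmasking schedule is a hypothetical assumption for illustration, not the authors' implementation, which may use a different corruption process.

```python
# Minimal sketch of sampling in the two-stage pipeline (hypothetical
# interfaces; not the authors' code). Assumes an absorbing-state ("mask")
# discrete diffusion over VQ-VAE codebook indices.
import torch

@torch.no_grad()
def sample_music(denoiser, vqvae, style_id, seq_len=256, num_steps=50,
                 codebook_size=512):
    mask_id = codebook_size                    # extra absorbing token id
    tokens = torch.full((1, seq_len), mask_id, dtype=torch.long)  # fully corrupted
    for step in reversed(range(1, num_steps + 1)):
        t = torch.tensor([step])
        # Style-conditioned denoiser predicts a distribution over codebook
        # entries at every position: (1, seq_len, codebook_size).
        logits = denoiser(tokens, t, style_id)
        pred = torch.distributions.Categorical(logits=logits).sample()
        # Reveal tokens gradually: re-mask a shrinking random fraction.
        still_masked = torch.rand(1, seq_len) < (step - 1) / num_steps
        tokens = torch.where(still_masked, torch.full_like(pred, mask_id), pred)
    latents = vqvae.codebook(tokens)   # indices -> codebook vectors (1, L, d)
    return vqvae.decoder(latents)      # decode latents to symbolic music
```

In practice `denoiser` would be the trained discrete diffusion network and `vqvae` the autoencoder trained in the first stage.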
Related papers
- WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling [65.30937248905958]
A crucial component of language models is the tokenizer, which compresses high-dimensional natural signals into lower-dimensional discrete tokens.
We introduce WavTokenizer, which offers several advantages over previous SOTA acoustic models in the audio domain.
WavTokenizer achieves state-of-the-art reconstruction quality with outstanding UTMOS scores and inherently contains richer semantic information.
arXiv Detail & Related papers (2024-08-29T13:43:36Z)
- An Independence-promoting Loss for Music Generation with Language Models [64.95095558672996]
Music generation schemes rely on a vocabulary of audio tokens, generally provided as codes in a discrete latent space learnt by an auto-encoder.
We introduce an independence-promoting loss to regularize the auto-encoder used as the tokenizer in language models for music generation.
arXiv Detail & Related papers (2024-06-04T13:44:39Z)
- Fast Diffusion GAN Model for Symbolic Music Generation Controlled by Emotions [1.6004393678882072]
We propose a diffusion model combined with a Generative Adversarial Network to generate discrete symbolic music.
We first use a trained Variational Autoencoder to obtain embeddings of a symbolic music dataset with emotion labels.
Our results demonstrate the successful control of our diffusion model to generate symbolic music with a desired emotion.
arXiv Detail & Related papers (2023-10-21T15:35:43Z)
- From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion [84.138804145918]
Deep generative models can generate high-fidelity audio conditioned on various types of representations.
These models are prone to generate audible artifacts when the conditioning is flawed or imperfect.
We propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality from low-bitrate discrete representations.
arXiv Detail & Related papers (2023-08-02T22:14:29Z)
- Generating symbolic music using diffusion models [0.0]
A diffusion model that uses a binomial prior distribution to generate piano rolls is proposed.
The generated music has coherence at time scales up to the length of the training piano roll segments.
The code is publicly shared to encourage the use and development of the method by the community.
arXiv Detail & Related papers (2023-03-15T06:01:02Z)
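The binomial prior above suggests a simple binary corruption process. The following is a generic sketch (rates and interfaces are assumptions, not the paper's exact formulation) that noises a binary piano roll toward an i.i.d. Bernoulli prior, which the reverse model then learns to invert.

```python
# Generic sketch of a binary forward-corruption process for piano rolls
# (illustrative rates; not the exact formulation of the paper above).
import torch

def corrupt_piano_roll(roll: torch.Tensor, t: int, num_steps: int,
                       prior_p: float = 0.5) -> torch.Tensor:
    """roll: binary tensor (pitches, frames) with 1 = note on.
    With probability t / num_steps each cell is resampled from a
    Bernoulli(prior_p), so at t = num_steps the roll is pure prior noise."""
    roll = roll.float()
    resample = torch.rand_like(roll) < t / num_steps
    noise = torch.bernoulli(torch.full_like(roll, prior_p))
    return torch.where(resample, noise, roll)
```

A denoising network trained to predict the clean roll from the corrupted one at every step can then generate by starting from prior noise and iteratively denoising.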
- Vector Quantized Wasserstein Auto-Encoder [57.29764749855623]
We study learning deep discrete representations from the generative viewpoint.
We place discrete distributions over sequences of codewords and learn a deterministic decoder that transports the distribution over codeword sequences to the data distribution.
We develop further theories to connect it with the clustering viewpoint of WS distance, allowing us to have a better and more controllable clustering solution.
arXiv Detail & Related papers (2023-02-12T13:51:36Z)
- Symphony Generation with Permutation Invariant Language Model [57.75739773758614]
We present a symbolic symphony music generation solution, SymphonyNet, based on a permutation invariant language model.
A novel transformer decoder architecture is introduced as the backbone for modeling extra-long sequences of symphony tokens.
Our empirical results show that our proposed approach can generate coherent, novel, complex, and harmonious symphonies comparable to human compositions.
arXiv Detail & Related papers (2022-05-10T13:08:49Z)
- Denoising Diffusion Probabilistic Models [91.94962645056896]
We present high quality image synthesis results using diffusion probabilistic models.
Our best results are obtained by training on a weighted variational bound designed according to a novel connection between diffusion probabilistic models and denoising score matching with Langevin dynamics.
arXiv Detail & Related papers (2020-06-19T17:24:44Z)
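For reference, the weighted variational bound mentioned above reduces to the simplified noise-prediction objective of the DDPM paper, where $\epsilon_\theta$ is the denoising network and $\bar{\alpha}_t = \prod_{s=1}^{t}(1 - \beta_s)$ for a variance schedule $\beta_s$:

$$
L_{\mathrm{simple}}(\theta)
  = \mathbb{E}_{t,\, x_0,\, \epsilon \sim \mathcal{N}(0, I)}
    \Big[ \big\lVert \epsilon - \epsilon_\theta\big(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\ t\big) \big\rVert^2 \Big]
$$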
- Depthwise Discrete Representation Learning [2.728575246952532]
Recent advances in learning discrete representations have led to state-of-the-art results in tasks involving language, audio, and vision.
Some latent factors, such as words, phonemes, and shapes, are better represented by discrete latent variables than by continuous ones.
Vector Quantized Variational Autoencoders (VQVAE) have produced remarkable results in multiple domains.
arXiv Detail & Related papers (2020-04-11T18:57:13Z)
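The core VQ-VAE operation behind the entry above is a nearest-codeword lookup with a straight-through gradient. Below is a minimal generic sketch; tensor shapes are assumptions.

```python
# Minimal sketch of VQ-VAE vector quantization: nearest-codebook lookup
# with a straight-through gradient; shapes are illustrative assumptions.
import torch

def quantize(z_e: torch.Tensor, codebook: torch.Tensor):
    """z_e: encoder outputs (batch, length, d); codebook: (K, d) entries.
    Returns quantized latents and the chosen codebook indices."""
    # Distance from every latent vector to every codebook entry.
    dists = torch.cdist(z_e, codebook.unsqueeze(0).expand(z_e.size(0), -1, -1))
    indices = dists.argmin(dim=-1)        # (batch, length)
    z_q = codebook[indices]               # (batch, length, d)
    # Straight-through estimator: the forward pass uses z_q, but gradients
    # flow back to the encoder as if quantization were the identity.
    z_q = z_e + (z_q - z_e).detach()
    return z_q, indices
```

The returned indices are exactly the discrete tokens that sequence models, including the discrete diffusion model of the main paper, are trained on.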
- Continuous Melody Generation via Disentangled Short-Term Representations and Structural Conditions [14.786601824794369]
We present a model for composing melodies given a user specified symbolic scenario combined with a previous music context.
Our model is capable of generating long melodies by treating 8-beat note sequences as basic units, and of sharing a consistent rhythm-pattern structure with another specific song.
Results show that the music generated by our model tends to have salient repetition structures, rich motives, and stable rhythm patterns.
arXiv Detail & Related papers (2020-02-05T06:23:44Z)
- Learning Style-Aware Symbolic Music Representations by Adversarial Autoencoders [9.923470453197657]
We focus on leveraging adversarial regularization as a flexible and natural means of imbuing variational autoencoders with context information.
We introduce the first Music Adversarial Autoencoder (MusAE).
Our model has a higher reconstruction accuracy than state-of-the-art models based on standard variational autoencoders.
arXiv Detail & Related papers (2020-01-15T18:07:20Z)
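The adversarial regularization referred to above follows the adversarial-autoencoder recipe, in which a discriminator pushes the encoder's latent codes toward a chosen prior. The sketch below is a generic version with an assumed Gaussian prior and a logit-valued discriminator, not MusAE's exact architecture or losses.

```python
# Generic sketch of adversarial latent regularization (adversarial-
# autoencoder style); the Gaussian prior and shapes are assumptions,
# not MusAE's exact losses.
import torch
import torch.nn.functional as F

def adversarial_latent_losses(encoder, discriminator, x):
    """Returns (discriminator loss, encoder regularization loss)."""
    z_post = encoder(x)                  # latent codes for real music
    z_prior = torch.randn_like(z_post)   # samples from the target prior
    logit_prior = discriminator(z_prior)
    logit_post = discriminator(z_post.detach())
    # Discriminator: label prior samples 1 and encoder outputs 0.
    d_loss = (F.binary_cross_entropy_with_logits(
                  logit_prior, torch.ones_like(logit_prior))
              + F.binary_cross_entropy_with_logits(
                  logit_post, torch.zeros_like(logit_post)))
    # Encoder tries to fool the discriminator, pushing the aggregate
    # posterior toward the prior (the regularization described above).
    logit_fool = discriminator(z_post)
    g_loss = F.binary_cross_entropy_with_logits(
        logit_fool, torch.ones_like(logit_fool))
    return d_loss, g_loss
```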