PerTok: Expressive Encoding and Modeling of Symbolic Musical Ideas and Variations
- URL: http://arxiv.org/abs/2410.02060v1
- Date: Wed, 2 Oct 2024 22:11:31 GMT
- Title: PerTok: Expressive Encoding and Modeling of Symbolic Musical Ideas and Variations
- Authors: Julian Lenz, Anirudh Mani
- Abstract summary: Cadenza is a new multi-stage generative framework for predicting expressive variations of symbolic musical ideas.
The proposed framework comprises two sequential stages: 1) Composer and 2) Performer.
Our framework is designed, researched and implemented with the objective of providing inspiration for musicians.
- Score: 0.3683202928838613
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce Cadenza, a new multi-stage generative framework for predicting expressive variations of symbolic musical ideas as well as unconditional generations. To accomplish this we propose a novel MIDI encoding method, PerTok (Performance Tokenizer), that captures minute expressive details whilst reducing sequence length by up to 59% and vocabulary size by up to 95% for polyphonic, monophonic and rhythmic tasks. The proposed framework comprises two sequential stages: 1) Composer and 2) Performer. The Composer model is a transformer-based Variational Autoencoder (VAE) with Rotary Positional Embeddings (RoPE) and an autoregressive decoder modified to more effectively integrate the latent codes of the input musical idea. The Performer model is a bidirectional transformer encoder that is separately trained to predict velocities and microtimings on MIDI sequences. Objective and human evaluations demonstrate Cadenza's versatile capability in 1) matching other unconditional state-of-the-art symbolic models in musical quality whilst sounding more expressive, and 2) composing new, expressive ideas that are stylistically related to the input whilst providing novel ideas to the user. Our framework is designed, researched and implemented with the objective of ethically providing inspiration for musicians.
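The abstract describes PerTok only at a high level. As a minimal illustration of the general idea behind an expressive MIDI tokenization (coarse quantized positions for the Composer, separate bucketed microtiming and velocity tokens for the Performer), here is a sketch in which every constant and token name is hypothetical, not the published PerTok vocabulary:

```python
# Hypothetical sketch of an expressive MIDI tokenization in the spirit of
# PerTok: onsets are snapped to a coarse grid for the score-level (Composer)
# stage, while the small residual offset ("microtiming") and a bucketed
# velocity are kept as separate tokens for the Performer stage.
# All constants and token names are illustrative, not the paper's vocabulary.
from dataclasses import dataclass

TICKS_PER_BEAT = 480          # common MIDI resolution (assumption)
GRID = TICKS_PER_BEAT // 4    # 16th-note quantization grid
MICROTIMING_BINS = 16         # residual offsets bucketed into a few tokens
VELOCITY_BINS = 8             # 128 MIDI velocities collapsed to 8 buckets

@dataclass
class Note:
    onset_ticks: int          # absolute onset time in MIDI ticks
    pitch: int                # MIDI pitch 0-127
    velocity: int             # MIDI velocity 1-127

def tokenize(note: Note) -> list[str]:
    """Map one note to a small set of tokens."""
    grid_pos = round(note.onset_ticks / GRID)       # coarse score position
    residual = note.onset_ticks - grid_pos * GRID   # expressive deviation
    half = GRID // 2                                # clamp to [-half, half)
    residual = max(-half, min(half - 1, residual))
    micro_bin = (residual + half) * MICROTIMING_BINS // GRID
    vel_bin = note.velocity * VELOCITY_BINS // 128
    return [
        f"POS_{grid_pos}",     # quantized onset (Composer input)
        f"PITCH_{note.pitch}",
        f"MICRO_{micro_bin}",  # microtiming token (Performer target)
        f"VEL_{vel_bin}",      # velocity token (Performer target)
    ]

# A note played 23 ticks ahead of beat 2, moderately loud:
print(tokenize(Note(onset_ticks=2 * TICKS_PER_BEAT - 23, pitch=64, velocity=96)))
# ['POS_8', 'PITCH_64', 'MICRO_4', 'VEL_6']
```

Bucketing expressive detail this way, rather than encoding exact tick offsets, is one route to the vocabulary and sequence-length reductions the abstract claims.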
Related papers
- MuPT: A Generative Symbolic Music Pretrained Transformer [56.09299510129221]
We explore the application of Large Language Models (LLMs) to the pre-training of music.
To address the challenges associated with misaligned measures from different tracks during generation, we propose a Synchronized Multi-Track ABC Notation (SMT-ABC Notation).
Our contributions include a series of models capable of handling up to 8192 tokens, covering 90% of the symbolic music data in our training set.
(arXiv, 2024-04-09)
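MuPT's SMT-ABC variant is defined in its paper; for orientation, here is a small fragment of standard ABC notation (the base format SMT-ABC extends), carried in a Python string. Two voices make it easy to see why bars from different tracks must stay aligned during generation:

```python
# A two-voice fragment in standard ABC notation (not the paper's SMT-ABC
# variant). Each "|" closes a bar; both voices must contain the same number
# of beats per bar, which is exactly the alignment problem SMT-ABC targets.
abc_tune = """\
X:1
T:Two-Voice Example
M:4/4
L:1/8
K:C
V:1
C2 E2 G2 E2 | F2 A2 c4 |
V:2
C,4      G,4 | F,4  C,4 |
"""
print(abc_tune)
```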
- Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt [50.25271407721519]
We propose Prompt-Singer, the first singing-voice-synthesis (SVS) method that enables control over singer gender, vocal range and volume with natural language.
We adopt a model architecture based on a decoder-only transformer with a multi-scale hierarchy, and design a range-melody decoupled pitch representation.
Experiments show that our model achieves favorable controllability and audio quality.
(arXiv, 2024-03-18)
- Multi-view MidiVAE: Fusing Track- and Bar-view Representations for Long Multi-track Symbolic Music Generation [50.365392018302416]
We propose Multi-view MidiVAE, one of the first VAE methods to effectively model and generate long multi-track symbolic music.
We focus on instrumental characteristics and harmony as well as global and local information about the musical composition by employing a hybrid variational encoding-decoding strategy.
(arXiv, 2024-01-15)
- Simple and Controllable Music Generation [94.61958781346176]
MusicGen is a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens.
Unlike prior work, MusicGen consists of a single-stage transformer LM together with efficient token interleaving patterns.
(arXiv, 2023-06-08)
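The "efficient token interleaving patterns" above refer to MusicGen's delay pattern, in which the K parallel codebook streams produced by the audio codec are offset by one step each so that a single-stage LM can predict all of them. A simplified sketch of that interleaving (not the reference implementation):

```python
# Simplified sketch of MusicGen-style "delay" interleaving: codebook stream
# k is shifted right by k steps, so at any position the model conditions on
# coarser codebooks before finer ones. PAD marks shifted-in positions.
PAD = -1

def delay_interleave(codes: list[list[int]]) -> list[list[int]]:
    """codes[k][t] is the token of codebook k at time t; shift stream k by k."""
    K, T = len(codes), len(codes[0])
    out = [[PAD] * (T + K - 1) for _ in range(K)]
    for k in range(K):
        for t in range(T):
            out[k][t + k] = codes[k][t]
    return out

# 3 codebooks x 4 time steps
for row in delay_interleave([[10, 11, 12, 13],
                             [20, 21, 22, 23],
                             [30, 31, 32, 33]]):
    print(row)
# [10, 11, 12, 13, -1, -1]
# [-1, 20, 21, 22, 23, -1]
# [-1, -1, 30, 31, 32, 33]
```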
- Mel Spectrogram Inversion with Stable Pitch [0.0]
Vocoders are models capable of transforming a low-dimensional spectral representation of an audio signal, typically the mel spectrogram, to a waveform.
Recent vocoder models developed for speech achieve a high degree of realism.
Compared to speech, the texture of musical sound poses new challenges.
(arXiv, 2022-08-26)
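As a concrete, non-neural baseline for the task this entry studies (inverting a mel spectrogram back to a waveform), librosa ships a Griffin-Lim-based inversion; neural vocoders replace this classical phase estimate with a learned model. A minimal example (the demo clip is downloaded on first use):

```python
# Classical mel-spectrogram inversion via Griffin-Lim phase estimation,
# using librosa's built-in mel_to_audio. This is the baseline that neural
# vocoders improve on, especially for pitch stability in musical signals.
import librosa

y, sr = librosa.load(librosa.ex("trumpet"))               # bundled demo clip
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
y_hat = librosa.feature.inverse.mel_to_audio(mel, sr=sr)  # Griffin-Lim inside
print(y.shape, y_hat.shape)
```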
- Symphony Generation with Permutation Invariant Language Model [57.75739773758614]
We present a symbolic symphony music generation solution, SymphonyNet, based on a permutation invariant language model.
A novel transformer decoder architecture is introduced as the backbone for modeling extra-long sequences of symphony tokens.
Our empirical results show that the proposed approach can generate coherent, novel, complex and harmonious symphonies comparable to human compositions.
(arXiv, 2022-05-10)
- Deep Performer: Score-to-Audio Music Performance Synthesis [30.95307878579825]
Deep Performer is a novel system for score-to-audio music performance synthesis.
Unlike speech, music often contains polyphony and long notes.
We show that our proposed model can synthesize music with clear polyphony and harmonic structures.
(arXiv, 2022-02-12)
- An Empirical Evaluation of End-to-End Polyphonic Optical Music Recognition [24.377724078096144]
Piano and orchestral scores frequently exhibit polyphonic passages, which add a second dimension to the task.
We propose two novel formulations for end-to-end polyphonic OMR.
We observe new state-of-the-art performance with our multi-sequence detection decoder, RNNDecoder.
(arXiv, 2021-08-03)
- MuseMorphose: Full-Song and Fine-Grained Music Style Transfer with Just One Transformer VAE [36.9033909878202]
Transformers and variational autoencoders (VAEs) have been extensively employed for symbolic-domain (e.g., MIDI) music generation.
In this paper, we are interested in bringing the two together to construct a single model that exhibits both strengths.
Experiments show that MuseMorphose outperforms recurrent neural network (RNN) based prior art on numerous widely-used metrics for style transfer tasks.
(arXiv, 2021-05-10)
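MuseMorphose's specific conditioning mechanism is described in its paper; the generic transformer-VAE bottleneck that it, and Cadenza's Composer above, both build on (encode the sequence, reparameterize a latent, condition the decoder on it) can be sketched as follows, with every dimension chosen for illustration only:

```python
# Generic transformer-VAE sketch: pooled encoder states parameterize a
# Gaussian latent, which is reparameterized and fed to an autoregressive
# decoder as its memory. Dimensions are illustrative, not either paper's.
import torch
import torch.nn as nn

class TransformerVAE(nn.Module):
    def __init__(self, vocab=512, d_model=256, latent=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        enc = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=2)
        self.to_mu = nn.Linear(d_model, latent)
        self.to_logvar = nn.Linear(d_model, latent)
        self.from_latent = nn.Linear(latent, d_model)
        dec = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec, num_layers=2)
        self.out = nn.Linear(d_model, vocab)

    def forward(self, tokens):
        h = self.encoder(self.embed(tokens))              # (B, T, d_model)
        pooled = h.mean(dim=1)                            # summarize sequence
        mu, logvar = self.to_mu(pooled), self.to_logvar(pooled)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        memory = self.from_latent(z).unsqueeze(1)         # latent as memory
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        dec_h = self.decoder(self.embed(tokens), memory, tgt_mask=causal)
        return self.out(dec_h), mu, logvar

logits, mu, logvar = TransformerVAE()(torch.randint(0, 512, (2, 16)))
print(logits.shape)  # torch.Size([2, 16, 512])
```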
- Music Gesture for Visual Sound Separation [121.36275456396075]
"Music Gesture" is a keypoint-based structured representation to explicitly model the body and finger movements of musicians when they perform music.
We first adopt a context-aware graph network to integrate visual semantic context with body dynamics, and then apply an audio-visual fusion model to associate body movements with the corresponding audio signals.
(arXiv, 2020-04-20)
- Continuous Melody Generation via Disentangled Short-Term Representations and Structural Conditions [14.786601824794369]
We present a model for composing melodies given a user-specified symbolic scenario combined with a previous music context.
Our model is capable of generating long melodies by treating 8-beat note sequences as basic units, and shares a consistent rhythmic pattern structure with a specified reference song.
Results show that the music generated by our model tends to have salient repetition structures, rich motives, and stable rhythm patterns.
(arXiv, 2020-02-05)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.