Pop Music Transformer: Beat-based Modeling and Generation of Expressive
Pop Piano Compositions
- URL: http://arxiv.org/abs/2002.00212v3
- Date: Mon, 10 Aug 2020 07:27:05 GMT
- Title: Pop Music Transformer: Beat-based Modeling and Generation of Expressive
Pop Piano Compositions
- Authors: Yu-Siang Huang, Yi-Hsuan Yang
- Abstract summary: We build a Pop Music Transformer that composes Pop piano music with better rhythmic structure than existing Transformer models.
In particular, we seek to impose a metrical structure in the input data, so that Transformers can be more easily aware of the beat-bar-phrase hierarchical structure in music.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A great number of deep learning based models have been recently proposed for
automatic music composition. Among these models, the Transformer stands out as
a prominent approach for generating expressive classical piano performance with
a coherent structure of up to one minute. The model is powerful in that it
learns abstractions of data on its own, without much human-imposed domain
knowledge or constraints. In contrast with this general approach, this paper
shows that Transformers can do even better for music modeling, when we improve
the way a musical score is converted into the data fed to a Transformer model.
In particular, we seek to impose a metrical structure in the input data, so
that Transformers can be more easily aware of the beat-bar-phrase hierarchical
structure in music. The new data representation maintains the flexibility of
local tempo changes, and provides explicit handles for controlling the rhythmic and harmonic
structure of music. With this approach, we build a Pop Music Transformer that
composes Pop piano music with better rhythmic structure than existing
Transformer models.
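As a rough illustration of the beat-based representation the abstract describes, the sketch below converts a few notes into a flat token sequence with explicit bar and quantized beat-position events (REMI-style). The 16-step grid, the token names, and the `Note`/`notes_to_remi` helpers are illustrative assumptions for this example rather than the authors' implementation; chord tokens are omitted for brevity.

```python
# Minimal sketch of a REMI-style, beat-based event representation.
# Token names, the 16-step grid per bar, and these helpers are illustrative
# assumptions based on the abstract, not the authors' actual code.
from dataclasses import dataclass
from typing import List

POSITIONS_PER_BAR = 16  # quantize each bar into 16 metrical positions


@dataclass
class Note:
    bar: int        # 0-based bar index
    position: int   # 0..POSITIONS_PER_BAR-1 within the bar
    pitch: int      # MIDI pitch number
    duration: int   # length in grid steps
    velocity: int   # MIDI velocity


def notes_to_remi(notes: List[Note], tempo_bpm: int = 120) -> List[str]:
    """Convert a list of notes into a flat, beat-aware token sequence."""
    tokens: List[str] = []
    current_bar = -1
    for note in sorted(notes, key=lambda n: (n.bar, n.position, n.pitch)):
        if note.bar != current_bar:
            tokens.append("Bar")              # explicit bar-line token
            current_bar = note.bar
        tokens.append(f"Position_{note.position + 1}/{POSITIONS_PER_BAR}")
        tokens.append(f"Tempo_{tempo_bpm}")   # local tempo token (simplified)
        tokens.append(f"Note-Velocity_{note.velocity}")
        tokens.append(f"Note-On_{note.pitch}")
        tokens.append(f"Note-Duration_{note.duration}")
    return tokens


if __name__ == "__main__":
    demo = [Note(bar=0, position=0, pitch=60, duration=4, velocity=80),
            Note(bar=0, position=8, pitch=64, duration=4, velocity=80),
            Note(bar=1, position=0, pitch=67, duration=8, velocity=90)]
    print(notes_to_remi(demo))
```

Making bar lines and beat positions explicit tokens is what lets the Transformer attend to the beat-bar hierarchy directly, while per-event tempo tokens retain the flexibility of local tempo changes mentioned in the abstract.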
Related papers
- Do we need more complex representations for structure? A comparison of note duration representation for Music Transformers (arXiv, 2024-10-14): This work asks whether off-the-shelf Music Transformer models perform just as well on structural similarity metrics using only unannotated MIDI information, and shows that a slight tweak to the most common representation yields small but significant improvements.
- UniMuMo: Unified Text, Music and Motion Generation (arXiv, 2024-10-06): UniMuMo is a unified multimodal model that takes arbitrary text, music, and motion data as input conditions and generates outputs across all three modalities. By converting music, motion, and text into token-based representations, the model bridges these modalities through a unified encoder-decoder Transformer architecture.
- MuseBarControl: Enhancing Fine-Grained Control in Symbolic Music Generation through Pre-Training and Counterfactual Loss (arXiv, 2024-07-05): The authors introduce a pre-training task designed to link control signals directly with corresponding musical tokens, then apply a novel counterfactual loss that promotes better alignment between the generated music and the control prompts.
- Grokking of Hierarchical Structure in Vanilla Transformers (arXiv, 2023-05-30): Transformer language models can learn to generalize hierarchically after training for extremely long periods; intermediate-depth models generalize better than both very deep and very shallow Transformers.
- Melody Infilling with User-Provided Structural Context (arXiv, 2022-10-06): This paper proposes a novel Transformer-based model for music score infilling and shows that it can harness the structural information effectively to generate higher-quality melodies in the style of pop.
- Compose & Embellish: Well-Structured Piano Performance Generation via A Two-Stage Approach (arXiv, 2022-09-17): A two-stage Transformer-based framework first Composes a lead sheet and then Embellishes it with accompaniment and expressive touches. Objective and subjective experiments show that Compose & Embellish halves the gap in structureness between a current state of the art and real performances, and also improves other musical aspects such as richness and coherence.
- Structural Biases for Improving Transformers on Translation into Morphologically Rich Languages (arXiv, 2022-08-11): TP-Transformer augments the traditional Transformer architecture with an additional component to represent structure; a second method imbues structure at the data level by segmenting the data with morphological tokenization. Each approach lets the network achieve better performance, but the improvement depends on the size of the dataset.
- The Power of Reuse: A Multi-Scale Transformer Model for Structural Dynamic Segmentation in Symbolic Music Generation (arXiv, 2022-05-17): Symbolic music generation relies on the contextual representation capabilities of the generative model. The proposed multi-scale Transformer uses a coarse decoder and fine decoders to model contexts at the global and section levels; on two open MIDI datasets it outperforms the best contemporary symbolic music generative models.
- Calliope -- A Polyphonic Music Transformer (arXiv, 2021-07-08): Calliope is a novel Transformer-based autoencoder for efficient modelling of multi-track sequences of polyphonic music. Experiments show that it improves the state of the art on musical sequence reconstruction and generation.
- Parameter Efficient Multimodal Transformers for Video Representation Learning (arXiv, 2020-12-08): This work reduces the parameters of multimodal Transformers in the context of audio-visual video representation learning by up to 80%, allowing the model to be trained end-to-end from scratch; the model is pretrained on 30-second clips from Kinetics-700 and transferred to audio-visual classification tasks.
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.