Compound Word Transformer: Learning to Compose Full-Song Music over
Dynamic Directed Hypergraphs
- URL: http://arxiv.org/abs/2101.02402v1
- Date: Thu, 7 Jan 2021 06:57:34 GMT
- Title: Compound Word Transformer: Learning to Compose Full-Song Music over
Dynamic Directed Hypergraphs
- Authors: Wen-Yi Hsiao, Jen-Yu Liu, Yin-Cheng Yeh, Yi-Hsuan Yang
- Abstract summary: We present a conceptually different approach that takes into account the type of the tokens, such as note types and metric types.
We show that the resulting model can be viewed as a learner over dynamic directed hypergraphs.
Our experiment shows that, compared to state-of-the-art models, the proposed model converges 5--10 times faster at training.
- Score: 34.976342712112476
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: To apply neural sequence models such as the Transformers to music generation
tasks, one has to represent a piece of music by a sequence of tokens drawn from
a finite set of pre-defined vocabulary. Such a vocabulary usually involves
tokens of various types. For example, to describe a musical note, one needs
separate tokens to indicate the note's pitch, duration, velocity (dynamics),
and placement (onset time) along the time grid. While different types of tokens
may possess different properties, existing models usually treat them equally,
in the same way as modeling words in natural languages. In this paper, we
present a conceptually different approach that explicitly takes into account
the type of the tokens, such as note types and metric types. And, we propose a
new Transformer decoder architecture that uses different feed-forward heads to
model tokens of different types. With an expansion-compression trick, we
convert a piece of music to a sequence of compound words by grouping
neighboring tokens, greatly reducing the length of the token sequences. We show
that the resulting model can be viewed as a learner over dynamic directed
hypergraphs. And, we employ it to learn to compose expressive Pop piano music
of full-song length (involving up to 10K individual tokens per song), both
conditionally and unconditionally. Our experiment shows that, compared to
state-of-the-art models, the proposed model converges 5--10 times faster at
training (i.e., within a day on a single GPU with 11 GB memory), and with
comparable quality in the generated music.
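To make the expansion-compression trick above concrete, here is a minimal sketch (in Python, not the authors' released code) of how neighboring note-level and metric-level tokens could be grouped into compound words; the field names, token families, and integer ids below are illustrative assumptions rather than the paper's exact vocabulary.
```python
from dataclasses import dataclass
from typing import List, Optional

# Illustrative token types; the paper's actual vocabulary differs in detail.
NOTE_FIELDS = ("pitch", "duration", "velocity")
METRIC_FIELDS = ("bar", "beat", "tempo", "chord")

@dataclass
class CompoundWord:
    """One time step: a group of co-occurring tokens, one slot per token type.

    Slots that are unused at this step (e.g. the metric slots at a note
    event) stay None and would later map to an [ignore]-style padding token.
    """
    family: str                      # "note" or "metric"
    pitch: Optional[int] = None
    duration: Optional[int] = None
    velocity: Optional[int] = None
    bar: Optional[int] = None
    beat: Optional[int] = None
    tempo: Optional[int] = None
    chord: Optional[int] = None

def group_note(pitch: int, duration: int, velocity: int) -> CompoundWord:
    # The tokens describing one note, which would occupy several positions
    # in a flat token sequence, collapse into a single compound word.
    return CompoundWord(family="note", pitch=pitch, duration=duration,
                        velocity=velocity)

def group_metric(bar: int, beat: int, tempo: int, chord: int) -> CompoundWord:
    return CompoundWord(family="metric", bar=bar, beat=beat,
                        tempo=tempo, chord=chord)

# A flat sequence of roughly 7 individual tokens becomes 2 compound words.
sequence: List[CompoundWord] = [
    group_metric(bar=1, beat=0, tempo=3, chord=12),   # arbitrary example ids
    group_note(pitch=60, duration=8, velocity=20),
]
print(len(sequence))  # 2 compound words instead of ~7 individual tokens
```
Because the tokens describing one event now share a single sequence position, the compound sequence is several times shorter than the flat token sequence, which is the source of the length reduction the abstract describes.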
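The type-aware decoder can be sketched in the same spirit: a shared causal Transformer stack produces one hidden state per compound word, and a separate feed-forward head predicts each token type from that state. The sketch below assumes PyTorch; the vocabulary sizes, embedding widths, and the use of nn.TransformerEncoder with a causal mask as a stand-in for the decoder are assumptions, not the paper's exact architecture.
```python
import torch
import torch.nn as nn

# Assumed vocabulary sizes per token type; the paper's values differ.
VOCAB_SIZES = {"family": 2, "pitch": 86, "duration": 17, "velocity": 24,
               "bar": 2, "beat": 16, "tempo": 58, "chord": 135}
D_MODEL = 512
EMB_DIM = 64  # per-type embedding width (assumed)

class TypeAwareDecoder(nn.Module):
    """Shared causal Transformer stack with one feed-forward head per token type."""

    def __init__(self, num_layers: int = 6, n_heads: int = 8):
        super().__init__()
        # One embedding table per token type; the embeddings of the tokens
        # forming a compound word are concatenated and projected to D_MODEL.
        self.embeds = nn.ModuleDict(
            {k: nn.Embedding(v, EMB_DIM) for k, v in VOCAB_SIZES.items()})
        self.in_proj = nn.Linear(EMB_DIM * len(VOCAB_SIZES), D_MODEL)
        layer = nn.TransformerEncoderLayer(
            d_model=D_MODEL, nhead=n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=num_layers)
        # A separate output head per token type.
        self.heads = nn.ModuleDict(
            {k: nn.Linear(D_MODEL, v) for k, v in VOCAB_SIZES.items()})

    def forward(self, tokens: dict) -> dict:
        # tokens[k]: (batch, seq_len) integer ids for token type k.
        x = torch.cat([self.embeds[k](tokens[k]) for k in VOCAB_SIZES], dim=-1)
        x = self.in_proj(x)
        seq_len = x.size(1)
        # Causal mask so each compound word attends only to its predecessors.
        causal = torch.triu(
            torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        h = self.backbone(x, mask=causal)
        # Each head predicts the next token of its own type.
        return {k: self.heads[k](h) for k in VOCAB_SIZES}

model = TypeAwareDecoder()
batch = {k: torch.randint(0, v, (1, 4)) for k, v in VOCAB_SIZES.items()}
logits = model(batch)
print({k: tuple(v.shape) for k, v in logits.items()})
```
Keeping one head per token type gives each type its own output distribution, which is in line with the paper's motivation for modeling note and metric tokens differently.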
Related papers
- ElasticTok: Adaptive Tokenization for Image and Video [109.75935878130582]
We introduce ElasticTok, a method that conditions on prior frames to adaptively encode a frame into a variable number of tokens.
During inference, ElasticTok can dynamically allocate tokens when needed.
Our evaluations on images and video demonstrate the effectiveness of our approach in efficient token usage.
arXiv Detail & Related papers (2024-10-10T20:54:15Z)
- WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling [65.30937248905958]
A crucial component of language models is the tokenizer, which compresses high-dimensional natural signals into lower-dimensional discrete tokens.
We introduce WavTokenizer, which offers several advantages over previous SOTA acoustic models in the audio domain.
WavTokenizer achieves state-of-the-art reconstruction quality with outstanding UTMOS scores and inherently contains richer semantic information.
arXiv Detail & Related papers (2024-08-29T13:43:36Z)
- Toucan: Token-Aware Character Level Language Modeling [44.85590844938571]
Toucan is an augmentation to character-level models to make them "token-aware".
We show significant speed-ups in character generation without a loss in language modeling performance.
Our approach leads to a greater number of longer character sequences being tokenized as single items.
arXiv Detail & Related papers (2023-11-15T00:57:51Z)
- Learning Mutually Informed Representations for Characters and Subwords [26.189422354038978]
We introduce the entanglement model, aiming to combine character and subword language models.
Inspired by vision-language models, our model treats characters and subwords as separate modalities.
We evaluate our model on text classification, named entity recognition, POS-tagging, and character-level sequence labeling.
arXiv Detail & Related papers (2023-11-14T02:09:10Z)
- Impact of time and note duration tokenizations on deep learning symbolic music modeling [0.0]
We analyze the common tokenization methods and experiment with time and note duration representations.
We demonstrate that explicitly encoding such information leads to better results on some tasks.
arXiv Detail & Related papers (2023-10-12T16:56:37Z)
- Byte Pair Encoding for Symbolic Music [0.0]
Byte Pair Encoding significantly decreases the sequence length while increasing the vocabulary size (a minimal sketch of the merge step appears after this list).
We leverage the embedding capabilities of such models with more expressive tokens, yielding both better results and faster inference in generation and classification tasks.
The source code is shared on Github, along with a companion website.
arXiv Detail & Related papers (2023-01-27T20:22:18Z)
- Learning to Look Inside: Augmenting Token-Based Encoders with Character-Level Information [29.633735942273997]
XRayEmb is a method for retrofitting existing token-based models with character-level information.
We show that incorporating XRayEmb's learned vectors into sequences of pre-trained token embeddings helps performance on both autoregressive and masked pre-trained transformer architectures.
arXiv Detail & Related papers (2021-08-01T08:09:26Z)
- Charformer: Fast Character Transformers via Gradient-based Subword Tokenization [50.16128796194463]
We propose a new model inductive bias that learns a subword tokenization end-to-end as part of the model.
We introduce a soft gradient-based subword tokenization module (GBST) that automatically learns latent subword representations from characters.
We additionally introduce Charformer, a deep Transformer model that integrates GBST and operates on the byte level.
arXiv Detail & Related papers (2021-06-23T22:24:14Z)
- Lattice-BERT: Leveraging Multi-Granularity Representations in Chinese Pre-trained Language Models [62.41139712595334]
We propose a novel pre-training paradigm for Chinese -- Lattice-BERT.
We construct a lattice graph from the characters and words in a sentence and feed all these text units into transformers.
We show that our model can bring an average increase of 1.5% on Chinese natural language understanding tasks under the 12-layer setting.
arXiv Detail & Related papers (2021-04-15T02:36:49Z)
- Fast End-to-End Speech Recognition via a Non-Autoregressive Model and Cross-Modal Knowledge Transferring from BERT [72.93855288283059]
We propose a non-autoregressive speech recognition model called LASO (Listen Attentively, and Spell Once).
The model consists of an encoder, a decoder, and a position-dependent summarizer (PDS).
arXiv Detail & Related papers (2021-02-15T15:18:59Z)
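As a footnote to the Byte Pair Encoding entry above, the merge step that trades sequence length for vocabulary size can be illustrated in a few lines. This is a generic BPE sketch over integer token ids and assumes nothing about that paper's actual implementation.
```python
from collections import Counter
from typing import List, Tuple

def most_frequent_pair(seq: List[int]) -> Tuple[int, int]:
    """Return the most frequent adjacent pair of token ids."""
    return Counter(zip(seq, seq[1:])).most_common(1)[0][0]

def merge_pair(seq: List[int], pair: Tuple[int, int], new_id: int) -> List[int]:
    """Replace every occurrence of `pair` with a single new token id."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

# A toy token sequence (arbitrary ids standing for pitch/duration/etc.).
tokens = [5, 9, 5, 9, 7, 5, 9, 7]
vocab_size = 10
for _ in range(2):                       # two merge rounds
    pair = most_frequent_pair(tokens)
    tokens = merge_pair(tokens, pair, vocab_size)
    vocab_size += 1                      # vocabulary grows by one merged symbol
    print(pair, "->", tokens)
# The sequence shrinks with every merge while the vocabulary grows,
# which is the trade-off described in the entry above.
```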