VampNet: Music Generation via Masked Acoustic Token Modeling
- URL: http://arxiv.org/abs/2307.04686v2
- Date: Wed, 12 Jul 2023 17:06:41 GMT
- Title: VampNet: Music Generation via Masked Acoustic Token Modeling
- Authors: Hugo Flores Garcia, Prem Seetharaman, Rithesh Kumar, Bryan Pardo
- Abstract summary: We introduce VampNet, a masked acoustic token modeling approach to music synthesis, compression, inpainting, and variation.
VampNet is non-autoregressive, leveraging a bidirectional transformer architecture that attends to all tokens in a forward pass.
We show that by prompting VampNet in various ways, we can apply it to tasks like music compression, inpainting, outpainting, continuation, and looping with variation.
- Score: 11.893826325744055
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce VampNet, a masked acoustic token modeling approach to music
synthesis, compression, inpainting, and variation. We use a variable masking
schedule during training which allows us to sample coherent music from the
model by applying a variety of masking approaches (called prompts) during
inference. VampNet is non-autoregressive, leveraging a bidirectional
transformer architecture that attends to all tokens in a forward pass. With
just 36 sampling passes, VampNet can generate coherent high-fidelity musical
waveforms. We show that by prompting VampNet in various ways, we can apply it
to tasks like music compression, inpainting, outpainting, continuation, and
looping with variation (vamping). Appropriately prompted, VampNet is capable of
maintaining style, genre, instrumentation, and other high-level aspects of the
music. This flexible prompting capability makes VampNet a powerful music
co-creation tool. Code and audio samples are available online.
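The generation procedure the abstract describes can be pictured as MaskGIT-style iterative parallel decoding: the prompt decides which acoustic tokens stay fixed, everything else is masked, and the bidirectional transformer resamples the masked positions over a small number of passes, re-masking its least confident predictions each time. The sketch below is an illustration under stated assumptions, not the authors' released code: the `vamp_sample` helper, the `MASK_ID` value, the cosine unmasking schedule, and the single flattened token stream (VampNet actually operates on multiple codec codebooks) are all simplifications.

```python
import torch

MASK_ID = 1024     # hypothetical id of the special mask token (not from the paper)
NUM_PASSES = 36    # number of sampling passes reported in the abstract


def vamp_sample(model, tokens, prompt_mask, num_passes=NUM_PASSES, temperature=1.0):
    """Iteratively fill in masked acoustic tokens with a bidirectional model.

    tokens      : (batch, seq_len) LongTensor of acoustic codes
    prompt_mask : (batch, seq_len) BoolTensor, True where tokens should be generated
    model       : callable returning logits of shape (batch, seq_len, vocab)
    """
    tokens = tokens.clone()
    tokens[prompt_mask] = MASK_ID              # hide everything the prompt asks us to regenerate
    still_masked = prompt_mask.clone()
    total = prompt_mask.sum(dim=-1, keepdim=True)

    for step in range(num_passes):
        # one bidirectional forward pass attends to all tokens at once
        probs = (model(tokens) / temperature).softmax(dim=-1)
        sampled = torch.multinomial(probs.flatten(0, 1), 1).view_as(tokens)
        conf = probs.gather(-1, sampled.unsqueeze(-1)).squeeze(-1)

        # commit samples at the positions that were masked this pass
        tokens = torch.where(still_masked, sampled, tokens)

        # cosine schedule: fraction of generatable positions left masked next pass
        frac = torch.cos(torch.tensor((step + 1) / num_passes * torch.pi / 2))
        n_masked_next = (frac * total).long().clamp(min=0)

        # re-mask the lowest-confidence freshly sampled positions for the next pass
        conf = conf.masked_fill(~still_masked, float("inf"))   # fixed tokens stay fixed
        ranks = conf.argsort(dim=-1).argsort(dim=-1)           # 0 = least confident
        still_masked = ranks < n_masked_next
        tokens = torch.where(still_masked, torch.full_like(tokens, MASK_ID), tokens)

    return tokens
```

In this framing, a `prompt_mask` that is True only in the middle of a clip corresponds to inpainting, True only after a point to continuation or outpainting, and a periodic mask to the looping-with-variation (vamping) behaviour described above.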
Related papers
- SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation [75.86473375730392]
SongGen is a fully open-source, single-stage auto-regressive transformer for controllable song generation.
It supports two output modes: mixed mode, which generates a mixture of vocals and accompaniment directly, and dual-track mode, which synthesizes them separately.
To foster community engagement and future research, we will release our model weights, training code, annotated data, and preprocessing pipeline.
arXiv Detail & Related papers (2025-02-18T18:52:21Z)
- MusicFlow: Cascaded Flow Matching for Text Guided Music Generation [53.63948108922333]
MusicFlow is a cascaded text-to-music generation model based on flow matching.
We leverage masked prediction as the training objective, enabling the model to generalize to other tasks such as music infilling and continuation.
arXiv Detail & Related papers (2024-10-27T15:35:41Z)
- Arrange, Inpaint, and Refine: Steerable Long-term Music Audio Generation and Editing via Content-based Controls [6.176747724853209]
Large Language Models (LLMs) have shown promise in generating high-quality music, but their focus on autoregressive generation limits their utility in music editing tasks.
We propose a novel approach leveraging a parameter-efficient heterogeneous adapter combined with a masking training scheme.
Our method integrates frame-level content-based controls, facilitating track-conditioned music refinement and score-conditioned music arrangement.
arXiv Detail & Related papers (2024-02-14T19:00:01Z)
- Video2Music: Suitable Music Generation from Videos using an Affective Multimodal Transformer model [32.801213106782335]
We develop a generative music AI framework, Video2Music, that can generate music matching a provided video.
In a thorough experiment, we show that our proposed framework can generate music that matches the video content in terms of emotion.
arXiv Detail & Related papers (2023-11-02T03:33:00Z)
- Simple and Controllable Music Generation [94.61958781346176]
MusicGen is a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens.
Unlike prior work, MusicGen consists of a single-stage transformer LM together with efficient token interleaving patterns.
arXiv Detail & Related papers (2023-06-08T15:31:05Z)
- Symphony Generation with Permutation Invariant Language Model [57.75739773758614]
We present a symbolic symphony music generation solution, SymphonyNet, based on a permutation invariant language model.
A novel transformer decoder architecture is introduced as the backbone for modeling extra-long sequences of symphony tokens.
Our empirical results show that our proposed approach can generate coherent, novel, complex, and harmonious symphonies compared to human compositions.
arXiv Detail & Related papers (2022-05-10T13:08:49Z)
- MuseMorphose: Full-Song and Fine-Grained Music Style Transfer with Just One Transformer VAE [36.9033909878202]
Transformers and variational autoencoders (VAEs) have been extensively employed for symbolic-domain (e.g., MIDI) music generation.
In this paper, we are interested in bringing the two together to construct a single model that exhibits both strengths.
Experiments show that MuseMorphose outperforms recurrent neural network (RNN) based prior art on numerous widely-used metrics for style transfer tasks.
arXiv Detail & Related papers (2021-05-10T03:44:03Z)
- Multi-Instrumentalist Net: Unsupervised Generation of Music from Body Movements [20.627164135805852]
We propose a novel system that takes as an input body movements of a musician playing a musical instrument and generates music in an unsupervised setting.
We build a pipeline named 'Multi-instrumentalistNet' that learns a discrete latent representation of various instruments' music from log-spectrograms.
We show that MIDI can further condition the latent space such that the pipeline generates the exact content of the music being played by the instrument in the video.
arXiv Detail & Related papers (2020-12-07T06:54:10Z)
- Foley Music: Learning to Generate Music from Videos [115.41099127291216]
Foley Music is a system that can synthesize plausible music for a silent video clip of people playing musical instruments.
We first identify two key intermediate representations for a successful video to music generator: body keypoints from videos and MIDI events from audio recordings.
We present a Graph-Transformer framework that can accurately predict MIDI event sequences in accordance with the body movements.
arXiv Detail & Related papers (2020-07-21T17:59:06Z)
- Jukebox: A Generative Model for Music [75.242747436901]
Jukebox is a model that generates music with singing in the raw audio domain.
We tackle the long context of raw audio using a multi-scale VQ-VAE to compress it to discrete codes.
We show that the combined model at scale can generate high-fidelity and diverse songs with coherence up to multiple minutes.
arXiv Detail & Related papers (2020-04-30T09:02:45Z)