StemGen: A music generation model that listens
- URL: http://arxiv.org/abs/2312.08723v2
- Date: Tue, 16 Jan 2024 09:15:05 GMT
- Title: StemGen: A music generation model that listens
- Authors: Julian D. Parker, Janne Spijkervet, Katerina Kosta, Furkan Yesiler,
Boris Kuznetsov, Ju-Chiang Wang, Matt Avent, Jitong Chen, Duc Le
- Abstract summary: We present an alternative paradigm for producing music generation models that can listen and respond to musical context.
We describe how such a model can be constructed using a non-autoregressive, transformer-based model architecture.
The resulting model reaches the audio quality of state-of-the-art text-conditioned models while exhibiting strong musical coherence with its context.
- Score: 9.489938613869864
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: End-to-end generation of musical audio using deep learning techniques has
seen an explosion of activity recently. However, most models concentrate on
generating fully mixed music in response to abstract conditioning information.
In this work, we present an alternative paradigm for producing music generation
models that can listen and respond to musical context. We describe how such a
model can be constructed using a non-autoregressive, transformer-based model
architecture and present a number of novel architectural and sampling
improvements. We train the described architecture on both an open-source and a
proprietary dataset. We evaluate the produced models using standard quality
metrics and a new approach based on music information retrieval descriptors.
The resulting model reaches the audio quality of state-of-the-art
text-conditioned models while exhibiting strong musical coherence with its
context.
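The abstract describes the paradigm only at a high level; the sketch below illustrates one way such a "listening" model could be realized, assuming MaskGIT-style iterative decoding over discrete codec tokens with the context mix supplied as a parallel token stream. The class names, embedding-sum conditioning, and cosine re-masking schedule are illustrative assumptions, not the paper's exact architecture or sampling scheme.

```python
# Minimal sketch of a non-autoregressive transformer that generates stem
# tokens while conditioning on tokens of the existing musical context.
# Hypothetical names and shapes; positional encodings omitted for brevity.
import math
import torch
import torch.nn as nn


class ContextAwareStemModel(nn.Module):
    def __init__(self, vocab_size=1024, d_model=512, n_heads=8, n_layers=8):
        super().__init__()
        self.mask_id = vocab_size                      # extra id for masked slots
        self.stem_emb = nn.Embedding(vocab_size + 1, d_model)
        self.ctx_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, stem_tokens, context_tokens):
        # "Listening" modelled by summing context embeddings per time step.
        x = self.stem_emb(stem_tokens) + self.ctx_emb(context_tokens)
        return self.head(self.backbone(x))             # (B, T, vocab_size)


@torch.no_grad()
def generate_stem(model, context_tokens, steps=8):
    """Iterative non-autoregressive decoding: start fully masked, keep the
    most confident predictions, re-mask the rest on a cosine schedule."""
    B, T = context_tokens.shape
    tokens = torch.full((B, T), model.mask_id, dtype=torch.long)
    for s in range(steps):
        logits = model(tokens, context_tokens)
        probs, preds = logits.softmax(dim=-1).max(dim=-1)
        still_masked = tokens == model.mask_id
        tokens = torch.where(still_masked, preds, tokens)
        n_remask = int(T * math.cos(math.pi / 2 * (s + 1) / steps))
        if n_remask > 0:
            # Never re-mask positions committed in earlier steps.
            conf = probs.masked_fill(~still_masked, float("inf"))
            worst = conf.argsort(dim=-1)[:, :n_remask]
            tokens.scatter_(1, worst, model.mask_id)
    return tokens


if __name__ == "__main__":
    model = ContextAwareStemModel()
    context = torch.randint(0, 1024, (1, 256))   # tokens of the context mix
    stem = generate_stem(model, context)          # tokens for a codec decoder
    print(stem.shape)
```

In this sketch the model fills in all stem positions over a handful of refinement passes rather than one token at a time, which is what the non-autoregressive formulation buys in practice.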
Related papers
- UniMuMo: Unified Text, Music and Motion Generation [57.72514622935806]
We introduce UniMuMo, a unified multimodal model capable of taking arbitrary text, music, and motion data as input conditions to generate outputs across all three modalities.
By converting music, motion, and text into token-based representation, our model bridges these modalities through a unified encoder-decoder transformer architecture.
arXiv Detail & Related papers (2024-10-06T16:04:05Z)
- Stable Audio Open [8.799402694043955]
We describe the architecture and training process of a new open-weights text-to-audio model trained with Creative Commons data.
Our evaluation shows that the model's performance is competitive with the state-of-the-art across various metrics.
arXiv Detail & Related papers (2024-07-19T14:40:23Z)
- Diff-A-Riff: Musical Accompaniment Co-creation via Latent Diffusion Models [0.0]
"Diff-A-Riff" is a Latent Diffusion Model designed to generate high-quality instrumentals adaptable to any musical context.
It produces 48kHz pseudo-stereo audio while significantly reducing inference time and memory usage.
arXiv Detail & Related papers (2024-06-12T16:34:26Z)
- A Survey of Music Generation in the Context of Interaction [3.6522809408725223]
Machine learning has been successfully used to compose and generate music, both melodies and polyphonic pieces.
However, most of these models are not suitable for human-machine co-creation through live interaction.
arXiv Detail & Related papers (2024-02-23T12:41:44Z)
- Generative Pre-training for Speech with Flow Matching [81.59952572752248]
We pre-trained a generative model, named SpeechFlow, on 60k hours of untranscribed speech with Flow Matching and masked conditions.
Experiment results show the pre-trained generative model can be fine-tuned with task-specific data to match or surpass existing expert models on speech enhancement, separation, and synthesis.
arXiv Detail & Related papers (2023-10-25T03:40:50Z)
- Simple and Controllable Music Generation [94.61958781346176]
MusicGen is a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens.
Unlike prior work, MusicGen comprises a single-stage transformer LM together with efficient token interleaving patterns (a small interleaving sketch appears after this list).
arXiv Detail & Related papers (2023-06-08T15:31:05Z)
- MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training [74.32603591331718]
We propose an acoustic Music undERstanding model with large-scale self-supervised Training (MERT), which incorporates teacher models to provide pseudo labels for masked language modelling (MLM)-style acoustic pre-training.
Experimental results indicate that our model can generalise and perform well on 14 music understanding tasks and attain state-of-the-art (SOTA) overall scores.
arXiv Detail & Related papers (2023-05-31T18:27:43Z)
- ERNIE-Music: Text-to-Waveform Music Generation with Diffusion Models [67.66825818489406]
This paper introduces a text-to-waveform music generation model built on diffusion models.
Our method uses free-form textual prompts as conditioning to guide the waveform generation process.
We demonstrate that our generated music in the waveform domain outperforms previous works by a large margin in terms of diversity, quality, and text-music relevance.
arXiv Detail & Related papers (2023-02-09T06:27:09Z)
- The Power of Reuse: A Multi-Scale Transformer Model for Structural Dynamic Segmentation in Symbolic Music Generation [6.0949335132843965]
Symbolic Music Generation relies on the contextual representation capabilities of the generative model.
We propose a multi-scale Transformer, which uses a coarse decoder and fine decoders to model context at the global and section levels.
Our model is evaluated on two open MIDI datasets, and experiments show that our model outperforms the best contemporary symbolic music generative models.
arXiv Detail & Related papers (2022-05-17T18:48:14Z)
- Unsupervised Cross-Domain Singing Voice Conversion [105.1021715879586]
We present a wav-to-wav generative model for the task of singing voice conversion from any identity.
Our method combines an acoustic model trained for automatic speech recognition with extracted melody features to drive a waveform-based generator.
arXiv Detail & Related papers (2020-08-06T18:29:11Z)
- Generative Modelling for Controllable Audio Synthesis of Expressive Piano Performance [6.531546527140474]
We present a controllable neural audio synthesizer based on Gaussian Mixture Variational Autoencoders (GM-VAE).
We demonstrate how the model is able to apply fine-grained style morphing over the course of the audio.
arXiv Detail & Related papers (2020-06-16T12:54:41Z)
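The MusicGen entry above mentions "efficient token interleaving patterns"; the sketch below illustrates the delay-style interleaving that lets a single-stage language model predict several codec codebooks in one pass. The function name, pad value, and array layout are illustrative assumptions, not MusicGen's exact implementation.

```python
# Hypothetical illustration of delay-style token interleaving: each codebook
# row is shifted right by its index so that, at any decoding step, finer
# codebooks can condition on coarser codebooks from earlier steps.
import numpy as np


def delay_interleave(codes: np.ndarray, pad_id: int = -1) -> np.ndarray:
    """codes: (K, T) codec tokens, one row per codebook; returns (K, T + K - 1)."""
    K, T = codes.shape
    out = np.full((K, T + K - 1), pad_id, dtype=codes.dtype)
    for k in range(K):
        out[k, k:k + T] = codes[k]   # codebook k delayed by k steps
    return out


codes = np.arange(12).reshape(3, 4)   # 3 codebooks, 4 time frames
print(delay_interleave(codes))
```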
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.