Jukebox: A Generative Model for Music
- URL: http://arxiv.org/abs/2005.00341v1
- Date: Thu, 30 Apr 2020 09:02:45 GMT
- Title: Jukebox: A Generative Model for Music
- Authors: Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec
Radford, Ilya Sutskever
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce Jukebox, a model that generates music with singing in the raw
audio domain. We tackle the long context of raw audio using a multi-scale
VQ-VAE to compress it to discrete codes, and modeling those using
autoregressive Transformers. We show that the combined model at scale can
generate high-fidelity and diverse songs with coherence up to multiple minutes.
We can condition on artist and genre to steer the musical and vocal style, and
on unaligned lyrics to make the singing more controllable. We are releasing
thousands of non cherry-picked samples at https://jukebox.openai.com, along
with model weights and code at https://github.com/openai/jukebox
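The abstract compresses raw audio to discrete codes with a multi-scale VQ-VAE before modeling them with Transformers. At its core, vector quantization snaps each continuous latent vector to its nearest entry in a learned codebook; the sketch below shows that step in NumPy, with illustrative shapes and codebook size that are not Jukebox's actual configuration.

```python
import numpy as np

def quantize(latents, codebook):
    """Map continuous latent vectors to their nearest codebook entries.

    latents:  (T, D) array of encoder outputs, one D-dim vector per timestep.
    codebook: (K, D) array of learned code vectors.
    Returns (T,) integer code indices and the (T, D) quantized vectors.
    """
    # Squared Euclidean distance from every latent to every code vector,
    # via broadcasting: (T, 1, D) - (1, K, D) -> (T, K, D) -> (T, K).
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    codes = dists.argmin(axis=1)  # one discrete token per timestep
    return codes, codebook[codes]

# Toy example: 8 timesteps of 4-dim latents, a 16-entry codebook.
rng = np.random.default_rng(0)
latents = rng.normal(size=(8, 4))
codebook = rng.normal(size=(16, 4))
codes, quantized = quantize(latents, codebook)
```

Jukebox applies this quantization at three temporal resolutions (the "multi-scale" part), then trains autoregressive Transformers over the resulting code sequences rather than over raw waveform samples.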
Related papers
- SongCreator: Lyrics-based Universal Song Generation
SongCreator is a song-generation system designed to tackle the challenge of generating songs with both vocals and accompaniment given lyrics.
The model features two novel designs: a meticulously designed dual-sequence language model (DSLM) to capture the information of vocals and accompaniment for song generation, and a series of attention-mask strategies for the DSLM.
Experiments demonstrate the effectiveness of SongCreator by achieving state-of-the-art or competitive performances on all eight tasks.
arXiv Detail & Related papers (2024-09-09)
- SaMoye: Zero-shot Singing Voice Conversion Model Based on Feature Disentanglement and Enhancement
Singing voice conversion (SVC) aims to convert a singer's voice to another singer's from a reference audio while keeping the original semantics.
We propose SaMoye, the first open-source high-quality zero-shot SVC model, which can convert a singing voice to both human and non-human timbres.
arXiv Detail & Related papers (2024-07-10)
- Audiobox: Unified Audio Generation with Natural Language Prompts
This paper presents Audiobox, a unified model based on flow-matching that is capable of generating various audio modalities.
We design description-based and example-based prompting to enhance controllability and unify speech and sound generation paradigms.
Audiobox sets new benchmarks on speech and sound generation and unlocks new methods for generating audio with novel vocal and acoustic styles.
arXiv Detail & Related papers (2023-12-25)
- VampNet: Music Generation via Masked Acoustic Token Modeling
We introduce VampNet, a masked acoustic token modeling approach to music synthesis, compression, inpainting, and variation.
VampNet is non-autoregressive, leveraging a bidirectional transformer architecture that attends to all tokens in a forward pass.
We show that by prompting VampNet in various ways, we can apply it to tasks like music compression, inpainting, outpainting, continuation, and looping with variation.
arXiv Detail & Related papers (2023-07-10)
- High-Fidelity Audio Compression with Improved RVQGAN
We introduce a high-fidelity universal neural audio compression algorithm that achieves 90x compression of 44.1 kHz audio into tokens at just 8 kbps bandwidth.
We compress all domains (speech, environment, music, etc.) with a single universal model, making it widely applicable to generative modeling of all audio.
arXiv Detail & Related papers (2023-06-11)
- Simple and Controllable Music Generation
MusicGen is a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens.
Unlike prior work, MusicGen comprises a single-stage transformer LM together with efficient token interleaving patterns.
arXiv Detail & Related papers (2023-06-08)
- SingSong: Generating musical accompaniments from singing
We present SingSong, a system that generates instrumental music to accompany input vocals.
In a pairwise comparison with the same vocal inputs, listeners expressed a significant preference for instrumentals generated by SingSong.
arXiv Detail & Related papers (2023-01-30)
- Msanii: High Fidelity Music Synthesis on a Shoestring Budget
We present Msanii, a novel diffusion-based model for synthesizing high-fidelity music efficiently.
Our model combines the synthesis of mel spectrograms, the generative capabilities of diffusion models, and the vocoding capabilities of neural vocoders.
arXiv Detail & Related papers (2023-01-16)
- AudioGen: Textually Guided Audio Generation
We tackle the problem of generating audio samples conditioned on descriptive text captions.
In this work, we propose AudioGen, an auto-regressive model that generates audio samples conditioned on text inputs.
arXiv Detail & Related papers (2022-09-30)
- Learning the Beauty in Songs: Neural Singing Voice Beautifier
We are interested in a novel task, singing voice beautifying (SVB). Given the singing voice of an amateur singer, SVB aims to improve the intonation and vocal tone of the voice while keeping the content and vocal timbre.
We introduce Neural Singing Voice Beautifier (NSVB), the first generative model to solve the SVB task.
arXiv Detail & Related papers (2022-02-27)
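Several entries above (Improved RVQGAN, MusicGen) tokenize audio with neural codecs built on residual vector quantization: each stage quantizes the residual left by the previous stage, so a stack of small codebooks approximates a latent far better than a single codebook, and the stages yield the parallel token streams that MusicGen interleaves. A minimal sketch of that residual step, with illustrative sizes rather than any specific codec's configuration:

```python
import numpy as np

def residual_quantize(latent, codebooks):
    """Quantize one latent vector with a stack of codebooks (RVQ).

    latent:    (D,) vector to encode.
    codebooks: list of (K, D) arrays, one per quantization stage.
    Returns one code index per stage and the summed reconstruction.
    """
    recon = np.zeros_like(latent)
    codes = []
    for cb in codebooks:
        residual = latent - recon            # what earlier stages missed
        dists = ((residual[None, :] - cb) ** 2).sum(axis=1)
        idx = int(dists.argmin())            # nearest entry to the residual
        codes.append(idx)
        recon += cb[idx]                     # refine the reconstruction
    return codes, recon

# Toy example: an 8-dim latent, four 32-entry codebooks with shrinking scale.
rng = np.random.default_rng(1)
latent = rng.normal(size=8)
codebooks = [rng.normal(size=(32, 8)) * 0.5 ** i for i in range(4)]
codes, recon = residual_quantize(latent, codebooks)
```

Decoding simply sums the selected entries from each codebook, which is why the codes form several parallel streams over the same timesteps.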
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.