Msanii: High Fidelity Music Synthesis on a Shoestring Budget
- URL: http://arxiv.org/abs/2301.06468v1
- Date: Mon, 16 Jan 2023 15:18:26 GMT
- Title: Msanii: High Fidelity Music Synthesis on a Shoestring Budget
- Authors: Kinyugo Maina
- Abstract summary: We present Msanii, a novel diffusion-based model for synthesizing high-fidelity music efficiently.
Our model combines the expressiveness of mel spectrograms, the generative capabilities of diffusion models, and the vocoding capabilities of neural vocoders.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we present Msanii, a novel diffusion-based model for
synthesizing long-context, high-fidelity music efficiently. Our model combines
the expressiveness of mel spectrograms, the generative capabilities of
diffusion models, and the vocoding capabilities of neural vocoders. We
demonstrate the effectiveness of Msanii by synthesizing long samples (190
seconds) of stereo music at high sample rates (44.1 kHz) without the use of
concatenative synthesis, cascading architectures, or compression techniques. To
the best of our knowledge, this is the first work to successfully employ a
diffusion-based model for synthesizing such long music samples at high sample
rates. Our demo can be found at https://kinyugo.github.io/msanii-demo and our
code at https://github.com/Kinyugo/msanii.
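The pipeline the abstract outlines has two stages: a diffusion model generates a mel spectrogram, and a neural vocoder converts it into a waveform. Below is a minimal PyTorch sketch of that shape of pipeline, not the paper's implementation; the denoiser, vocoder, sampler, and all sizes are illustrative placeholders.
```python
# Minimal sketch of a mel-diffusion + vocoder pipeline (illustrative only;
# every module and constant here is a placeholder, not Msanii's architecture).
import torch

N_STEPS, N_MELS, N_FRAMES = 50, 128, 1024

class PlaceholderDenoiser(torch.nn.Module):
    """Stands in for the trained spectrogram-denoising network."""
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Conv2d(1, 1, kernel_size=3, padding=1)

    def forward(self, x, t):
        return self.net(x)  # predicts the noise in x at diffusion step t

class PlaceholderVocoder(torch.nn.Module):
    """Stands in for the neural vocoder (mel spectrogram -> waveform)."""
    def __init__(self, hop=256):
        super().__init__()
        self.proj = torch.nn.Linear(N_MELS, hop)

    def forward(self, mel):                      # (B, N_MELS, N_FRAMES)
        frames = self.proj(mel.transpose(1, 2))  # (B, N_FRAMES, hop)
        return frames.reshape(mel.shape[0], -1)  # (B, N_FRAMES * hop) samples

@torch.no_grad()
def synthesize(denoiser, vocoder, batch=1):
    # Crude ancestral-sampling loop over the mel spectrogram; real samplers
    # follow a learned noise schedule.
    x = torch.randn(batch, 1, N_MELS, N_FRAMES)
    for t in reversed(range(N_STEPS)):
        x = x - denoiser(x, t) / N_STEPS
        if t > 0:
            x = x + torch.randn_like(x) * (1.0 / N_STEPS) ** 0.5
    return vocoder(x.squeeze(1))                 # mel -> waveform

wave = synthesize(PlaceholderDenoiser(), PlaceholderVocoder())
print(wave.shape)
```
Running diffusion in the mel domain keeps the sequence length far shorter than raw audio, which is what makes long 44.1 kHz synthesis tractable on a small budget.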
Related papers
- Multi-Source Music Generation with Latent Diffusion [7.832209959041259]
The Multi-Source Diffusion Model (MSDM) was proposed to model music as a mixture of multiple instrumental sources.
MSLDM employs Variational Autoencoders (VAEs) to encode each instrumental source into a distinct latent representation.
This approach significantly enhances the total and partial generation of music.
arXiv Detail & Related papers (2024-09-10T03:41:10Z)
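A toy sketch of the source-level latent scheme summarized above: each instrumental source gets its own VAE latent, a diffusion process runs over the stacked latents so inter-source dependencies can be modeled, and decoded sources are summed into the mixture. All modules, sizes, and the denoising update are illustrative assumptions, not MSLDM's code.
```python
# Toy sketch: per-source VAE latents + joint latent diffusion + summed decode.
import torch

SOURCES, LATENT, SAMPLES = 4, 64, 16384  # e.g. bass / drums / guitar / piano

class TinyVAE(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = torch.nn.Linear(SAMPLES, LATENT * 2)  # mean and log-variance
        self.dec = torch.nn.Linear(LATENT, SAMPLES)

    def encode(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        return mu + torch.randn_like(mu) * (0.5 * logvar).exp()

    def decode(self, z):
        return self.dec(z)

vaes = [TinyVAE() for _ in range(SOURCES)]  # one VAE per instrumental source

@torch.no_grad()
def sample_mixture(steps=20):
    # Diffusion over all source latents jointly; modeling them together is
    # what enables "partial" generation (completing a mix given some sources).
    z = torch.randn(1, SOURCES, LATENT)
    for _ in range(steps):
        z = 0.95 * z + 0.05 * torch.randn_like(z)  # stand-in for a learned denoiser
    sources = [vaes[i].decode(z[:, i]) for i in range(SOURCES)]
    return torch.stack(sources).sum(dim=0)         # mixture = sum of sources

print(sample_mixture().shape)  # torch.Size([1, 16384])
```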
- From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion [84.138804145918]
Deep generative models can generate high-fidelity audio conditioned on various types of representations.
These models are prone to generating audible artifacts when the conditioning is flawed or imperfect.
We propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality from low-bitrate discrete representations.
arXiv Detail & Related papers (2023-08-02T22:14:29Z)
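A rough sketch of the multi-band idea summarized above, with details beyond the summary assumed: split the waveform into frequency bands whose sum reconstructs the signal, so each band can be handled by its own token-conditioned diffusion decoder and the outputs simply added. The FFT masking below is a toy stand-in for a real filter bank.
```python
# Toy band split: the bands partition the spectrum, so they sum back to the
# original signal; per-band generators can then be trained independently.
import torch

def split_bands(x, n_bands=4):
    spec = torch.fft.rfft(x)
    edges = [i * spec.shape[-1] // n_bands for i in range(n_bands + 1)]
    bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        masked = torch.zeros_like(spec)
        masked[..., lo:hi] = spec[..., lo:hi]
        bands.append(torch.fft.irfft(masked, n=x.shape[-1]))
    return bands

x = torch.randn(8192)  # stand-in waveform
bands = split_bands(x)
# Key property: summing the per-band signals yields the full-band audio.
print(torch.allclose(sum(bands), x, atol=1e-4))  # True
```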
- HiddenSinger: High-Quality Singing Voice Synthesis via Neural Audio Codec and Latent Diffusion Models [25.966328901566815]
We propose HiddenSinger, a high-quality singing voice synthesis system using a neural audio codec and latent diffusion models.
In addition, the proposed model is extended to an unsupervised singing voice learning framework, HiddenSinger-U.
Experimental results demonstrate that our model outperforms previous models in terms of audio quality.
arXiv Detail & Related papers (2023-06-12T01:21:41Z)
- High-Fidelity Audio Compression with Improved RVQGAN [49.7859037103693]
We introduce a high-fidelity universal neural audio compression algorithm that achieves 90x compression of 44.1 kHz audio into tokens at just 8 kbps bandwidth.
We compress all domains (speech, environment, music, etc.) with a single universal model, making it widely applicable to generative modeling of all audio.
arXiv Detail & Related papers (2023-06-11T00:13:00Z)
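The discrete tokens come from residual vector quantization (the "RVQ" in RVQGAN): each stage quantizes the residual left by the previous stage, so every frame is described by one codebook index per stage. The sketch below illustrates the general mechanism with random codebooks; a real codec learns its codebooks end to end, so this is an assumption-laden illustration rather than the paper's algorithm.
```python
# Illustrative residual vector quantization with random (untrained) codebooks.
import torch

STAGES, CODES, DIM = 8, 1024, 64
codebooks = [torch.randn(CODES, DIM) for _ in range(STAGES)]

def rvq_encode(z):
    """z: (frames, DIM) -> (frames, STAGES) integer tokens."""
    residual, tokens = z.clone(), []
    for cb in codebooks:
        idx = torch.cdist(residual, cb).argmin(dim=-1)  # nearest code per frame
        tokens.append(idx)
        residual = residual - cb[idx]  # next stage quantizes what's left
    return torch.stack(tokens, dim=-1)

def rvq_decode(tokens):
    return sum(cb[tokens[:, s]] for s, cb in enumerate(codebooks))

z = torch.randn(100, DIM)  # stand-in encoder output frames
tokens = rvq_encode(z)     # (100, STAGES)
print(tokens.shape, rvq_decode(tokens).shape)
# Bitrate scales as frames_per_second * STAGES * log2(CODES) bits.
```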
- Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models [65.18102159618631]
Multimodal generative modeling has achieved milestones in text-to-image and text-to-video generation.
Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data.
We propose Make-An-Audio with a prompt-enhanced diffusion model that addresses these gaps.
arXiv Detail & Related papers (2023-01-30T04:44:34Z)
- MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation [70.74377373885645]
We propose the first joint audio-video generation framework that brings engaging watching and listening experiences simultaneously.
By design, MM-Diffusion consists of a sequential multi-modal U-Net that performs a joint denoising process.
Experiments show superior results in unconditional audio-video generation and zero-shot conditional tasks.
arXiv Detail & Related papers (2022-12-19T14:11:52Z)
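A toy sketch of the joint-denoising idea summarized above: audio and video latents are updated in a single loop, each step conditioning one modality on the other so the streams stay aligned. The scalar coupling here is a deliberate simplification; the paper's model exchanges information through a shared multi-modal U-Net.
```python
# Toy coupled denoising loop for two modalities (illustrative only).
import torch

@torch.no_grad()
def joint_sample(steps=25):
    audio = torch.randn(1, 16000)          # audio latent (placeholder shape)
    video = torch.randn(1, 16, 3, 32, 32)  # frames x C x H x W (placeholder)
    for _ in range(steps):
        a_ctx, v_ctx = video.mean(), audio.mean()  # stand-in cross-modal features
        audio = 0.95 * audio + 0.05 * a_ctx + 0.01 * torch.randn_like(audio)
        video = 0.95 * video + 0.05 * v_ctx + 0.01 * torch.randn_like(video)
    return audio, video

a, v = joint_sample()
print(a.shape, v.shape)
```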
- BinauralGrad: A Two-Stage Conditional Diffusion Probabilistic Model for Binaural Audio Synthesis [129.86743102915986]
We formulate the synthesis process from a different perspective by decomposing the binaural audio into a common part shared by both channels and channel-specific parts.
We propose BinauralGrad, a novel two-stage framework equipped with diffusion models to synthesize these parts respectively.
Experiment results show that BinauralGrad outperforms the existing baselines by a large margin in terms of both objective and subjective evaluation metrics.
arXiv Detail & Related papers (2022-05-30T02:09:26Z)
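A small sketch of the decomposition the summary refers to, assuming the common part is the channel average (one natural choice; the paper defines its own formulation): binaural audio splits exactly into a part shared by both channels and per-channel residuals, which the two stages can then model separately.
```python
# Common/specific decomposition of a stereo signal; exactly invertible.
import torch

left, right = torch.randn(48000), torch.randn(48000)

common = 0.5 * (left + right)  # content shared by both ears
specific_l = left - common     # per-channel residuals (localization cues etc.)
specific_r = right - common

# Nothing is lost by modeling (common, specific) instead of (left, right):
print(torch.allclose(common + specific_l, left, atol=1e-5))   # True
print(torch.allclose(common + specific_r, right, atol=1e-5))  # True
```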
- Deep Performer: Score-to-Audio Music Performance Synthesis [30.95307878579825]
Deep Performer is a novel system for score-to-audio music performance synthesis.
Unlike speech, music often contains polyphony and long notes.
We show that our proposed model can synthesize music with clear polyphony and harmonic structures.
arXiv Detail & Related papers (2022-02-12T10:36:52Z)
- RAVE: A variational autoencoder for fast and high-quality neural audio synthesis [2.28438857884398]
We introduce a Realtime Audio Variational autoEncoder (RAVE) allowing both fast and high-quality audio waveform synthesis.
We show that our model is the first able to generate 48 kHz audio signals, while simultaneously running 20 times faster than real-time on a standard laptop CPU.
arXiv Detail & Related papers (2021-11-09T09:07:30Z)
- One Billion Audio Sounds from GPU-enabled Modular Synthesis [5.5022962399775945]
synth1B1, a multi-modal audio corpus consisting of 1 billion 4-second synthesized sounds, is 100x larger than any audio dataset in the literature.
synth1B1 samples are deterministically generated on-the-fly 16200x faster than real-time (714 MHz) on a single GPU.
arXiv Detail & Related papers (2021-04-27T00:38:52Z)
- VaPar Synth -- A Variational Parametric Model for Audio Synthesis [78.3405844354125]
We present VaPar Synth - a Variational Parametric Synthesizer which utilizes a conditional variational autoencoder (CVAE) trained on a suitable parametric representation.
We demonstrate our proposed model's capabilities via the reconstruction and generation of instrumental tones with flexible control over their pitch.
arXiv Detail & Related papers (2020-03-30T16:05:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.