Multi-instrument Music Synthesis with Spectrogram Diffusion
- URL: http://arxiv.org/abs/2206.05408v1
- Date: Sat, 11 Jun 2022 03:26:15 GMT
- Title: Multi-instrument Music Synthesis with Spectrogram Diffusion
- Authors: Curtis Hawthorne, Ian Simon, Adam Roberts, Neil Zeghidour, Josh
Gardner, Ethan Manilow, Jesse Engel
- Abstract summary: We focus on a middle ground of neural synthesizers that can generate audio from MIDI sequences with arbitrary combinations of instruments in realtime.
We use a simple two-stage process: MIDI to spectrograms with an encoder-decoder Transformer, then spectrograms to audio with a generative adversarial network (GAN) spectrogram inverter.
We find this to be a promising first step towards interactive and expressive neural synthesis for arbitrary combinations of instruments and notes.
- Score: 19.81982315173444
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: An ideal music synthesizer should be both interactive and expressive,
generating high-fidelity audio in realtime for arbitrary combinations of
instruments and notes. Recent neural synthesizers have exhibited a tradeoff
between domain-specific models that offer detailed control of only specific
instruments and raw waveform models that can train on all of music but with
minimal control and slow generation.
In this work, we focus on a middle ground of neural synthesizers that can
generate audio from MIDI sequences with arbitrary combinations of instruments
in realtime. This enables training on a wide range of transcription datasets
with a single model, which in turn offers note-level control of composition and
instrumentation across a wide range of instruments.
We use a simple two-stage process: MIDI to spectrograms with an
encoder-decoder Transformer, then spectrograms to audio with a generative
adversarial network (GAN) spectrogram inverter. We compare training the decoder
as an autoregressive model and as a Denoising Diffusion Probabilistic Model
(DDPM) and find that the DDPM approach is superior both qualitatively and as
measured by audio reconstruction and Fréchet distance metrics.
Given the interactivity and generality of this approach, we find this to be a
promising first step towards interactive and expressive neural synthesis for
arbitrary combinations of instruments and notes.
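The two-stage pipeline above can be sketched at the shape level. This is an illustrative stand-in, not the authors' models: the frame rate, mel-bin count, sample rate, and function names are all assumptions, and the actual stage 1 is a trained encoder-decoder Transformer (autoregressive or DDPM) while stage 2 is a trained GAN vocoder.

```python
import numpy as np

# Hypothetical shape-level sketch of the paper's two-stage process:
# (1) MIDI note events -> mel-spectrogram frames, (2) spectrogram -> waveform.
# All constants and functions here are illustrative placeholders.

N_MELS = 128          # mel bins per spectrogram frame (assumed)
FRAMES_PER_SEC = 100  # decoder output frame rate (assumed)
SAMPLE_RATE = 16000   # audio sample rate (assumed)

def midi_to_spectrogram(note_events, seconds):
    """Stage 1 stand-in: a trained Transformer would emit mel-spectrogram
    frames conditioned on encoded MIDI tokens, either autoregressively or
    all at once via iterative DDPM denoising. Here we just return random
    frames of the right shape."""
    n_frames = seconds * FRAMES_PER_SEC
    rng = np.random.default_rng(len(note_events))
    return rng.standard_normal((n_frames, N_MELS))

def invert_spectrogram(mel):
    """Stage 2 stand-in: a GAN spectrogram inverter would synthesize audio;
    here we only reproduce the expected output length (silence)."""
    n_samples = mel.shape[0] * SAMPLE_RATE // FRAMES_PER_SEC
    return np.zeros(n_samples, dtype=np.float32)

# Note events as (onset_sec, offset_sec, pitch, program) tuples (assumed format).
notes = [(0.0, 0.5, 60, 0), (0.5, 1.0, 64, 0)]
mel = midi_to_spectrogram(notes, seconds=2)
audio = invert_spectrogram(mel)
print(mel.shape, audio.shape)  # (200, 128) (32000,)
```

Because both stages operate on spectrogram frames rather than raw samples, any transcription dataset (MIDI paired with audio) can supervise stage 1, which is what enables training a single model across many instruments.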
Related papers
- Synthesizer Sound Matching Using Audio Spectrogram Transformers [2.5944208050492183]
We introduce a synthesizer sound matching model based on the Audio Spectrogram Transformer.
We show that this model can reconstruct parameters of samples generated from a set of 16 parameters.
We also provide audio examples demonstrating the out-of-domain model performance in emulating vocal imitations.
arXiv Detail & Related papers (2024-07-23T16:58:14Z)
- DiffMoog: a Differentiable Modular Synthesizer for Sound Matching [48.33168531500444]
DiffMoog is a differentiable modular synthesizer with a comprehensive set of modules typically found in commercial instruments.
Being differentiable, it allows integration into neural networks, enabling automated sound matching.
We introduce an open-source platform that comprises DiffMoog and an end-to-end sound matching framework.
arXiv Detail & Related papers (2024-01-23T08:59:21Z)
- NAS-FM: Neural Architecture Search for Tunable and Interpretable Sound Synthesis based on Frequency Modulation [38.00669627261736]
We propose NAS-FM, which adopts neural architecture search (NAS) to build a differentiable frequency modulation (FM) synthesizer.
Tunable synthesizers with interpretable controls can be developed automatically from sounds without any prior expert knowledge.
arXiv Detail & Related papers (2023-05-22T09:46:10Z)
- Synthesizer Preset Interpolation using Transformer Auto-Encoders [4.213427823201119]
We introduce a bimodal auto-encoder neural network, which simultaneously processes presets using multi-head attention blocks, and audio using convolutions.
This model has been tested on a popular frequency modulation synthesizer with more than one hundred parameters.
After training, the proposed model can be integrated into commercial synthesizers for live or sound design tasks.
arXiv Detail & Related papers (2022-10-27T15:20:18Z)
- DDX7: Differentiable FM Synthesis of Musical Instrument Sounds [7.829520196474829]
Differentiable Digital Signal Processing (DDSP) has enabled nuanced audio rendering by Deep Neural Networks (DNNs)
We present Differentiable DX7 (DDX7), a lightweight architecture for neural FM resynthesis of musical instrument sounds.
arXiv Detail & Related papers (2022-08-12T08:39:45Z)
- NeuralDPS: Neural Deterministic Plus Stochastic Model with Multiband Excitation for Noise-Controllable Waveform Generation [67.96138567288197]
We propose a novel neural vocoder named NeuralDPS which can retain high speech quality and acquire high synthesis efficiency and noise controllability.
It generates waveforms at least 280 times faster than the WaveNet vocoder.
It is also 28% faster than WaveGAN's synthesis efficiency on a single core.
arXiv Detail & Related papers (2022-03-05T08:15:29Z)
- Fast accuracy estimation of deep learning based multi-class musical source separation [79.10962538141445]
We propose a method to evaluate the separability of instruments in any dataset without training and tuning a neural network.
Based on the oracle principle with an ideal ratio mask, our approach is an excellent proxy to estimate the separation performances of state-of-the-art deep learning approaches.
arXiv Detail & Related papers (2020-10-19T13:05:08Z)
- Vector-Quantized Timbre Representation [53.828476137089325]
This paper targets a more flexible synthesis of an individual timbre by learning an approximate decomposition of its spectral properties with a set of generative features.
We introduce an auto-encoder with a discrete latent space that is disentangled from loudness in order to learn a quantized representation of a given timbre distribution.
We detail results for translating audio between orchestral instruments and singing voice, as well as transfers from vocal imitations to instruments.
arXiv Detail & Related papers (2020-05-02T08:16:19Z)
- Synthesizer: Rethinking Self-Attention in Transformer Models [93.08171885200922]
Dot-product self-attention is widely regarded as central and indispensable to state-of-the-art Transformer models.
This paper investigates the true importance and contribution of the dot product-based self-attention mechanism on the performance of Transformer models.
arXiv Detail & Related papers (2020-03-30T16:05:47Z)
- VaPar Synth -- A Variational Parametric Model for Audio Synthesis [78.3405844354125]
We present VaPar Synth - a Variational Parametric Synthesizer which utilizes a conditional variational autoencoder (CVAE) trained on a suitable parametric representation.
We demonstrate our proposed model's capabilities via the reconstruction and generation of instrumental tones with flexible control over their pitch.
arXiv Detail & Related papers (2020-03-30T16:05:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.