MIDI-DDSP: Detailed Control of Musical Performance via Hierarchical Modeling
- URL: http://arxiv.org/abs/2112.09312v1
- Date: Fri, 17 Dec 2021 04:15:42 GMT
- Title: MIDI-DDSP: Detailed Control of Musical Performance via Hierarchical Modeling
- Authors: Yusong Wu, Ethan Manilow, Yi Deng, Rigel Swavely, Kyle Kastner, Tim
Cooijmans, Aaron Courville, Cheng-Zhi Anna Huang, Jesse Engel
- Abstract summary: Musical expression requires control of both what notes are played, and how they are performed.
We introduce MIDI-DDSP, a hierarchical model of musical instruments that enables both realistic neural audio synthesis and detailed user control.
We demonstrate that this hierarchy can reconstruct high-fidelity audio, accurately predict performance attributes for a note sequence, independently manipulate the attributes of a given performance, and as a complete system, generate realistic audio from a novel note sequence.
- Score: 6.256118777336895
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Musical expression requires control of both what notes are played, and how
they are performed. Conventional audio synthesizers provide detailed expressive
controls, but at the cost of realism. Black-box neural audio synthesis and
concatenative samplers can produce realistic audio, but have few mechanisms for
control. In this work, we introduce MIDI-DDSP, a hierarchical model of musical
instruments that enables both realistic neural audio synthesis and detailed
user control. Starting from interpretable Differentiable Digital Signal
Processing (DDSP) synthesis parameters, we infer musical notes and high-level
properties of their expressive performance (such as timbre, vibrato, dynamics,
and articulation). This creates a 3-level hierarchy (notes, performance,
synthesis) that affords individuals the option to intervene at each level, or
utilize trained priors (performance given notes, synthesis given performance)
for creative assistance. Through quantitative experiments and listening tests,
we demonstrate that this hierarchy can reconstruct high-fidelity audio,
accurately predict performance attributes for a note sequence, independently
manipulate the attributes of a given performance, and as a complete system,
generate realistic audio from a novel note sequence. By utilizing an
interpretable hierarchy, with multiple levels of granularity, MIDI-DDSP opens
the door to assistive tools to empower individuals across a diverse range of
musical experience.
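
As a concrete illustration of the three-level hierarchy, the sketch below walks a note sequence through a performance prior and a synthesis prior, with a user intervention at the performance level. All names (`Note`, `performance_prior`, `synthesis_prior`) and attribute choices are hypothetical stand-ins, not the actual MIDI-DDSP API.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Note:                 # level 1: what is played
    pitch: int              # MIDI pitch number
    start: float            # onset time (s)
    duration: float         # length (s)

@dataclass
class PerformanceAttrs:     # level 2: how it is played
    vibrato: float
    dynamics: float
    articulation: float

def performance_prior(notes: List[Note]) -> List[PerformanceAttrs]:
    """Stand-in for the trained prior p(performance | notes)."""
    return [PerformanceAttrs(0.3, 0.7, 0.5) for _ in notes]

def synthesis_prior(notes: List[Note], attrs: List[PerformanceAttrs]) -> List[Dict]:
    """Stand-in for p(synthesis | performance): emit interpretable
    DDSP-style controls (f0, amplitude, vibrato depth) per note."""
    out = []
    for n, a in zip(notes, attrs):
        f0 = 440.0 * 2 ** ((n.pitch - 69) / 12)   # MIDI pitch -> Hz
        out.append({"f0": f0, "amplitude": a.dynamics, "vibrato": a.vibrato})
    return out

notes = [Note(60, 0.0, 0.5), Note(64, 0.5, 0.5), Note(67, 1.0, 1.0)]
attrs = performance_prior(notes)
attrs[1].vibrato = 0.9                    # intervene at the performance level only
controls = synthesis_prior(notes, attrs)  # feed these to a DDSP synthesizer
```

The point of the hierarchy is visible in the intervention line: editing one note's vibrato leaves the notes and all other attributes untouched.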
Related papers
- Annotation-Free MIDI-to-Audio Synthesis via Concatenative Synthesis and Generative Refinement [0.0]
CoSaRef is a MIDI-to-audio synthesis method that can be developed without MIDI-audio paired datasets.
It first performs concatenative synthesis based on MIDI inputs and then refines the resulting audio into realistic tracks using a diffusion-based deep generative model trained on audio-only datasets.
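A minimal sketch of this two-stage idea, with a decaying sine standing in for sample lookup and an identity function standing in for the diffusion refiner (both are placeholders, not CoSaRef's implementation):

```python
import numpy as np

SR = 16000

def render_note(pitch, dur):
    """Stand-in for looking up a one-shot sample; here, a decaying sine."""
    t = np.arange(int(dur * SR)) / SR
    f0 = 440.0 * 2 ** ((pitch - 69) / 12)
    return np.sin(2 * np.pi * f0 * t) * np.exp(-3 * t)

def concatenative_synth(midi_notes):
    """Stage 1: place per-note samples on a timeline (no paired data needed)."""
    end = max(start + dur for _, start, dur in midi_notes)
    out = np.zeros(int(end * SR))
    for pitch, start, dur in midi_notes:
        seg = render_note(pitch, dur)
        i = int(start * SR)
        out[i:i + len(seg)] += seg
    return out

def diffusion_refine(audio):
    """Stage 2 (hypothetical): a diffusion model trained on audio-only data
    would push the rough mix toward realistic timbre; identity placeholder."""
    return audio

rough = concatenative_synth([(60, 0.0, 0.4), (64, 0.4, 0.4), (67, 0.8, 0.8)])
final = diffusion_refine(rough)
```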
arXiv Detail & Related papers (2024-10-22T08:01:40Z)
- Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt [50.25271407721519]
We propose Prompt-Singer, the first singing-voice synthesis (SVS) method that enables control over singer gender, vocal range, and volume through natural language.
We adopt a model architecture based on a decoder-only transformer with a multi-scale hierarchy, and design a range-melody decoupled pitch representation.
Experiments show that our model achieves favorable controlling ability and audio quality.
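One way to read the range-melody decoupled pitch representation is as a split of log-F0 into a global register term and a register-invariant contour, so a prompt can move the singer's range without altering the melody. The decomposition below is an illustrative assumption, not the paper's exact scheme.

```python
import numpy as np

f0_hz = np.array([220.0, 246.9, 261.6, 293.7, 329.6])  # toy melody

log_f0 = np.log2(f0_hz)
range_term = log_f0.mean()      # global register: the prompt-controllable part
contour = log_f0 - range_term   # register-invariant melody

# "sing one octave higher": change only the range term
shifted_f0 = 2.0 ** (contour + range_term + 1.0)
print(shifted_f0 / f0_hz)       # -> all 2.0: contour preserved, register moved
```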
arXiv Detail & Related papers (2024-03-18T13:39:05Z)
- DiffMoog: a Differentiable Modular Synthesizer for Sound Matching [48.33168531500444]
DiffMoog is a differentiable modular synthesizer with a comprehensive set of modules typically found in commercial instruments.
Being differentiable, it allows integration into neural networks, enabling automated sound matching.
We introduce an open-source platform that comprises DiffMoog and an end-to-end sound matching framework.
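The payoff of differentiability is that sound matching reduces to gradient descent on synthesis parameters. The sketch below fits the gain and envelope decay of a single oscillator to a target tone with a plain waveform L2 loss; DiffMoog's modules and losses are richer, so treat this only as the general recipe.

```python
import torch

SR = 16000
t = torch.arange(SR) / SR
f0 = 440.0
target = 0.8 * torch.exp(-4.0 * t) * torch.sin(2 * torch.pi * f0 * t)

# learnable synthesis parameters, deliberately mis-initialized
gain = torch.tensor(0.1, requires_grad=True)
decay = torch.tensor(1.0, requires_grad=True)
opt = torch.optim.Adam([gain, decay], lr=0.05)

for _ in range(500):
    pred = gain * torch.exp(-decay * t) * torch.sin(2 * torch.pi * f0 * t)
    loss = (pred - target).pow(2).mean()   # simple waveform L2 loss
    opt.zero_grad(); loss.backward(); opt.step()

print(gain.item(), decay.item())           # approaches 0.8 and 4.0
```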
arXiv Detail & Related papers (2024-01-23T08:59:21Z)
- Performance Conditioning for Diffusion-Based Multi-Instrument Music Synthesis [15.670399197114012]
We propose enhancing control of multi-instrument synthesis by conditioning a generative model on a specific performance and recording environment.
Performance conditioning is a signal that directs the generative model to synthesize music with the style and timbre of specific instruments taken from specific performances.
Our prototype is evaluated on uncurated performances with diverse instrumentation, achieving state-of-the-art FAD realism scores.
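Mechanically, such conditioning can be as simple as a learned embedding per (performer, recording environment) that is concatenated with the note features fed to the generator. The sketch below shows that pattern with assumed module names and sizes, not the paper's architecture.

```python
import torch
import torch.nn as nn

n_performances, d_cond, d_note = 100, 64, 32

perf_embed = nn.Embedding(n_performances, d_cond)
generator = nn.Sequential(
    nn.Linear(d_note + d_cond, 128), nn.ReLU(), nn.Linear(128, 128)
)

note_features = torch.randn(1, 10, d_note)   # (batch, frames, features)
perf_id = torch.tensor([7])                  # which performance to imitate
cond = perf_embed(perf_id).unsqueeze(1).expand(-1, 10, -1)
out = generator(torch.cat([note_features, cond], dim=-1))
print(out.shape)                             # torch.Size([1, 10, 128])
```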
arXiv Detail & Related papers (2023-09-21T17:44:57Z)
- Simple and Controllable Music Generation [94.61958781346176]
MusicGen is a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens.
Unlike prior work, MusicGen comprises a single-stage transformer LM together with efficient token interleaving patterns.
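The interleaving idea fits in a few lines: with a "delay" pattern, codebook stream k is shifted k steps so a single left-to-right LM can predict all streams jointly. The helper below is illustrative, not MusicGen's implementation.

```python
import numpy as np

PAD = -1

def delay_interleave(codes: np.ndarray) -> np.ndarray:
    """codes: (K, T) token grid -> (K, T + K - 1) delayed grid."""
    K, T = codes.shape
    out = np.full((K, T + K - 1), PAD, dtype=codes.dtype)
    for k in range(K):
        out[k, k:k + T] = codes[k]   # stream k starts k steps later
    return out

codes = np.arange(8).reshape(2, 4)   # 2 codebooks, 4 frames
print(delay_interleave(codes))
# [[ 0  1  2  3 -1]
#  [-1  4  5  6  7]]
```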
arXiv Detail & Related papers (2023-06-08T15:31:05Z)
- Make-A-Voice: Unified Voice Synthesis With Discrete Representation [77.3998611565557]
Make-A-Voice is a unified framework for synthesizing and manipulating voice signals from discrete representations.
We show that Make-A-Voice exhibits superior audio quality and style similarity compared with competitive baseline models.
arXiv Detail & Related papers (2023-05-30T17:59:26Z)
- DDX7: Differentiable FM Synthesis of Musical Instrument Sounds [7.829520196474829]
Differentiable Digital Signal Processing (DDSP) has enabled nuanced audio rendering by Deep Neural Networks (DNNs).
We present Differentiable DX7 (DDX7), a lightweight architecture for neural FM resynthesis of musical instrument sounds.
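The core of FM synthesis is the classic operator equation y(t) = sin(2*pi*f_c*t + I(t) * sin(2*pi*f_m*t)), where in a neural FM resynthesizer a network would predict the time-varying index I(t) and envelopes. A two-operator example with arbitrary fixed values (the ratio and index here are assumptions, not DDX7's configuration):

```python
import numpy as np

SR = 16000
t = np.arange(SR) / SR           # one second

f_c = 440.0                      # carrier frequency (e.g., from note pitch)
ratio = 2.0                      # modulator:carrier frequency ratio
index = 3.0 * np.exp(-2 * t)     # decaying modulation index -> brightness decay

modulator = np.sin(2 * np.pi * ratio * f_c * t)
audio = np.sin(2 * np.pi * f_c * t + index * modulator)  # in [-1, 1]
```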
arXiv Detail & Related papers (2022-08-12T08:39:45Z)
- Multi-instrument Music Synthesis with Spectrogram Diffusion [19.81982315173444]
We focus on a middle ground of neural synthesizers that can generate audio from MIDI sequences with arbitrary combinations of instruments in real time.
We use a simple two-stage process: MIDI to spectrograms with an encoder-decoder Transformer, then spectrograms to audio with a generative adversarial network (GAN) spectrogram inverter.
We find this to be a promising first step towards interactive and expressive neural synthesis for arbitrary combinations of instruments and notes.
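The two-stage split keeps each stage simple: a sequence model emits a spectrogram, and a separate inverter emits audio. In the sketch below, a random placeholder stands in for the Transformer stage and Griffin-Lim stands in for the GAN inverter; neither is the paper's model.

```python
import torch
import torchaudio

def midi_to_spectrogram(midi_tokens):
    """Placeholder for the encoder-decoder Transformer stage."""
    return torch.rand(513, 100)   # |STFT| frames (n_fft=1024 -> 513 bins)

spec = midi_to_spectrogram(midi_tokens=[60, 64, 67])
inverter = torchaudio.transforms.GriffinLim(n_fft=1024)  # stand-in for the GAN
audio = inverter(spec)
print(audio.shape)
```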
arXiv Detail & Related papers (2022-06-11T03:26:15Z)
- Deep Performer: Score-to-Audio Music Performance Synthesis [30.95307878579825]
Deep Performer is a novel system for score-to-audio music performance synthesis.
Unlike speech, music often contains polyphony and long notes.
We show that our proposed model can synthesize music with clear polyphony and harmonic structures.
arXiv Detail & Related papers (2022-02-12T10:36:52Z)
- Vector-Quantized Timbre Representation [53.828476137089325]
This paper targets a more flexible synthesis of an individual timbre by learning an approximate decomposition of its spectral properties with a set of generative features.
We introduce an auto-encoder with a discrete latent space that is disentangled from loudness in order to learn a quantized representation of a given timbre distribution.
We detail results for translating audio between orchestral instruments and singing voice, as well as transfers from vocal imitations to instruments.
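The discrete latent space rests on vector quantization: encoder outputs are snapped to their nearest codebook vector, with a straight-through gradient so the auto-encoder remains trainable. A minimal sketch with assumed sizes and a plain L2 assignment:

```python
import torch

n_codes, d = 64, 16
codebook = torch.randn(n_codes, d)

def quantize(z: torch.Tensor) -> torch.Tensor:
    """z: (batch, d) continuous latents -> nearest codebook entries."""
    dists = torch.cdist(z, codebook)    # (batch, n_codes) L2 distances
    idx = dists.argmin(dim=1)
    z_q = codebook[idx]
    # straight-through estimator: forward uses z_q, gradient flows to z
    return z + (z_q - z).detach()

z = torch.randn(8, d, requires_grad=True)
z_q = quantize(z)
print(z_q.shape)                        # torch.Size([8, 16])
```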
arXiv Detail & Related papers (2020-07-13T12:35:45Z)
- VaPar Synth -- A Variational Parametric Model for Audio Synthesis [78.3405844354125]
We present VaPar Synth - a Variational Parametric Synthesizer which utilizes a conditional variational autoencoder (CVAE) trained on a suitable parametric representation.
We demonstrate our proposed model's capabilities via the reconstruction and generation of instrumental tones with flexible control over their pitch.
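A CVAE of this kind conditions both encoder and decoder on pitch, so the latent captures pitch-independent structure and generation can be transposed by swapping the condition. The sketch below assumes layer sizes and a one-hot pitch condition; it is not VaPar Synth's architecture.

```python
import torch
import torch.nn as nn

d_param, d_cond, d_z = 40, 12, 8   # parametric frame, pitch one-hot, latent

encoder = nn.Linear(d_param + d_cond, 2 * d_z)   # emits mean and log-variance
decoder = nn.Linear(d_z + d_cond, d_param)

x = torch.randn(1, d_param)        # e.g., spectral-envelope parameters
cond = nn.functional.one_hot(torch.tensor([3]), d_cond).float()

mu, logvar = encoder(torch.cat([x, cond], -1)).chunk(2, dim=-1)
z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization
x_hat = decoder(torch.cat([z, cond], -1))              # reconstruction

# generation: sample z, pick any pitch condition to control the output's pitch
x_new = decoder(torch.cat([torch.randn(1, d_z), cond], -1))
```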
arXiv Detail & Related papers (2020-03-30T16:05:47Z)