Related papers: Real-time Timbre Transfer and Sound Synthesis using DDSP

URL: http://arxiv.org/abs/2103.07220v1
Date: Fri, 12 Mar 2021 11:49:51 GMT
Title: Real-time Timbre Transfer and Sound Synthesis using DDSP
Authors: Francesco Ganis, Erik Frej Knudsen, Søren V. K. Lyster, Robin Otterbein, David Südholt and Cumhur Erkut
Abstract summary: We present a real-time implementation of the DDSP library embedded in a virtual synthesizer as a plug-in. We focused on timbre transfer from learned representations of real instruments to arbitrary sound inputs as well as controlling these models by MIDI. We developed a GUI for intuitive high-level controls which can be used for post-processing and manipulating the parameters estimated by the neural network.
Score: 1.7942265700058984
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Neural audio synthesis is an actively researched topic that has yielded a wide range of techniques leveraging machine learning architectures. Google Magenta developed a novel approach called Differentiable Digital Signal Processing (DDSP) that combines deep neural networks with preconditioned digital signal processing techniques, reaching state-of-the-art results, especially in timbre transfer applications. However, most of these techniques, including DDSP, are generally not applicable under real-time constraints, making them unsuitable for a musical workflow. In this paper, we present a real-time implementation of the DDSP library embedded in a virtual synthesizer as a plug-in that can be used in a Digital Audio Workstation. We focused on timbre transfer from learned representations of real instruments to arbitrary sound inputs as well as controlling these models by MIDI. Furthermore, we developed a GUI for intuitive high-level controls which can be used for post-processing and manipulating the parameters estimated by the neural network. We conducted an online user experience test with seven participants. The results indicated that our users found the interface appealing, easy to understand, and worth exploring further. At the same time, we identified issues in the timbre transfer quality, in some components we did not implement, and in the installation and distribution of our plug-in. The next iteration of our design will address these issues. Our real-time JUCE and MATLAB implementations are available at https://github.com/SMC704/juce-ddsp and https://github.com/SMC704/matlab-ddsp, respectively.
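
Illustrative sketch (not taken from the paper or its repositories): the abstract describes controlling a DDSP model by MIDI and manipulating synthesis parameters that a neural network would normally estimate. The Python below shows the general idea with a fixed-pitch harmonic additive oscillator. The names `midi_to_hz` and `harmonic_synth`, the 16 kHz sample rate, and the hand-picked harmonic weights are all assumptions made for this example. A real DDSP synthesizer uses time-varying pitch and amplitudes and adds a filtered-noise component, both omitted here.

```python
import math

SAMPLE_RATE = 16000  # assumed sample rate for this sketch


def midi_to_hz(note: float) -> float:
    """Convert a MIDI note number to frequency in Hz (equal temperament, A4 = 440 Hz)."""
    return 440.0 * 2.0 ** ((note - 69.0) / 12.0)


def harmonic_synth(f0_hz, amplitude, harmonic_weights, n_samples,
                   sample_rate=SAMPLE_RATE):
    """Render a static harmonic tone as a list of samples.

    harmonic_weights[k-1] is the non-negative weight of the k-th harmonic.
    In a DDSP model these weights would come from a neural network;
    here they are hand-picked placeholders.
    """
    nyquist = sample_rate / 2.0
    # Drop harmonics at or above Nyquist to avoid aliasing.
    kept = [(k, w) for k, w in enumerate(harmonic_weights, start=1)
            if k * f0_hz < nyquist]
    total = sum(w for _, w in kept)
    if total <= 0.0:
        return [0.0] * n_samples
    samples = []
    for n in range(n_samples):
        phase = 2.0 * math.pi * f0_hz * n / sample_rate
        # Normalised weights sum to 1, so the peak level never exceeds `amplitude`.
        value = sum((w / total) * math.sin(k * phase) for k, w in kept)
        samples.append(amplitude * value)
    return samples


# Usage: 10 ms of MIDI note 69 (A4) with three decaying harmonics.
tone = harmonic_synth(midi_to_hz(69), 0.5, [1.0, 0.5, 0.25], 160)
```

Normalising the harmonic weights keeps overall loudness under the control of a single amplitude parameter, which is the kind of high-level control a GUI could expose for post-processing.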

Related papers

Designing Neural Synthesizers for Low-Latency Interaction [8.27756937768806]
We investigate the sources of latency and jitter typically found in interactive Neural Audio Synthesis (NAS) models. We then apply this analysis to the task of timbre transfer using RAVE, a convolutional variational autoencoder. This culminates in a model we call BRAVE, which is low-latency and exhibits better pitch and loudness replication.
arXiv Detail & Related papers (2025-03-14T16:30:31Z)
TIM: A Time Interval Machine for Audio-Visual Action Recognition [64.24297230981168]
We address the interplay between the two modalities in long videos by explicitly modelling the temporal extents of audio and visual events. We propose the Time Interval Machine (TIM) where a modality-specific time interval poses as a query to a transformer encoder. We test TIM on three long audio-visual video datasets: EPIC-KITCHENS, Perception Test, and AVE.
arXiv Detail & Related papers (2024-04-08T14:30:42Z)
High Fidelity Neural Audio Compression [92.4812002532009]
We introduce a state-of-the-art real-time, high-fidelity audio codec leveraging neural networks. It consists of a streaming encoder-decoder architecture with a quantized latent space trained in an end-to-end fashion. We simplify and speed up the training by using a single multiscale spectrogram adversary.
arXiv Detail & Related papers (2022-10-24T17:52:02Z)
PARTIME: Scalable and Parallel Processing Over Time with Deep Neural Networks [68.96484488899901]
We present PARTIME, a library designed to speed up neural networks whenever data is continuously streamed over time. PARTIME starts processing each data sample at the time in which it becomes available from the stream. Experiments are performed in order to empirically compare PARTIME with classic non-parallel neural computations in online learning.
arXiv Detail & Related papers (2022-10-17T14:49:14Z)
DDX7: Differentiable FM Synthesis of Musical Instrument Sounds [7.829520196474829]
Differentiable Digital Signal Processing (DDSP) has enabled nuanced audio rendering by Deep Neural Networks (DNNs). We present Differentiable DX7 (DDX7), a lightweight architecture for neural FM resynthesis of musical instrument sounds.
arXiv Detail & Related papers (2022-08-12T08:39:45Z)
MAPLE-X: Latency Prediction with Explicit Microprocessor Prior Knowledge [87.41163540910854]
Deep neural network (DNN) latency characterization is a time-consuming process. We propose MAPLE-X which extends MAPLE by incorporating explicit prior knowledge of hardware devices and DNN architecture latency.
arXiv Detail & Related papers (2022-05-25T11:08:20Z)
Streamable Neural Audio Synthesis With Non-Causal Convolutions [1.8275108630751844]
We introduce a new method for producing non-causal streaming models. This makes any convolutional model compatible with real-time buffer-based processing. We show how our method can be adapted to fit complex architectures with parallel branches.
arXiv Detail & Related papers (2022-04-14T16:00:32Z)
Latent Space Explorations of Singing Voice Synthesis using DDSP [2.7920304852537527]
Machine learning based singing voice models require large datasets and lengthy training times. We present a lightweight architecture that is able to output song-like utterances conditioned only on pitch and amplitude. We present two zero-configuration tools to train new models and experiment with them.
arXiv Detail & Related papers (2021-03-12T10:38:29Z)
End-to-End Object Detection with Fully Convolutional Network [71.56728221604158]
We introduce a Prediction-aware One-To-One (POTO) label assignment for classification to enable end-to-end detection. A simple 3D Max Filtering (3DMF) is proposed to utilize the multi-scale features and improve the discriminability of convolutions in the local region. Our end-to-end framework achieves competitive performance against many state-of-the-art detectors with NMS on COCO and CrowdHuman datasets.
arXiv Detail & Related papers (2020-12-07T09:14:55Z)
VaPar Synth -- A Variational Parametric Model for Audio Synthesis [78.3405844354125]
We present VaPar Synth - a Variational Parametric Synthesizer which utilizes a conditional variational autoencoder (CVAE) trained on a suitable parametric representation. We demonstrate our proposed model's capabilities via the reconstruction and generation of instrumental tones with flexible control over their pitch.
arXiv Detail & Related papers (2020-03-30T16:05:47Z)
DDSP: Differentiable Digital Signal Processing [13.448630251745163]
We introduce the Differentiable Digital Signal Processing (DDSP) library, which enables direct integration of classic signal processing elements with deep learning methods. We achieve high-fidelity generation without the need for large autoregressive models or adversarial losses. DDSP enables an interpretable and modular approach to generative modeling, without sacrificing the benefits of deep learning.
arXiv Detail & Related papers (2020-01-14T06:49:37Z)

This list is automatically generated from the titles and abstracts of the papers in this site.
