Exploring Quality and Generalizability in Parameterized Neural Audio Effects
- URL: http://arxiv.org/abs/2006.05584v1
- Date: Wed, 10 Jun 2020 00:52:08 GMT
- Title: Exploring Quality and Generalizability in Parameterized Neural Audio Effects
- Authors: William Mitchell, Scott H. Hawley
- Abstract summary: Deep neural networks have shown promise for music audio signal processing applications.
Results to date have tended to be constrained by low sample rates, noise, narrow domains of signal types, and/or lack of parameterized controls.
This work expands on prior published research on modeling nonlinear, time-dependent signal processing effects.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep neural networks have shown promise for music audio signal
processing applications, often surpassing prior approaches, particularly as
end-to-end models in the waveform domain. Yet results to date have tended to be
constrained by low sample rates, noise, narrow domains of signal types, and/or
a lack of parameterized controls (i.e. "knobs"), leaving their suitability for
professional audio engineering workflows in question. This work expands on
prior research on modeling nonlinear, time-dependent signal processing effects
associated with music production by means of a deep neural network, one that
also emulates the parameterized settings found on analog equipment, with the
goal of eventually producing commercially viable, high-quality audio, i.e. a
44.1 kHz sampling rate at 16-bit resolution. The results in this paper
highlight progress in modeling these effects through architecture and
optimization changes aimed at increasing computational efficiency, improving
the signal-to-noise ratio, and extending to a larger variety of nonlinear
audio effects. Toward these ends, the strategies employed followed a
three-pronged approach targeting model speed, model accuracy, and model
generalizability. Most of the presented methods provide marginal or no
increase in output accuracy over the original model, with the exception of
dataset manipulation: limiting the audio content of the dataset, for example
to a single instrument, provided a significant improvement in model accuracy
over models trained on more general datasets.
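The listing carries no code, so as a purely illustrative aid, here is a minimal sketch of what a parameterized ("knob"-controlled) neural audio effect can look like: a small waveform-to-waveform network whose internal features are modulated by normalized knob settings. The class name, layer sizes, and FiLM-style scale-and-shift conditioning are assumptions made for this sketch, not the authors' actual architecture.

```python
import torch
import torch.nn as nn

class KnobConditionedEffect(nn.Module):
    """Sketch of a neural audio effect with parameterized controls ("knobs").

    Hypothetical architecture: a 1-D conv encoder/decoder whose bottleneck
    features are scaled and shifted according to normalized knob settings
    (e.g. threshold, ratio, and attack for a compressor).
    """

    def __init__(self, n_knobs: int = 3, channels: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size=15, stride=2, padding=7), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=15, stride=2, padding=7), nn.ReLU(),
        )
        # Map the knob vector to a per-channel scale and shift (FiLM-style).
        self.knob_proj = nn.Linear(n_knobs, 2 * channels)
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(channels, channels, kernel_size=16, stride=2, padding=7), nn.ReLU(),
            nn.ConvTranspose1d(channels, 1, kernel_size=16, stride=2, padding=7),
        )

    def forward(self, audio: torch.Tensor, knobs: torch.Tensor) -> torch.Tensor:
        # audio: (batch, 1, samples); knobs: (batch, n_knobs), each in [0, 1].
        z = self.encoder(audio)
        scale, shift = self.knob_proj(knobs).chunk(2, dim=-1)
        z = z * scale.unsqueeze(-1) + shift.unsqueeze(-1)  # knob conditioning
        return self.decoder(z)

model = KnobConditionedEffect()
x = torch.randn(4, 1, 44100)   # one second per example at 44.1 kHz
k = torch.rand(4, 3)           # random knob settings
print(model(x, k).shape)       # torch.Size([4, 1, 44100])
```

Conditioning an internal layer on the knob vector lets one trained model cover a continuous range of effect settings, which is what distinguishes a parameterized model from one trained at a single fixed setting.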
Related papers
- Comparative Study of State-based Neural Networks for Virtual Analog Audio Effects Modeling (arXiv, 2024-05-07)
This article explores the application of machine learning advancements for virtual analog modeling.
We compare State-Space models and Linear Recurrent Units against the more common Long Short-Term Memory networks.
- Learning with Noisy Foundation Models (arXiv, 2024-03-11)
This paper is the first work to comprehensively understand and analyze the nature of noise in pre-training datasets.
We propose a tuning method (NMTune) that applies an affine transformation to the feature space to mitigate the malignant effect of noise and improve generalization.
- Diffusion Models for Audio Restoration (arXiv, 2024-02-15)
We present audio restoration algorithms based on diffusion models.
We show that diffusion models can combine the best of both worlds and offer new opportunities for designing audio restoration algorithms.
We explain the diffusion formalism and its application to the conditional generation of clean audio signals.
- From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion (arXiv, 2023-08-02)
Deep generative models can generate high-fidelity audio conditioned on various types of representations.
These models are prone to generating audible artifacts when the conditioning is flawed or imperfect.
We propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality from low-bitrate discrete representations.
- Boosting Fast and High-Quality Speech Synthesis with Linear Diffusion (arXiv, 2023-06-09)
This paper proposes a linear diffusion model (LinDiff) based on an ordinary differential equation to simultaneously reach fast inference and high sample quality.
To reduce computational complexity, LinDiff employs a patch-based processing approach that partitions the input signal into small patches.
Our model can synthesize speech of a quality comparable to that of autoregressive models with faster synthesis speed.
- RAVE: A variational autoencoder for fast and high-quality neural audio synthesis (arXiv, 2021-11-09)
We introduce a Realtime Audio Variational autoEncoder (RAVE) allowing both fast and high-quality audio waveform synthesis.
We show that our model is the first able to generate 48kHz audio signals, while simultaneously running 20 times faster than real-time on a standard laptop CPU.
- A Study on Speech Enhancement Based on Diffusion Probabilistic Model (arXiv, 2021-07-25)
We propose a diffusion probabilistic model-based speech enhancement (DiffuSE) model that aims to recover clean speech signals from noisy signals.
The experimental results show that DiffuSE yields performance that is comparable to related audio generative models on the standardized Voice Bank corpus task.
- WaveGrad: Estimating Gradients for Waveform Generation (arXiv, 2020-09-02)
WaveGrad is a conditional model for waveform generation which estimates gradients of the data density.
It starts from a Gaussian white noise signal and iteratively refines the signal via a gradient-based sampler conditioned on the mel-spectrogram.
We find that it can generate high fidelity audio samples using as few as six iterations (a schematic of this sampling loop appears after this list).
- Real Time Speech Enhancement in the Waveform Domain (arXiv, 2020-06-23)
We present a causal speech enhancement model working on the raw waveform that runs in real-time on a laptop CPU.
The proposed model is based on an encoder-decoder architecture with skip-connections.
It is capable of removing various kinds of background noise, including stationary and non-stationary noises (a sketch of such an encoder-decoder appears after this list).
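As referenced in the WaveGrad entry above, the sampler it describes can be sketched as a generic DDPM-style refinement loop: start from Gaussian white noise and repeatedly subtract predicted noise under a short schedule. The function below is a schematic with a placeholder denoiser, not the official WaveGrad implementation; the real system uses a trained network and carefully tuned noise schedules.

```python
import torch

def wavegrad_style_sample(denoiser, mel, n_samples, betas):
    """Schematic of WaveGrad-style iterative refinement (not the official code).

    denoiser(y, mel, noise_level) is assumed to predict the noise component
    of y; betas is a short noise schedule (e.g. six values, matching the
    entry's observation that as few as six iterations can suffice).
    """
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    y = torch.randn(1, n_samples)  # start from Gaussian white noise
    for t in reversed(range(len(betas))):
        eps = denoiser(y, mel, alpha_bars[t])  # predicted noise component
        # Standard DDPM-style update: remove the predicted noise, rescale.
        y = (y - betas[t] / torch.sqrt(1 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:
            y = y + torch.sqrt(betas[t]) * torch.randn_like(y)
    return y

# Toy usage with a placeholder "denoiser" (a real one is a trained network):
mel = torch.randn(1, 80, 100)          # stand-in mel-spectrogram conditioning
betas = torch.linspace(1e-4, 0.05, 6)  # six-step noise schedule
audio = wavegrad_style_sample(lambda y, m, a: torch.zeros_like(y),
                              mel, n_samples=24000, betas=betas)
print(audio.shape)  # torch.Size([1, 24000])
```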
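Likewise, for the real-time waveform-domain enhancement entry, the sketch below shows the general shape of a causal encoder-decoder with skip connections. Channel counts, kernel sizes, and strides are invented for brevity, and the published model includes further components (such as a recurrent bottleneck) that are omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalEnhancer(nn.Module):
    """Sketch of a causal encoder-decoder denoiser with skip connections.

    Hypothetical sizes; not the paper's exact architecture.
    """

    def __init__(self):
        super().__init__()
        self.enc1 = nn.Conv1d(1, 16, kernel_size=8, stride=4)
        self.enc2 = nn.Conv1d(16, 32, kernel_size=8, stride=4)
        self.dec2 = nn.ConvTranspose1d(32, 16, kernel_size=8, stride=4)
        self.dec1 = nn.ConvTranspose1d(16, 1, kernel_size=8, stride=4)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Left-pad before each downsampling conv so no layer sees future samples.
        s1 = torch.relu(self.enc1(F.pad(x, (4, 0))))
        s2 = torch.relu(self.enc2(F.pad(s1, (4, 0))))
        # Upsample and add the matching encoder activations (skip connections).
        y = torch.relu(self.dec2(s2)[..., : s1.shape[-1]]) + s1
        return self.dec1(y)[..., : x.shape[-1]]

model = CausalEnhancer()
noisy = torch.randn(1, 1, 16000)   # one second of audio at 16 kHz
clean_est = model(noisy)           # denoised estimate, same length as input
print(clean_est.shape)             # torch.Size([1, 1, 16000])
```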