NU-Wave: A Diffusion Probabilistic Model for Neural Audio Upsampling
- URL: http://arxiv.org/abs/2104.02321v1
- Date: Tue, 6 Apr 2021 06:52:53 GMT
- Title: NU-Wave: A Diffusion Probabilistic Model for Neural Audio Upsampling
- Authors: Junhyeok Lee and Seungu Han
- Abstract summary: NU-Wave is the first neural audio upsampling model to produce waveforms of sampling rate 48kHz from coarse 16kHz or 24kHz inputs.
NU-Wave generates high-quality audio that achieves high performance in terms of signal-to-noise ratio (SNR), log-spectral distance (LSD), and accuracy of the ABX test.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we introduce NU-Wave, the first neural audio upsampling model
to produce waveforms of sampling rate 48kHz from coarse 16kHz or 24kHz inputs,
while prior works could generate only up to 16kHz. NU-Wave is the first
diffusion probabilistic model for audio super-resolution which is engineered
based on neural vocoders. NU-Wave generates high-quality audio that achieves
high performance in terms of signal-to-noise ratio (SNR), log-spectral distance
(LSD), and accuracy of the ABX test. In all cases, NU-Wave outperforms the
baseline models despite its substantially smaller model capacity: 3.0M
parameters, only 5.4-21% of the baselines'. The audio samples of our model are
available at https://mindslab-ai.github.io/nuwave, and the code will be made
available soon.
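The abstract evaluates upsampling quality with signal-to-noise ratio (SNR) and log-spectral distance (LSD). As a rough sketch of how these two metrics are commonly computed (the FFT size, hop length, and window below are illustrative assumptions, not the paper's configuration):

```python
import numpy as np

def snr_db(reference, estimate):
    """Signal-to-noise ratio in dB between a reference waveform and an estimate."""
    noise = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2))

def lsd(reference, estimate, n_fft=2048, hop=512):
    """Log-spectral distance: RMS difference of log power spectra, averaged over frames.

    n_fft and hop are assumed values; papers differ in their STFT settings.
    """
    def log_power_spec(x):
        # Frame the signal, window each frame, and take the magnitude spectrum.
        frames = [x[i:i + n_fft] for i in range(0, len(x) - n_fft + 1, hop)]
        spec = np.abs(np.fft.rfft(np.stack(frames) * np.hanning(n_fft), axis=1))
        return np.log10(spec ** 2 + 1e-10)

    ref_spec, est_spec = log_power_spec(reference), log_power_spec(estimate)
    return float(np.mean(np.sqrt(np.mean((ref_spec - est_spec) ** 2, axis=1))))
```

Higher SNR and lower LSD indicate an estimate closer to the reference; LSD is computed in the log-spectral domain, so it is more sensitive to errors in the reconstructed high-frequency band than SNR.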
Related papers
- From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion [84.138804145918]
Deep generative models can generate high-fidelity audio conditioned on various types of representations.
These models are prone to generate audible artifacts when the conditioning is flawed or imperfect.
We propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality from low-bitrate discrete representations.
arXiv Detail & Related papers (2023-08-02T22:14:29Z)
- WaveFit: An Iterative and Non-autoregressive Neural Vocoder based on Fixed-Point Iteration [47.07494621683752]
This study proposes a fast and high-quality neural vocoder called WaveFit.
WaveFit integrates the essence of GANs into a DDPM-like iterative framework based on fixed-point iteration.
Subjective listening tests showed no statistically significant differences in naturalness between human natural speech and those synthesized by WaveFit with five iterations.
arXiv Detail & Related papers (2022-10-03T15:45:05Z)
- NU-Wave 2: A General Neural Audio Upsampling Model for Various Sampling Rates [0.0]
We introduce NU-Wave 2, a diffusion model for neural audio upsampling.
It generates 48 kHz audio signals from inputs of various sampling rates with a single model.
We experimentally demonstrate that NU-Wave 2 produces high-resolution audio regardless of the sampling rate of input.
arXiv Detail & Related papers (2022-06-17T04:40:14Z)
- SpecGrad: Diffusion Probabilistic Model based Neural Vocoder with Adaptive Noise Spectral Shaping [51.698273019061645]
SpecGrad adapts the diffusion noise so that its time-varying spectral envelope becomes close to the conditioning log-mel spectrogram.
It is processed in the time-frequency domain to keep the computational cost almost the same as the conventional DDPM-based neural vocoders.
arXiv Detail & Related papers (2022-03-31T02:08:27Z)
- NeuralDPS: Neural Deterministic Plus Stochastic Model with Multiband Excitation for Noise-Controllable Waveform Generation [67.96138567288197]
We propose NeuralDPS, a novel neural vocoder that retains high speech quality while achieving high synthesis efficiency and noise controllability.
It generates waveforms at least 280 times faster than the WaveNet vocoder.
It is also 28% faster than WaveGAN on a single CPU core.
arXiv Detail & Related papers (2022-03-05T08:15:29Z)
- NU-GAN: High resolution neural upsampling with GAN [60.02736450639215]
NU-GAN is a new method for resampling audio from lower to higher sampling rates (upsampling)
Such applications use audio at a resolution of 44.1 kHz or 48 kHz, whereas current speech synthesis methods are equipped to handle a maximum of 24 kHz resolution.
ABX preference tests indicate that NU-GAN's 22 kHz to 44.1 kHz resampled audio is distinguishable from the original at a rate only 7.4% above random chance on a single-speaker dataset, and 10.8% above chance on a multi-speaker dataset.
arXiv Detail & Related papers (2020-10-22T01:00:23Z)
- DiffWave: A Versatile Diffusion Model for Audio Synthesis [35.406438835268816]
DiffWave is a versatile diffusion probabilistic model for conditional and unconditional waveform generation.
It produces high-fidelity audio in different waveform generation tasks, including neural vocoding conditioned on mel spectrograms.
It significantly outperforms autoregressive and GAN-based waveform models in the challenging unconditional generation task.
arXiv Detail & Related papers (2020-09-21T11:20:38Z)
- WaveGrad: Estimating Gradients for Waveform Generation [55.405580817560754]
WaveGrad is a conditional model for waveform generation which estimates gradients of the data density.
It starts from a Gaussian white noise signal and iteratively refines the signal via a gradient-based sampler conditioned on the mel-spectrogram.
We find that it can generate high fidelity audio samples using as few as six iterations.
arXiv Detail & Related papers (2020-09-02T17:44:10Z)
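The iterative refinement shared by NU-Wave, DiffWave, and WaveGrad, starting from Gaussian white noise and repeatedly denoising under a conditioning signal, can be sketched as a standard DDPM reverse loop. The noise schedule, step count, and the toy `toy_eps_model` below are illustrative assumptions; the published models use trained neural networks and their own schedules.

```python
import numpy as np

def reverse_diffusion(eps_model, cond, n_steps=6, length=16000, seed=0):
    """Toy DDPM-style sampler: start from Gaussian noise and iteratively denoise.

    eps_model(x, cond, t) is assumed to predict the noise component of x.
    The linear beta schedule below is illustrative, not a published one.
    """
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.05, n_steps)   # noise schedule (assumed)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    x = rng.normal(size=length)                # start from white noise
    for t in reversed(range(n_steps)):
        eps = eps_model(x, cond, t)            # predicted noise at step t
        # Standard DDPM posterior-mean update for the reverse step.
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:                              # inject noise except at the final step
            x = x + np.sqrt(betas[t]) * rng.normal(size=length)
    return x

# Placeholder "model" (hypothetical): treats everything except the
# conditioning signal as noise, pulling the sample toward it.
def toy_eps_model(x, cond, t):
    return x - cond
```

In WaveGrad the conditioning is a mel spectrogram, while in NU-Wave it is the low-sample-rate waveform; WaveGrad's reported six-iteration result corresponds to a short loop like the one above with a trained network in place of the toy model.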
This list is automatically generated from the titles and abstracts of the papers in this site.