NU-Wave 2: A General Neural Audio Upsampling Model for Various Sampling Rates
- URL: http://arxiv.org/abs/2206.08545v1
- Date: Fri, 17 Jun 2022 04:40:14 GMT
- Title: NU-Wave 2: A General Neural Audio Upsampling Model for Various Sampling Rates
- Authors: Seungu Han, Junhyeok Lee
- Abstract summary: We introduce NU-Wave 2, a diffusion model for neural audio upsampling.
It generates 48 kHz audio signals from inputs of various sampling rates with a single model.
We experimentally demonstrate that NU-Wave 2 produces high-resolution audio regardless of the input sampling rate.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Conventionally, audio super-resolution models fixed the initial and the
target sampling rates, which necessitates training a separate model for each pair
of sampling rates. We introduce NU-Wave 2, a diffusion model for neural audio
upsampling that enables the generation of 48 kHz audio signals from inputs of
various sampling rates with a single model. Based on the architecture of
NU-Wave, NU-Wave 2 uses short-time Fourier convolution (STFC) to generate
harmonics to resolve the main failure modes of NU-Wave, and incorporates
bandwidth spectral feature transform (BSFT) to condition the bandwidths of
inputs in the frequency domain. We experimentally demonstrate that NU-Wave 2
produces high-resolution audio regardless of the sampling rate of the input while
requiring fewer parameters than other models. The official code and the audio
samples are available at https://mindslab-ai.github.io/nuwave2.
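As a reading aid, here is a minimal PyTorch sketch of how the two components named in the abstract could fit together. The layer sizes, the per-bin band mask used as the bandwidth condition, and the placement of BSFT inside the STFC block are illustrative assumptions, not the authors' implementation (see the official code linked above for that).
```python
# Minimal sketch of the two components named in the abstract. All sizes,
# the band-mask encoding, and the placement of BSFT are assumptions.
import torch
import torch.nn as nn


class BSFT(nn.Module):
    """Bandwidth spectral feature transform: FiLM-style scale and shift in
    the frequency domain, conditioned on a band mask (assumed encoding)."""

    def __init__(self, channels: int):
        super().__init__()
        self.to_scale = nn.Conv2d(1, channels, kernel_size=1)
        self.to_shift = nn.Conv2d(1, channels, kernel_size=1)

    def forward(self, spec, band):
        # spec: (B, 2C, F, T) stacked real/imag features
        # band: (B, 1, F, 1) mask, 1 where the input has energy
        return spec * self.to_scale(band) + self.to_shift(band)


class STFC(nn.Module):
    """Short-time Fourier convolution: STFT -> BSFT + 1x1 conv -> iSTFT,
    so a single layer sees (and can fill) the whole frequency axis."""

    def __init__(self, channels: int, n_fft: int = 1024, hop: int = 256):
        super().__init__()
        self.n_fft, self.hop = n_fft, hop
        self.bsft = BSFT(2 * channels)
        self.spec_conv = nn.Conv2d(2 * channels, 2 * channels, kernel_size=1)

    def forward(self, x, band):
        B, C, L = x.shape                    # x: features treated as C signals
        win = torch.hann_window(self.n_fft, device=x.device)
        spec = torch.stft(x.reshape(B * C, L), self.n_fft, self.hop,
                          window=win, return_complex=True)    # (B*C, F, T)
        F_, T_ = spec.shape[-2:]
        feat = torch.view_as_real(spec).permute(0, 3, 1, 2)   # (B*C, 2, F, T)
        feat = feat.reshape(B, 2 * C, F_, T_)
        feat = self.spec_conv(self.bsft(feat, band))          # condition, mix
        feat = feat.reshape(B * C, 2, F_, T_).permute(0, 2, 3, 1).contiguous()
        out = torch.istft(torch.view_as_complex(feat), self.n_fft, self.hop,
                          window=win, length=L)
        return out.reshape(B, C, L) + x                       # residual


x = torch.randn(2, 4, 16384)                 # (batch, channels, samples)
band = torch.zeros(2, 1, 1024 // 2 + 1, 1)
band[:, :, :171] = 1.0                       # e.g. input band-limited to ~8 kHz at fs=48 kHz
print(STFC(channels=4)(x, band).shape)       # torch.Size([2, 4, 16384])
```
Operating in the STFT domain gives even a single layer a receptive field spanning the full frequency axis, which is the property the abstract credits for generating harmonics above the input band.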
Related papers
- Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching [51.70360630470263]
Video-to-audio (V2A) generation aims to synthesize content-matching audio from silent video.
We propose Frieren, a V2A model based on rectified flow matching (sketched below).
Experiments indicate that Frieren achieves state-of-the-art performance in both generation quality and temporal alignment.
arXiv Detail & Related papers (2024-06-01T06:40:22Z)
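For context on the method named above: rectified flow matching learns a velocity field whose noise-to-data paths are nearly straight, so sampling reduces to a few Euler steps of an ODE. A generic sketch follows; the `velocity(x, t, video_feat)` network and the step count are hypothetical stand-ins, not Frieren's actual interface.
```python
import torch


@torch.no_grad()
def rectified_flow_sample(velocity, video_feat, shape, num_steps=25):
    """Generic rectified-flow sampler: integrate dx/dt = v(x, t, cond)
    from t=0 (Gaussian noise) to t=1 (data) with fixed-step Euler.
    `velocity` is a hypothetical trained network, not Frieren's API."""
    x = torch.randn(shape)                       # start at pure noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full(shape[:1], i * dt)        # current time per batch item
        x = x + velocity(x, t, video_feat) * dt  # one Euler step along the flow
    return x
```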
- From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion [84.138804145918]
Deep generative models can generate high-fidelity audio conditioned on various types of representations.
These models, however, are prone to generating audible artifacts when the conditioning is flawed or imperfect.
We propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality from low-bitrate discrete representations.
arXiv Detail & Related papers (2023-08-02T22:14:29Z)
- WaveFit: An Iterative and Non-autoregressive Neural Vocoder based on Fixed-Point Iteration [47.07494621683752]
This study proposes a fast and high-quality neural vocoder called WaveFit.
WaveFit integrates the essence of GANs into a DDPM-like iterative framework based on fixed-point iteration (sketched below).
Subjective listening tests found no statistically significant difference in naturalness between natural human speech and speech synthesized by WaveFit with five iterations.
arXiv Detail & Related papers (2022-10-03T15:45:05Z)
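A rough sketch of the fixed-point framing mentioned above: the same learned refinement map is applied repeatedly to an initial noise signal, with clean speech as the intended fixed point. The `denoiser(y, mel)` signature and the gain-normalization step are assumptions, not WaveFit's published definition.
```python
import torch


@torch.no_grad()
def fixed_point_vocode(denoiser, mel, num_samples, iters=5):
    """Apply one learned refinement map repeatedly, starting from noise.
    `denoiser(y, mel)` is a hypothetical stand-in for the trained model."""
    y = torch.randn(num_samples)            # initial estimate: white noise
    for _ in range(iters):                  # five iterations per the summary
        y = denoiser(y, mel)                # seek y* with denoiser(y*, mel) = y*
        y = y / (y.abs().max() + 1e-9)      # crude gain normalization (assumption)
    return y
```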
- NeuralDPS: Neural Deterministic Plus Stochastic Model with Multiband Excitation for Noise-Controllable Waveform Generation [67.96138567288197]
We propose a novel neural vocoder named NeuralDPS which can retain high speech quality and acquire high synthesis efficiency and noise controllability.
It generates waveforms at least 280 times faster than the WaveNet vocoder.
Its synthesis is also 28% faster than WaveGAN's on a single core.
arXiv Detail & Related papers (2022-03-05T08:15:29Z)
- RAVE: A variational autoencoder for fast and high-quality neural audio synthesis [2.28438857884398]
We introduce a Realtime Audio Variational autoEncoder (RAVE) allowing both fast and high-quality audio waveform synthesis.
We show that our model is the first able to generate 48 kHz audio signals while running 20 times faster than real time on a standard laptop CPU.
arXiv Detail & Related papers (2021-11-09T09:07:30Z)
- WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis [80.60577805727624]
WaveGrad 2 is a non-autoregressive generative model for text-to-speech synthesis.
It can generate high fidelity audio, approaching the performance of a state-of-the-art neural TTS system.
arXiv Detail & Related papers (2021-06-17T17:09:21Z)
- Sampling-Frequency-Independent Audio Source Separation Using Convolution Layer Based on Impulse Invariant Method [67.24600975813419]
We propose a convolution layer capable of handling arbitrary sampling frequencies with a single deep neural network (the idea is sketched below).
We show that introducing the proposed layer enables a conventional audio source separation model to work consistently even with unseen sampling frequencies.
arXiv Detail & Related papers (2021-05-10T02:33:42Z)
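The impulse invariant method obtains a discrete-time filter by sampling a continuous-time impulse response at whatever rate is in use, which is what makes a single layer sampling-frequency-independent. Below is a minimal sketch under an assumed parametrization (a small bank of damped sinusoids as the analog prototype); it is not the authors' exact layer.
```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class ImpulseInvariantConv(nn.Module):
    """Learn a continuous-time impulse response, then sample it at the
    current rate fs to get the discrete kernel (impulse invariant method).
    The damped-sinusoid parametrization is an illustrative assumption."""

    def __init__(self, num_modes: int = 8, kernel_ms: float = 10.0):
        super().__init__()
        self.kernel_ms = kernel_ms                                 # kernel span in ms
        self.amp = nn.Parameter(torch.randn(num_modes) * 0.1)
        self.decay = nn.Parameter(torch.rand(num_modes) * 100.0)  # 1/s
        self.freq = nn.Parameter(torch.rand(num_modes) * 4000.0)  # Hz

    def forward(self, x, fs: int):
        # x: (B, 1, L) waveform sampled at rate fs
        taps = int(self.kernel_ms * fs / 1000.0)
        t = torch.arange(taps, device=x.device, dtype=torch.float32) / fs
        # h(t) = sum_k a_k * exp(-d_k t) * cos(2*pi*f_k t), sampled at t = n/fs
        h = (self.amp[:, None]
             * torch.exp(-F.softplus(self.decay)[:, None] * t)
             * torch.cos(2 * math.pi * self.freq[:, None] * t)).sum(0)
        kernel = (h / fs).flip(0).view(1, 1, -1)           # scale by T = 1/fs
        return F.conv1d(F.pad(x, (taps - 1, 0)), kernel)   # causal convolution
```
Because the learnable parameters live in continuous time, the same layer yields consistent filters at 16 kHz, 24 kHz, or 48 kHz without retraining.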
- NU-Wave: A Diffusion Probabilistic Model for Neural Audio Upsampling [0.0]
NU-Wave is the first neural audio upsampling model to produce 48 kHz waveforms from coarse 16 kHz or 24 kHz inputs.
NU-Wave generates high-quality audio, performing well on signal-to-noise ratio (SNR), log-spectral distance (LSD, sketched below), and ABX test accuracy.
arXiv Detail & Related papers (2021-04-06T06:52:53Z)
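Of the metrics listed above, log-spectral distance is easy to state: the RMS difference between log power spectra, averaged over frames. A minimal sketch of a common formulation (the STFT settings are assumptions; exact values vary between papers):
```python
import torch


def log_spectral_distance(ref, est, n_fft: int = 2048, hop: int = 512):
    """LSD between two 1-D waveforms: RMS difference of log10 power
    spectra per frame, averaged over frames."""
    win = torch.hann_window(n_fft)

    def logspec(x):
        s = torch.stft(x, n_fft, hop, window=win, return_complex=True)
        return torch.log10(s.abs().clamp(min=1e-8) ** 2)

    d = (logspec(ref) - logspec(est)) ** 2  # (F, T)
    return d.mean(dim=0).sqrt().mean()      # RMS over freq, mean over frames
```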
- WaveGrad: Estimating Gradients for Waveform Generation [55.405580817560754]
WaveGrad is a conditional model for waveform generation which estimates gradients of the data density.
It starts from a Gaussian white noise signal and iteratively refines the signal via a gradient-based sampler conditioned on the mel-spectrogram.
We find that it can generate high-fidelity audio samples using as few as six iterations (see the sampler sketch below).
arXiv Detail & Related papers (2020-09-02T17:44:10Z)
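The refinement loop described above is the standard denoising-diffusion update, so a generic sketch captures it. The six-entry noise schedule and the `model(y, mel, noise_level)` signature are illustrative assumptions; WaveGrad searches for its short schedules rather than hand-picking them.
```python
import torch


@torch.no_grad()
def wavegrad_style_sample(model, mel, num_samples, betas=None):
    """Iterative refinement from white noise, conditioned on a
    mel-spectrogram, using the standard DDPM-style update."""
    if betas is None:  # illustrative 6-step schedule, not a searched one
        betas = torch.tensor([1e-4, 1e-3, 1e-2, 5e-2, 2e-1, 5e-1])
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)

    y = torch.randn(num_samples)                  # start from white noise
    for t in reversed(range(len(betas))):
        eps = model(y, mel, alpha_bar[t].sqrt())  # predicted noise component
        # Posterior mean: remove the predicted noise, rescale.
        y = (y - betas[t] / (1 - alpha_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:                                 # re-inject noise except at the end
            y = y + betas[t].sqrt() * torch.randn_like(y)
    return y
```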
This list is automatically generated from the titles and abstracts of the papers on this site.