NU-GAN: High resolution neural upsampling with GAN
- URL: http://arxiv.org/abs/2010.11362v1
- Date: Thu, 22 Oct 2020 01:00:23 GMT
- Title: NU-GAN: High resolution neural upsampling with GAN
- Authors: Rithesh Kumar, Kundan Kumar, Vicki Anand, Yoshua Bengio, Aaron
Courville
- Abstract summary: NU-GAN is a new method for resampling audio from lower to higher sampling rates (upsampling).
Such applications use audio at a resolution of 44.1 kHz or 48 kHz, whereas current speech synthesis methods are equipped to handle a maximum of 24 kHz resolution.
ABX preference tests indicate that the NU-GAN resampler can resample 22 kHz audio to 44.1 kHz audio that listeners distinguish from the original at a rate only 7.4% above random chance on a single-speaker dataset and 10.8% above chance on a multi-speaker dataset.
- Score: 60.02736450639215
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose NU-GAN, a new method for resampling audio from
lower to higher sampling rates (upsampling). Audio upsampling is an important
problem since productionizing generative speech technology requires operating
at high sampling rates. Such applications use audio at a resolution of 44.1 kHz
or 48 kHz, whereas current speech synthesis methods are equipped to handle a
maximum of 24 kHz resolution. NU-GAN takes a leap towards solving audio
upsampling as a separate component in the text-to-speech (TTS) pipeline by
leveraging techniques for audio generation using GANs. ABX preference tests
indicate that our NU-GAN resampler can resample 22 kHz audio to 44.1 kHz audio
that listeners distinguish from the original at a rate only 7.4% above random
chance on a single-speaker dataset and 10.8% above chance on a multi-speaker
dataset.
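To make the task and the reported metric concrete, below is a minimal sketch (not the authors' code; the function names and trial counts are hypothetical, and the common 22.05 kHz rate is assumed for the abstract's "22 kHz"). A classical polyphase resampler is the baseline a neural upsampler must beat, since it cannot recover energy above the input's Nyquist limit, and ABX accuracy is reported in points above 50% chance.

```python
import numpy as np
from scipy.signal import resample_poly

def classical_upsample(x_22k: np.ndarray) -> np.ndarray:
    """Polyphase-resample 22.05 kHz audio to 44.1 kHz.

    This baseline recovers no energy above the input's 11.025 kHz
    Nyquist limit; a neural upsampler such as NU-GAN has to generate
    the missing high band instead.
    """
    return resample_poly(x_22k, up=2, down=1)

def abx_above_chance(correct: int, trials: int) -> float:
    """ABX accuracy expressed in percentage points above 50% chance."""
    return 100.0 * correct / trials - 50.0

# Hypothetical trial counts: 574 correct out of 1000 would reproduce the
# paper's single-speaker figure of 7.4% above chance.
print(round(abx_above_chance(574, 1000), 1))  # -> 7.4
```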
Related papers
- Effects of Dataset Sampling Rate for Noise Cancellation through Deep Learning [1.024113475677323]
This research explores the use of deep neural networks (DNNs) as a superior alternative to traditional noise cancellation techniques.
The Conv-TasNet network was trained on datasets such as WHAM!, LibriMix, and the MS-2023 DNS Challenge.
Models trained at a higher sampling rate (48 kHz) scored substantially better on Total Harmonic Distortion (THD) and Quality Prediction for Generative Neural Speech Codecs (WARP-Q) metrics.
arXiv Detail & Related papers (2024-05-30T16:20:44Z)
- AudioSR: Versatile Audio Super-resolution at Scale [32.36683443201372]
We introduce a diffusion-based generative model, AudioSR, that is capable of performing robust audio super-resolution on versatile audio types.
Specifically, AudioSR can upsample any input signal with a bandwidth between 2 kHz and 16 kHz to a high-resolution signal with 24 kHz bandwidth (48 kHz sampling rate).
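One common way to construct such band-limited inputs when training or evaluating audio super-resolution models is to lowpass-filter full-band audio at the desired cutoff. The sketch below illustrates that general recipe, not necessarily AudioSR's exact preprocessing:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def simulate_bandlimited(x: np.ndarray, fs: int, cutoff_hz: float) -> np.ndarray:
    """Lowpass full-band audio to mimic a recording with limited bandwidth."""
    sos = butter(8, cutoff_hz, btype="low", fs=fs, output="sos")
    return sosfiltfilt(sos, x)

# e.g. mimic a 4 kHz-bandwidth input from 48 kHz full-band audio:
x = np.random.randn(48000)  # stand-in for one second of audio
x_lowband = simulate_bandlimited(x, fs=48000, cutoff_hz=4000.0)
```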
arXiv Detail & Related papers (2023-09-13T21:00:09Z)
- NU-Wave 2: A General Neural Audio Upsampling Model for Various Sampling Rates [0.0]
We introduce NU-Wave 2, a diffusion model for neural audio upsampling.
It generates 48 kHz audio signals from inputs of various sampling rates with a single model.
We experimentally demonstrate that NU-Wave 2 produces high-resolution audio regardless of the sampling rate of the input.
arXiv Detail & Related papers (2022-06-17T04:40:14Z)
- Neural Vocoder is All You Need for Speech Super-resolution [56.84715616516612]
Speech super-resolution (SR) is a task to increase speech sampling rate by generating high-frequency components.
Existing speech SR methods are trained in constrained experimental settings, such as a fixed upsampling ratio.
We propose a neural vocoder based speech super-resolution method (NVSR) that can handle a variety of input resolutions and upsampling ratios.
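A hedged sketch of this vocoder-based recipe follows, with stand-ins for the learned parts: plain resampling replaces NVSR's mel-restoration network, and Griffin-Lim inversion replaces the neural vocoder.

```python
import numpy as np
import librosa

def vocoder_based_sr(x_lr: np.ndarray, fs_lr: int, fs_hr: int = 48000) -> np.ndarray:
    # 1. Bring the waveform to the target rate (adds no high-band energy).
    x_up = librosa.resample(x_lr, orig_sr=fs_lr, target_sr=fs_hr)
    # 2. In NVSR a learned network restores the missing high-band mel
    #    content; the plain mel of the interpolated signal stands in here.
    mel = librosa.feature.melspectrogram(y=x_up, sr=fs_hr, n_mels=128)
    # 3. A neural vocoder synthesizes the waveform from the restored mel;
    #    Griffin-Lim inversion is a crude stand-in for it.
    return librosa.feature.inverse.mel_to_audio(mel, sr=fs_hr)
```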
arXiv Detail & Related papers (2022-03-28T17:51:00Z)
- NeuralDPS: Neural Deterministic Plus Stochastic Model with Multiband Excitation for Noise-Controllable Waveform Generation [67.96138567288197]
We propose a novel neural vocoder named NeuralDPS that retains high speech quality while achieving high synthesis efficiency and noise controllability.
It generates waveforms at least 280 times faster than the WaveNet vocoder.
It is also 28% faster than WaveGAN on a single core.
arXiv Detail & Related papers (2022-03-05T08:15:29Z)
- Sampling-Frequency-Independent Audio Source Separation Using Convolution Layer Based on Impulse Invariant Method [67.24600975813419]
We propose a convolution layer capable of handling arbitrary sampling frequencies with a single deep neural network.
We show that the introduction of the proposed layer enables a conventional audio source separation model to work consistently even at sampling frequencies unseen during training.
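A minimal sketch of the impulse-invariant idea (the damped-sinusoid prototype below is an illustrative assumption, not the paper's parameterization): keep continuous-time filter parameters, and materialize a discrete convolution kernel by sampling the analog impulse response at whatever rate the input arrives with.

```python
import numpy as np

def analog_impulse_response(t: np.ndarray, decay: float = 2000.0,
                            freq_hz: float = 4000.0) -> np.ndarray:
    """Continuous-time prototype h(t) = exp(-a*t) * cos(2*pi*f*t), t >= 0."""
    return np.exp(-decay * t) * np.cos(2 * np.pi * freq_hz * t)

def kernel_for_rate(fs: int, taps: int = 64) -> np.ndarray:
    """Impulse invariant method: sample h(t) at t = n/fs, scaled by 1/fs."""
    t = np.arange(taps) / fs
    return analog_impulse_response(t) / fs

# The same analog parameters yield matched kernels at any sampling rate:
kernel_16k = kernel_for_rate(16000)
kernel_48k = kernel_for_rate(48000)
```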
arXiv Detail & Related papers (2021-05-10T02:33:42Z)
- NU-Wave: A Diffusion Probabilistic Model for Neural Audio Upsampling [0.0]
NU-Wave is the first neural audio upsampling model to produce 48 kHz waveforms from coarse 16 kHz or 24 kHz inputs.
NU-Wave generates high-quality audio that achieves high performance in terms of signal-to-noise ratio (SNR), log-spectral distance (LSD), and accuracy of the ABX test.
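For reference, a sketch of the log-spectral distance (LSD) metric mentioned above; exact definitions vary slightly across papers, and the frame and FFT settings here are assumptions.

```python
import numpy as np
from scipy.signal import stft

def log_spectral_distance(reference: np.ndarray, estimate: np.ndarray,
                          fs: int = 48000, eps: float = 1e-10) -> float:
    """RMS log-power-spectrum error per frame, averaged over frames."""
    _, _, s_ref = stft(reference, fs=fs, nperseg=2048)
    _, _, s_est = stft(estimate, fs=fs, nperseg=2048)
    log_ref = np.log10(np.abs(s_ref) ** 2 + eps)
    log_est = np.log10(np.abs(s_est) ** 2 + eps)
    return float(np.mean(np.sqrt(np.mean((log_ref - log_est) ** 2, axis=0))))
```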
arXiv Detail & Related papers (2021-04-06T06:52:53Z)
- Pretraining Strategies, Waveform Model Choice, and Acoustic Configurations for Multi-Speaker End-to-End Speech Synthesis [47.30453049606897]
We find that fine-tuning a multi-speaker model pretrained on found audiobook data can improve the naturalness and speaker similarity of synthetic speech for unseen target speakers.
We also find that listeners can discern between 16 kHz and 24 kHz sampling rates, and that WaveRNN produces output waveforms of a quality comparable to WaveNet.
arXiv Detail & Related papers (2020-11-10T00:19:04Z)
- End-to-End Multi-speaker Speech Recognition with Transformer [88.22355110349933]
We replace the RNN-based encoder-decoder in the speech recognition model with a Transformer architecture.
We also restrict the self-attention component to a local segment rather than the whole sequence in order to reduce computation; a minimal sketch of such a segment mask follows this list.
arXiv Detail & Related papers (2020-02-10T16:29:26Z)
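As referenced in the entry above, one straightforward way to restrict self-attention to a local segment (the paper's exact windowing scheme may differ) is a boolean band mask over the frame indices:

```python
import numpy as np

def segment_attention_mask(seq_len: int, segment: int) -> np.ndarray:
    """Boolean mask letting position i attend only to positions j with |i - j| <= segment."""
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= segment

# Each row marks the frames a given frame may attend to:
mask = segment_attention_mask(seq_len=8, segment=2)
```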