Universal Neural Vocoding with Parallel WaveNet
- URL: http://arxiv.org/abs/2102.01106v1
- Date: Mon, 1 Feb 2021 19:03:27 GMT
- Title: Universal Neural Vocoding with Parallel WaveNet
- Authors: Yunlong Jiao, Adam Gabrys, Georgi Tinchev, Bartosz Putrycz, Daniel
Korzekwa, Viacheslav Klimkov
- Abstract summary: We present a universal neural vocoder based on Parallel WaveNet, with an additional conditioning network called Audio Encoder.
Our universal vocoder offers real-time high-quality speech synthesis on a wide range of use cases.
- Score: 8.6698425961311
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a universal neural vocoder based on Parallel WaveNet, with an
additional conditioning network called Audio Encoder. Our universal vocoder
offers real-time high-quality speech synthesis on a wide range of use cases. We
tested it on 43 internal speakers of diverse age and gender, speaking 20
languages in 17 unique styles, of which 7 voices and 5 styles were not exposed
during training. We show that the proposed universal vocoder significantly
outperforms speaker-dependent vocoders overall. We also show that the proposed
vocoder outperforms several existing neural vocoder architectures in terms of
naturalness and universality. These findings are consistent when we further
test on more than 300 open-source voices.
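Vocoders like the one above generate audio sample by sample while conditioning on frame-level acoustic features (e.g. mel spectrograms), so the conditioning must be upsampled from frame rate to sample rate. A minimal numpy sketch of nearest-neighbor upsampling; the hop length of 256 and the 80 mel bins are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

def upsample_conditioning(mel: np.ndarray, hop_length: int) -> np.ndarray:
    # Repeat each mel frame hop_length times so one conditioning vector
    # aligns with every audio sample: (n_frames, n_mels) -> (n_frames * hop_length, n_mels)
    return np.repeat(mel, hop_length, axis=0)

mel = np.random.randn(10, 80)           # 10 frames, 80 mel bins (illustrative)
cond = upsample_conditioning(mel, 256)  # assume 256 audio samples per frame
print(cond.shape)                       # (2560, 80)
```

Real systems often replace the repeat with learned transposed convolutions, but the alignment idea is the same.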
Related papers
- How to Label Resynthesized Audio: The Dual Role of Neural Audio Codecs in Audio Deepfake Detection [60.88800374832363]
Recent spoof detection studies use resynthesized waveforms from vocoders and neural audio codecs to simulate an attacker.
We examine how different labeling choices affect detection performance and provide insights into labeling strategies.
arXiv Detail & Related papers (2026-02-18T10:29:07Z)
- High-Fidelity Music Vocoder using Neural Audio Codecs [18.95453617434051]
DisCoder is a neural vocoder that reconstructs high-fidelity 44.1 kHz audio from mel spectrograms.
DisCoder achieves state-of-the-art performance in music synthesis on several objective metrics and in a MUSHRA listening study.
Our approach also shows competitive performance in speech synthesis, highlighting its potential as a universal vocoder.
arXiv Detail & Related papers (2025-02-18T11:25:46Z)
- A Closer Look at Neural Codec Resynthesis: Bridging the Gap between Codec and Waveform Generation [65.05719674893999]
We study two different strategies based on token prediction and regression, and introduce a new method based on Schrödinger Bridge.
We examine how different design choices affect machine and human perception.
arXiv Detail & Related papers (2024-10-29T18:29:39Z)
- CoLM-DSR: Leveraging Neural Codec Language Modeling for Multi-Modal Dysarthric Speech Reconstruction [61.067153685104394]
Dysarthric speech reconstruction (DSR) aims to transform dysarthric speech into normal speech.
It still suffers from low speaker similarity and poor prosody naturalness.
We propose a multi-modal DSR model by leveraging neural language modeling to improve the reconstruction results.
arXiv Detail & Related papers (2024-06-12T15:42:21Z)
- Diffsound: Discrete Diffusion Model for Text-to-sound Generation [78.4128796899781]
We propose a novel text-to-sound generation framework that consists of a text encoder, a Vector Quantized Variational Autoencoder (VQ-VAE), a decoder, and a vocoder.
The framework first uses the decoder to transfer the text features extracted from the text encoder to a mel-spectrogram with the help of VQ-VAE, and then the vocoder is used to transform the generated mel-spectrogram into a waveform.
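The staged pipeline described above (text features → mel spectrogram → waveform) can be sketched with shape-only stand-ins. All function names, dimensions, the frames-per-token ratio, and the hop size below are hypothetical placeholders, not the paper's implementation:

```python
import numpy as np

def text_encoder(tokens):
    # tokens -> one 512-dim text feature per token (placeholder)
    return np.random.randn(len(tokens), 512)

def mel_decoder(text_feats):
    # text features -> mel spectrogram; assume 4 frames per token, 80 mel bins
    return np.random.randn(text_feats.shape[0] * 4, 80)

def vocoder(mel):
    # mel spectrogram -> waveform; assume 256 samples per mel frame
    return np.random.randn(mel.shape[0] * 256)

tokens = [7, 12, 3]
wav = vocoder(mel_decoder(text_encoder(tokens)))
print(wav.shape)  # (3072,)
```

The point of the sketch is the interface between stages: the vocoder only ever sees a mel spectrogram, which is what makes it swappable across text-to-sound and text-to-speech systems.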
arXiv Detail & Related papers (2022-07-20T15:41:47Z)
- BigVGAN: A Universal Neural Vocoder with Large-Scale Training [49.16254684584935]
We present BigVGAN, a universal vocoder that generalizes well under various unseen conditions in a zero-shot setting.
We introduce periodic nonlinearities and anti-aliased representation into the generator, which brings the desired inductive bias for waveform synthesis.
We train our GAN vocoder at the largest scale, up to 112M parameters, which is unprecedented in the literature.
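The periodic nonlinearity BigVGAN introduces is the Snake activation, x + (1/α)·sin²(αx), which adds a periodic component while remaining monotone-like at large magnitudes. A minimal numpy sketch (in the actual model α is a learnable per-channel parameter):

```python
import numpy as np

def snake(x, alpha=1.0):
    # Snake activation: x + (1/alpha) * sin^2(alpha * x).
    # The sin^2 term injects a periodic inductive bias suited to
    # waveform synthesis; alpha controls its frequency.
    return x + np.sin(alpha * x) ** 2 / alpha

x = np.linspace(-np.pi, np.pi, 5)
print(snake(x))
```

Note snake(0) = 0 and the periodic term vanishes whenever αx is a multiple of π, so the function oscillates around the identity.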
arXiv Detail & Related papers (2022-06-09T17:56:10Z)
- VocBench: A Neural Vocoder Benchmark for Speech Synthesis [36.94062576597112]
We present VocBench, a framework that benchmarks the performance of state-of-the-art neural vocoders.
VocBench uses a systematic study to evaluate different neural vocoders in a shared environment, enabling a fair comparison between them.
Our results demonstrate that the framework can assess the competitive efficacy and the quality of the synthesized samples for each vocoder.
arXiv Detail & Related papers (2021-12-06T15:09:57Z)
- DeepA: A Deep Neural Analyzer For Speech And Singing Vocoding [71.73405116189531]
We propose a neural vocoder that extracts F0 and timbre/aperiodicity encodings from the input speech, emulating those defined in conventional vocoders.
As the deep neural analyzer is learnable, it is expected to be more accurate for signal reconstruction and manipulation, and generalizable from speech to singing.
arXiv Detail & Related papers (2021-10-13T01:39:57Z)
- Audio Captioning Transformer [44.68751180694813]
Audio captioning aims to automatically generate a natural language description of an audio clip.
Most captioning models follow an encoder-decoder architecture, where the decoder predicts words based on the audio features extracted by the encoder.
We propose an Audio Captioning Transformer (ACT), which is a full Transformer network based on an encoder-decoder architecture and is totally convolution-free.
arXiv Detail & Related papers (2021-07-21T00:31:50Z)
- SoundStream: An End-to-End Neural Audio Codec [78.94923131038682]
We present SoundStream, a novel neural audio codec that can efficiently compress speech, music and general audio.
SoundStream relies on a fully convolutional encoder/decoder network and a residual vector quantizer, which are trained jointly end-to-end.
We are able to perform joint compression and enhancement either at the encoder or at the decoder side with no additional latency.
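The residual vector quantizer mentioned above cascades several codebooks: each stage quantizes the residual error left by the previous stage, so later codebooks refine the approximation. A minimal numpy sketch with small random codebooks (the codebook count and sizes are illustrative, not SoundStream's configuration):

```python
import numpy as np

def residual_vq(x, codebooks):
    # Each stage picks the codeword nearest to the remaining residual;
    # the final quantization is the sum of the chosen codewords.
    quantized = np.zeros_like(x)
    codes = []
    for cb in codebooks:
        residual = x - quantized
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        codes.append(idx)
        quantized = quantized + cb[idx]
    return quantized, codes

rng = np.random.default_rng(0)
x = rng.normal(size=8)                              # one embedding vector
codebooks = [rng.normal(size=(32, 8)) for _ in range(3)]
q, codes = residual_vq(x, codebooks)
print(codes)  # one code index per quantization stage
```

With N stages of K codewords each, the quantizer addresses K^N effective codewords while only storing N·K entries, which is what makes the scheme bitrate-scalable: dropping later stages simply coarsens the reconstruction.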
arXiv Detail & Related papers (2021-07-07T15:45:42Z)
- NVC-Net: End-to-End Adversarial Voice Conversion [7.14505983271756]
NVC-Net is an end-to-end adversarial network that performs voice conversion directly on the raw audio waveform of arbitrary length.
Our model is capable of producing samples at a rate of more than 3600 kHz on an NVIDIA V100 GPU, being orders of magnitude faster than state-of-the-art methods.
arXiv Detail & Related papers (2021-06-02T07:19:58Z)
- RawNet: Fast End-to-End Neural Vocoder [4.507860128918788]
RawNet is a complete end-to-end neural vocoder based on the auto-encoder structure for speaker-dependent and -independent speech synthesis.
It automatically learns to extract features and recover audio using neural networks, which include a coder network to capture a higher representation of the input audio and an autoregressive voder network to restore the audio in a sample-by-sample manner.
arXiv Detail & Related papers (2019-04-10T10:25:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.