StyleMelGAN: An Efficient High-Fidelity Adversarial Vocoder with
Temporal Adaptive Normalization
- URL: http://arxiv.org/abs/2011.01557v2
- Date: Fri, 12 Feb 2021 18:21:06 GMT
- Title: StyleMelGAN: An Efficient High-Fidelity Adversarial Vocoder with
Temporal Adaptive Normalization
- Authors: Ahmed Mustafa, Nicola Pia, Guillaume Fuchs
- Abstract summary: StyleMelGAN is a lightweight neural vocoder allowing synthesis of high-fidelity speech with low computational complexity.
StyleMelGAN employs temporal adaptive normalization to style a low-dimensional noise vector with the acoustic features of the target speech.
The highly parallelizable speech generation is several times faster than real-time on CPUs and GPUs.
- Score: 9.866072912049031
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, neural vocoders have surpassed classical speech generation
approaches in naturalness and perceptual quality of the synthesized speech.
Computationally heavy models like WaveNet and WaveGlow achieve best results,
while lightweight GAN models, e.g. MelGAN and Parallel WaveGAN, remain inferior
in terms of perceptual quality. We therefore propose StyleMelGAN, a lightweight
neural vocoder allowing synthesis of high-fidelity speech with low
computational complexity. StyleMelGAN employs temporal adaptive normalization
to style a low-dimensional noise vector with the acoustic features of the
target speech. For efficient training, multiple random-window discriminators
adversarially evaluate the speech signal analyzed by a filter bank, with
regularization provided by a multi-scale spectral reconstruction loss. The
highly parallelizable speech generation is several times faster than real-time
on CPUs and GPUs. MUSHRA and P.800 listening tests show that StyleMelGAN
outperforms prior neural vocoders in copy-synthesis and Text-to-Speech
scenarios.
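
As a concrete illustration of the conditioning mechanism, below is a minimal PyTorch sketch of a temporal adaptive normalization layer in the spirit of the abstract: time-varying, per-channel scale and shift curves predicted from the mel-spectrogram style the normalized activations derived from the noise vector. The module name, layer sizes, and normalization choice are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalAdaptiveNorm(nn.Module):
    """Hypothetical TADE-style layer: normalize, then re-style with
    time-varying scale/shift curves computed from acoustic features."""

    def __init__(self, channels: int, mel_channels: int = 80):
        super().__init__()
        self.norm = nn.InstanceNorm1d(channels, affine=False)
        self.conv_shared = nn.Conv1d(mel_channels, channels, kernel_size=3, padding=1)
        self.conv_gamma = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.conv_beta = nn.Conv1d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor, mel: torch.Tensor) -> torch.Tensor:
        # x:   (batch, channels, T)         activations from the noise vector
        # mel: (batch, mel_channels, T_mel) acoustic features of target speech
        mel = F.interpolate(mel, size=x.shape[-1], mode="nearest")  # align time axes
        h = F.leaky_relu(self.conv_shared(mel), 0.2)
        gamma = self.conv_gamma(h)   # time-varying scale
        beta = self.conv_beta(h)     # time-varying shift
        return self.norm(x) * gamma + beta

# usage: style 128-channel activations with an 80-band mel-spectrogram
tade = TemporalAdaptiveNorm(channels=128, mel_channels=80)
y = tade(torch.randn(1, 128, 1000), torch.randn(1, 80, 50))  # -> (1, 128, 1000)
```

The point of the design is that gamma and beta vary along the time axis, so the acoustic features modulate the signal frame by frame rather than through a single global style vector.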
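The two training-time ingredients named in the abstract, random-window discrimination and a multi-scale spectral reconstruction loss, can be sketched just as briefly. The window and FFT sizes below are assumed values, and the filter-bank analysis the paper applies before its discriminators is omitted here.

```python
import torch

def random_window(wave: torch.Tensor, window: int = 4096) -> torch.Tensor:
    """Crop a random slice from (batch, T) waveforms; each of the multiple
    discriminators would evaluate a different random window."""
    start = int(torch.randint(0, wave.shape[-1] - window + 1, (1,)))
    return wave[..., start:start + window]

def multiscale_spectral_loss(y_hat: torch.Tensor, y: torch.Tensor,
                             fft_sizes=(512, 1024, 2048)) -> torch.Tensor:
    """L1 distance between log-magnitude STFTs at several resolutions."""
    loss = torch.zeros((), device=y.device)
    for n_fft in fft_sizes:
        win = torch.hann_window(n_fft, device=y.device)
        s_hat = torch.stft(y_hat, n_fft, hop_length=n_fft // 4,
                           window=win, return_complex=True).abs()
        s = torch.stft(y, n_fft, hop_length=n_fft // 4,
                       window=win, return_complex=True).abs()
        loss = loss + (torch.log(s_hat + 1e-7) - torch.log(s + 1e-7)).abs().mean()
    return loss / len(fft_sizes)
```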
Related papers
- CoLM-DSR: Leveraging Neural Codec Language Modeling for Multi-Modal Dysarthric Speech Reconstruction [61.067153685104394]
Dysarthric speech reconstruction (DSR) aims to transform dysarthric speech into normal speech.
Existing DSR methods still suffer from low speaker similarity and poor prosody naturalness.
We propose a multi-modal DSR model that leverages neural codec language modeling to improve reconstruction results.
arXiv Detail & Related papers (2024-06-12T15:42:21Z)
- Non-autoregressive real-time Accent Conversion model with voice cloning [0.0]
We have developed a non-autoregressive model for real-time accent conversion with voice cloning.
The model generates native-sounding L1 speech with minimal latency from input L2 speech.
It can save, clone, and change the timbre, gender, and accent of the speaker's voice in real time.
arXiv Detail & Related papers (2024-05-21T19:07:26Z)
- Wave-U-Net Discriminator: Fast and Lightweight Discriminator for Generative Adversarial Network-Based Speech Synthesis [38.27153023145183]
In speech synthesis, a generative adversarial network (GAN) trains a generator (the speech synthesizer) against a discriminator in a min-max game (a generic sketch of one such training step follows this entry).
An ensemble of discriminators is commonly used in recent neural vocoders (e.g., HiFi-GAN) and end-to-end text-to-speech (TTS) systems.
This study proposes a Wave-U-Net discriminator: a single but expressive discriminator with a Wave-U-Net architecture.
arXiv Detail & Related papers (2023-03-24T10:46:40Z)
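For reference, the min-max game mentioned in the Wave-U-Net entry above can be written as a generic single-discriminator training step. The least-squares objective and all names here are assumptions chosen for illustration, not that paper's exact formulation.

```python
import torch

def gan_vocoder_step(G, D, opt_g, opt_d, mel, wave_real):
    """One generic GAN vocoder update: D learns to separate real from
    generated waveforms; G (the speech synthesizer) learns to fool D."""
    # discriminator update: push D(real) -> 1 and D(fake) -> 0
    wave_fake = G(mel).detach()
    loss_d = ((D(wave_real) - 1) ** 2).mean() + (D(wave_fake) ** 2).mean()
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # generator update: push D(G(mel)) -> 1
    wave_fake = G(mel)
    loss_g = ((D(wave_fake) - 1) ** 2).mean()
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()
```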
- Avocodo: Generative Adversarial Network for Artifact-free Vocoder [5.956832212419584]
We propose a GAN-based neural vocoder, called Avocodo, that allows the synthesis of high-fidelity speech with reduced artifacts.
Avocodo outperforms conventional GAN-based neural vocoders in both speech and singing voice synthesis tasks and can synthesize artifact-free speech.
arXiv Detail & Related papers (2022-06-27T15:54:41Z)
- BigVGAN: A Universal Neural Vocoder with Large-Scale Training [49.16254684584935]
We present BigVGAN, a universal vocoder that generalizes well to various unseen conditions in a zero-shot setting.
We introduce periodic nonlinearities and anti-aliased representations into the generator, which bring the desired inductive bias for waveform synthesis (a sketch of such a periodic activation follows this entry).
We train our GAN vocoder at a scale of up to 112M parameters, unprecedented in the literature.
arXiv Detail & Related papers (2022-06-09T17:56:10Z)
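To make the periodic nonlinearity in the BigVGAN entry above concrete, here is a minimal sketch of the Snake activation, snake(x) = x + sin²(αx)/α, with a learnable per-channel frequency α; the anti-aliased resampling that accompanies it in that paper is omitted.

```python
import torch
import torch.nn as nn

class Snake(nn.Module):
    """Periodic activation: x + sin^2(alpha * x) / alpha, with one
    learnable frequency alpha per channel (illustrative sketch)."""

    def __init__(self, channels: int):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1, channels, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, T); the sin^2 term injects a periodic
        # inductive bias suited to waveform generation
        return x + torch.sin(self.alpha * x) ** 2 / (self.alpha + 1e-9)
```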
- NeuralDPS: Neural Deterministic Plus Stochastic Model with Multiband Excitation for Noise-Controllable Waveform Generation [67.96138567288197]
We propose a novel neural vocoder named NeuralDPS that retains high speech quality while achieving high synthesis efficiency and noise controllability.
It generates waveforms at least 280 times faster than the WaveNet vocoder.
Its synthesis is also 28% faster than WaveGAN's on a single core.
arXiv Detail & Related papers (2022-03-05T08:15:29Z)
- Discretization and Re-synthesis: an alternative method to solve the Cocktail Party Problem [65.25725367771075]
This study demonstrates, for the first time, that the synthesis-based approach can also perform well on this problem.
Specifically, we propose a novel speech separation/enhancement model based on the recognition of discrete symbols.
After the discrete symbol sequence is predicted, each target speech signal can be re-synthesized by feeding the symbols to the synthesis model.
arXiv Detail & Related papers (2021-12-17T08:35:40Z)
- RAVE: A variational autoencoder for fast and high-quality neural audio synthesis [2.28438857884398]
We introduce a Realtime Audio Variational autoEncoder (RAVE) allowing both fast and high-quality audio waveform synthesis.
We show that our model is the first able to generate 48 kHz audio signals while running 20 times faster than real time on a standard laptop CPU.
arXiv Detail & Related papers (2021-11-09T09:07:30Z)
- Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottleneck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z)
- Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from a multi-speaker mixture.
arXiv Detail & Related papers (2020-01-02T11:12:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.