HiFi-GAN: Generative Adversarial Networks for Efficient and High
Fidelity Speech Synthesis
- URL: http://arxiv.org/abs/2010.05646v2
- Date: Fri, 23 Oct 2020 09:12:04 GMT
- Title: HiFi-GAN: Generative Adversarial Networks for Efficient and High
Fidelity Speech Synthesis
- Authors: Jungil Kong, Jaehyeon Kim, Jaekyoung Bae
- Abstract summary: We propose HiFi-GAN, which achieves both efficient and high-fidelity speech synthesis.
A subjective human evaluation on a single-speaker dataset indicates that the proposed method achieves quality comparable to human speech.
A small-footprint version of HiFi-GAN generates samples 13.4 times faster than real-time on CPU with quality comparable to an autoregressive counterpart.
- Score: 12.934180951771596
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Several recent works on speech synthesis have employed generative adversarial
networks (GANs) to produce raw waveforms. Although such methods improve the
sampling efficiency and memory usage, their sample quality has not yet reached
that of autoregressive and flow-based generative models. In this work, we
propose HiFi-GAN, which achieves both efficient and high-fidelity speech
synthesis. As speech audio consists of sinusoidal signals with various periods,
we demonstrate that modeling the periodic patterns of audio is crucial for
enhancing sample quality. A subjective human evaluation (mean opinion score,
MOS) on a single-speaker dataset indicates that our proposed method reaches
quality comparable to human speech while generating 22.05 kHz
high-fidelity audio 167.9 times faster than real-time on a single V100 GPU. We
further show the generality of HiFi-GAN to the mel-spectrogram inversion of
unseen speakers and end-to-end speech synthesis. Finally, a small-footprint
version of HiFi-GAN generates samples 13.4 times faster than real-time on CPU
with quality comparable to an autoregressive counterpart.
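The periodicity claim above corresponds to HiFi-GAN's multi-period discriminator, which folds the 1D waveform into a 2D grid whose width equals a candidate period so that convolutions compare samples exactly one period apart. Below is a minimal PyTorch sketch of that reshaping idea; the layer widths, strides, and number of layers are illustrative choices rather than the paper's exact configuration, though the prime periods 2, 3, 5, 7, 11 follow the paper.

    # Minimal sketch (not the official implementation): fold a waveform by a
    # fixed period and score it with 2D convolutions whose kernels span only
    # the folded time axis, i.e. every period-th sample.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PeriodDiscriminator(nn.Module):
        def __init__(self, period: int):
            super().__init__()
            self.period = period
            # Illustrative channel widths; kernel (5, 1) with stride (3, 1)
            # moves only along the folded time axis.
            self.convs = nn.ModuleList([
                nn.Conv2d(1, 32, (5, 1), stride=(3, 1), padding=(2, 0)),
                nn.Conv2d(32, 128, (5, 1), stride=(3, 1), padding=(2, 0)),
                nn.Conv2d(128, 256, (5, 1), stride=(3, 1), padding=(2, 0)),
            ])
            self.out = nn.Conv2d(256, 1, (3, 1), padding=(1, 0))

        def forward(self, wav: torch.Tensor) -> torch.Tensor:
            # wav: (batch, 1, T). Pad so T is divisible by the period, then
            # reshape to (batch, 1, T // period, period).
            b, c, t = wav.shape
            pad = (self.period - t % self.period) % self.period
            if pad:
                wav = F.pad(wav, (0, pad), mode="reflect")
            x = wav.view(b, c, -1, self.period)
            for conv in self.convs:
                x = F.leaky_relu(conv(x), 0.1)
            return self.out(x)  # patch-wise real/fake scores

    # One such sub-discriminator per prime period, as in the paper.
    discriminators = nn.ModuleList(PeriodDiscriminator(p) for p in [2, 3, 5, 7, 11])
    scores = [d(torch.randn(1, 1, 22050)) for d in discriminators]

In the full model, these period-based discriminators are trained jointly with multi-scale discriminators that operate on the raw and downsampled waveforms.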
Related papers
- Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching [51.70360630470263]
Video-to-audio (V2A) generation aims to synthesize content-matching audio from silent video.
We propose Frieren, a V2A model based on rectified flow matching.
Experiments indicate that Frieren achieves state-of-the-art performance in both generation quality and temporal alignment.
arXiv Detail & Related papers (2024-06-01T06:40:22Z)
- SpecDiff-GAN: A Spectrally-Shaped Noise Diffusion GAN for Speech and Music Synthesis [0.0]
We introduce SpecDiff-GAN, a neural vocoder based on HiFi-GAN.
We show the merits of our proposed model for speech and music synthesis on several datasets.
arXiv Detail & Related papers (2024-01-30T09:17:57Z)
- From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion [84.138804145918]
Deep generative models can generate high-fidelity audio conditioned on various types of representations.
However, these models are prone to generating audible artifacts when the conditioning is flawed or imperfect.
We propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality from low-bitrate discrete representations.
arXiv Detail & Related papers (2023-08-02T22:14:29Z)
- Boosting Fast and High-Quality Speech Synthesis with Linear Diffusion [85.54515118077825]
This paper proposes a linear diffusion model (LinDiff) based on an ordinary differential equation to simultaneously reach fast inference and high sample quality.
To reduce computational complexity, LinDiff employs a patch-based processing approach that partitions the input signal into small patches.
Our model can synthesize speech of a quality comparable to that of autoregressive models with faster synthesis speed.
arXiv Detail & Related papers (2023-06-09T07:02:43Z)
- CoMoSpeech: One-Step Speech and Singing Voice Synthesis via Consistency Model [41.21042900853639]
We propose a "Co"nsistency "Mo"del-based "Speech" synthesis method, CoMoSpeech, which achieves speech synthesis through a single diffusion sampling step.
By generating audio recordings in a single sampling step, CoMoSpeech achieves an inference speed more than 150 times faster than real-time.
arXiv Detail & Related papers (2023-05-11T15:51:46Z)
- FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis [90.3069686272524]
This paper proposes FastDiff, a fast conditional diffusion model for high-quality speech synthesis.
FastDiff employs a stack of time-aware location-variable convolutions of diverse receptive field patterns to efficiently model long-term time dependencies.
Based on FastDiff, we design an end-to-end text-to-speech synthesizer, FastDiff-TTS, which generates high-fidelity speech waveforms.
arXiv Detail & Related papers (2022-04-21T07:49:09Z)
- RAVE: A variational autoencoder for fast and high-quality neural audio synthesis [2.28438857884398]
We introduce a Realtime Audio Variational autoEncoder (RAVE) allowing both fast and high-quality audio waveform synthesis.
We show that our model is the first able to generate 48kHz audio signals, while simultaneously running 20 times faster than real-time on a standard laptop CPU.
arXiv Detail & Related papers (2021-11-09T09:07:30Z)
- A Study on Speech Enhancement Based on Diffusion Probabilistic Model [63.38586161802788]
We propose a diffusion probabilistic model-based speech enhancement model (DiffuSE) that aims to recover clean speech signals from noisy signals.
The experimental results show that DiffuSE yields performance that is comparable to related audio generative models on the standardized Voice Bank corpus task.
arXiv Detail & Related papers (2021-07-25T19:23:18Z)
- HiFiSinger: Towards High-Fidelity Neural Singing Voice Synthesis [153.48507947322886]
HiFiSinger is a singing voice synthesis (SVS) system aimed at high-fidelity singing voices.
It consists of a FastSpeech based acoustic model and a Parallel WaveGAN based vocoder.
Experimental results show that HiFiSinger synthesizes singing voices of much higher quality than baseline systems.
arXiv Detail & Related papers (2020-09-03T16:31:02Z)
- VocGAN: A High-Fidelity Real-time Vocoder with a Hierarchically-nested Adversarial Network [9.274656542624658]
A recently developed GAN-based vocoder, MelGAN, produces speech waveforms in real-time.
VocGAN is nearly as fast as MelGAN, but it significantly improves the quality and consistency of the output waveform.
In experiments, VocGAN synthesizes speech waveforms 416.7x faster than real-time on a GTX 1080Ti GPU and 3.24x faster than real-time on a CPU.
arXiv Detail & Related papers (2020-07-30T06:33:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.