HiFTNet: A Fast High-Quality Neural Vocoder with Harmonic-plus-Noise
Filter and Inverse Short Time Fourier Transform
- URL: http://arxiv.org/abs/2309.09493v1
- Date: Mon, 18 Sep 2023 05:30:15 GMT
- Authors: Yinghao Aaron Li, Cong Han, Xilin Jiang, Nima Mesgarani
- Abstract summary: We introduce an extension to iSTFTNet, termed HiFTNet, which incorporates a harmonic-plus-noise source filter in the time-frequency domain.
Subjective evaluations on LJSpeech show that our model significantly outperforms both iSTFTNet and HiFi-GAN.
Our work sets a new benchmark for efficient, high-quality neural vocoding, paving the way for real-time applications.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent advancements in speech synthesis have leveraged GAN-based networks
like HiFi-GAN and BigVGAN to produce high-fidelity waveforms from
mel-spectrograms. However, these networks are computationally expensive and
parameter-heavy. iSTFTNet addresses these limitations by integrating inverse
short-time Fourier transform (iSTFT) into the network, achieving both speed and
parameter efficiency. In this paper, we introduce an extension to iSTFTNet,
termed HiFTNet, which incorporates a harmonic-plus-noise source filter in the
time-frequency domain. The filter uses a sinusoidal source derived from the
fundamental frequency (F0), which is inferred via a pre-trained F0 estimation
network for fast inference. Subjective evaluations on LJSpeech show that our model
significantly outperforms both iSTFTNet and HiFi-GAN, achieving
ground-truth-level performance. HiFTNet also outperforms BigVGAN-base on
LibriTTS for unseen speakers and achieves comparable performance to BigVGAN
while being four times faster with only $1/6$ of the parameters. Our work sets
a new benchmark for efficient, high-quality neural vocoding, paving the way for
real-time applications that demand high-quality speech synthesis.
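The harmonic-plus-noise idea above can be illustrated with a minimal NumPy sketch: a sum of sinusoidal harmonics (phase accumulated from frame-level F0) serves as the voiced excitation, and Gaussian noise models the unvoiced component. This is only an illustrative approximation of the source-signal concept, not the paper's implementation — HiFTNet applies its source filter with learned networks in the time-frequency domain, and every function name and parameter below (`harmonic_plus_noise_source`, `sine_amp`, `noise_std`, etc.) is hypothetical.

```python
import numpy as np

def harmonic_plus_noise_source(f0, sr=22050, hop=256, n_harmonics=8,
                               sine_amp=0.1, noise_std=0.003):
    """Illustrative harmonic-plus-noise excitation signal.

    f0: frame-level fundamental frequency in Hz (0 for unvoiced frames).
    Returns a waveform of length len(f0) * hop.
    """
    # Upsample frame-level F0 to the audio sample rate
    f0_up = np.repeat(np.asarray(f0, dtype=np.float64), hop)
    voiced = f0_up > 0

    # Sum of harmonics: the phase of harmonic k is the running
    # integral (cumulative sum) of its instantaneous frequency k * F0
    harmonics = np.zeros_like(f0_up)
    for k in range(1, n_harmonics + 1):
        phase = 2.0 * np.pi * np.cumsum(k * f0_up / sr)
        harmonics += (sine_amp / n_harmonics) * np.sin(phase)

    # Harmonic part only in voiced regions; low-level noise everywhere
    noise = np.random.randn(len(f0_up)) * noise_std
    return np.where(voiced, harmonics, 0.0) + noise

# Example: 100 frames at a constant 220 Hz followed by 20 unvoiced frames
f0 = np.concatenate([np.full(100, 220.0), np.zeros(20)])
excitation = harmonic_plus_noise_source(f0)
```

In a neural source-filter vocoder, an excitation of this kind conditions the generator so that pitch is carried by the source rather than learned implicitly, which is what enables the fast, pitch-consistent inference described in the abstract.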
Related papers
- OFDM-Standard Compatible SC-NOFS Waveforms for Low-Latency and Jitter-Tolerance Industrial IoT Communications [53.398544571833135]
This work proposes a spectrally efficient irregular Sinc (irSinc) shaping technique, revisiting the traditional Sinc pulse, which dates back to 1924.
irSinc yields a signal with increased spectral efficiency without sacrificing error performance.
Using the 5G standard signal configuration, our signal achieves faster data transmission within the same spectral bandwidth.
arXiv Detail & Related papers (2024-06-07T09:20:30Z)
- Fourier Controller Networks for Real-Time Decision-Making in Embodied Learning [42.862705980039784]
Transformer has shown promise in reinforcement learning for modeling time-varying features, but it still suffers from low data efficiency and high inference latency.
In this paper, we propose to investigate the task from the new perspective of the frequency domain.
arXiv Detail & Related papers (2024-05-30T09:43:59Z)
- WFTNet: Exploiting Global and Local Periodicity in Long-term Time Series Forecasting [61.64303388738395]
We propose a Wavelet-Fourier Transform Network (WFTNet) for long-term time series forecasting.
Tests on various time series datasets show that WFTNet consistently outperforms other state-of-the-art baselines.
arXiv Detail & Related papers (2023-09-20T13:44:18Z)
- Adaptive Frequency Filters As Efficient Global Token Mixers [100.27957692579892]
We show that adaptive frequency filters can serve as efficient global token mixers.
We take AFF token mixers as the primary neural operator to build a lightweight neural network, dubbed AFFNet.
arXiv Detail & Related papers (2023-07-26T07:42:28Z)
- FastFit: Towards Real-Time Iterative Neural Vocoder by Replacing U-Net Encoder With Multiple STFTs [1.8047694351309207]
FastFit is a novel neural vocoder architecture that replaces the U-Net encoder with multiple short-time Fourier transforms (STFTs).
We show that FastFit achieves nearly twice the generation speed of comparable baseline vocoders while maintaining high sound quality.
arXiv Detail & Related papers (2023-05-18T09:05:17Z)
- Transform Once: Efficient Operator Learning in Frequency Domain [69.74509540521397]
We study deep neural networks designed to harness the structure of the frequency domain for efficient learning of long-range correlations in space or time.
This work introduces a blueprint for frequency-domain learning through a single transform: transform once (T1).
arXiv Detail & Related papers (2022-11-26T01:56:05Z)
- Source-Filter HiFi-GAN: Fast and Pitch Controllable High-Fidelity Neural Vocoder [29.219277429553788]
We introduce source-filter theory into HiFi-GAN to achieve high voice quality and pitch controllability.
Our proposed method outperforms HiFi-GAN and uSFGAN on singing voice generation in both voice quality and synthesis speed on a single CPU.
Unlike the uSFGAN vocoder, the proposed method can easily be adopted or integrated into real-time applications and end-to-end systems.
arXiv Detail & Related papers (2022-10-27T15:19:09Z)
- WOLONet: Wave Outlooker for Efficient and High Fidelity Speech Synthesis [4.689359813220365]
We propose an effective and lightweight neural vocoder called WOLONet.
We develop a novel lightweight block that uses a location-variable, channel-independent, depthwise dynamic convolutional kernel with sinusoidally activated dynamic kernel weights.
The results show that WOLONet achieves the best generation quality while requiring fewer parameters than two state-of-the-art neural vocoders, HiFi-GAN and UnivNet.
arXiv Detail & Related papers (2022-06-20T17:58:52Z)
- NeuralDPS: Neural Deterministic Plus Stochastic Model with Multiband Excitation for Noise-Controllable Waveform Generation [67.96138567288197]
We propose a novel neural vocoder named NeuralDPS that retains high speech quality while achieving high synthesis efficiency and noise controllability.
It generates waveforms at least 280 times faster than the WaveNet vocoder and is 28% faster than WaveGAN's synthesis on a single CPU core.
arXiv Detail & Related papers (2022-03-05T08:15:29Z)
- Fourier Space Losses for Efficient Perceptual Image Super-Resolution [131.50099891772598]
We show that the performance of a recently introduced efficient generator architecture can be improved solely by applying our proposed loss functions.
Our losses' direct emphasis on frequencies in Fourier space significantly boosts perceptual image quality.
The trained generator achieves results comparable to the state-of-the-art perceptual SR methods RankSRGAN and SRFlow while being 2.4x and 48x faster, respectively.
arXiv Detail & Related papers (2021-06-01T20:34:52Z)
- A non-causal FFTNet architecture for speech enhancement [18.583426581177278]
We propose a parallel, non-causal, shallow waveform-domain architecture for speech enhancement based on FFTNet.
By using a shallow network and applying non-causality within certain limits, the proposed FFTNet uses far fewer parameters than other neural-network-based approaches.
arXiv Detail & Related papers (2020-06-08T10:49:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.