PeriodWave: Multi-Period Flow Matching for High-Fidelity Waveform Generation
- URL: http://arxiv.org/abs/2408.07547v1
- Date: Wed, 14 Aug 2024 13:36:17 GMT
- Title: PeriodWave: Multi-Period Flow Matching for High-Fidelity Waveform Generation
- Authors: Sang-Hoon Lee, Ha-Yeong Choi, Seong-Whan Lee
- Abstract summary: We propose PeriodWave, a novel universal waveform generation model.
We introduce a period-aware flow matching estimator that can capture the periodic features of the waveform signal.
We also propose a single period-conditional universal estimator that supports parallel feed-forward inference via period-wise batching.
- Score: 37.35829410807451
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recently, universal waveform generation tasks have been investigated conditioned on various out-of-distribution scenarios. Although GAN-based methods have shown their strength in fast waveform generation, they are vulnerable to train-inference mismatch scenarios such as two-stage text-to-speech. Meanwhile, diffusion-based models have shown powerful generative performance in other domains; however, they stay out of the limelight due to their slow inference speed in waveform generation tasks. Above all, there is no generator architecture that can explicitly disentangle the natural periodic features of high-resolution waveform signals. In this paper, we propose PeriodWave, a novel universal waveform generation model. First, we introduce a period-aware flow matching estimator that can capture the periodic features of the waveform signal when estimating the vector fields. Additionally, we utilize a multi-period estimator that avoids overlaps to capture different periodic features of waveform signals. Although increasing the number of periods can improve performance significantly, it requires more computational cost. To mitigate this, we also propose a single period-conditional universal estimator that runs feed-forward in parallel via period-wise batch inference. Additionally, we utilize the discrete wavelet transform to losslessly disentangle the frequency information of waveform signals for high-frequency modeling, and introduce FreeU to reduce high-frequency noise in waveform generation. The experimental results demonstrate that our model outperforms previous models in both Mel-spectrogram reconstruction and text-to-speech tasks. All source code will be available at \url{https://github.com/sh-lee-prml/PeriodWave}.
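A minimal sketch of the two mechanisms named in the abstract, under stated assumptions: conditional flow matching regresses a network onto the velocity of a straight noise-to-data path, and period awareness is obtained by folding the 1-D waveform into 2-D views of several periods (in the spirit of multi-period discriminators in GAN vocoders). The `estimator` interface, the period set, the per-period loss averaging, and the truncation-based folding below are illustrative assumptions, not the authors' implementation.

```python
import torch

def reshape_by_period(x, p):
    """Fold a waveform batch (B, T) into a 2-D view (B, 1, T//p, p) so that a
    2-D network sees samples spaced p steps apart along one axis. The simple
    truncation used here is an illustrative choice, not the paper's padding."""
    B, T = x.shape
    T = T - (T % p)                       # truncate so T is divisible by p
    return x[:, :T].reshape(B, 1, T // p, p)

def flow_matching_loss(estimator, x1, mel, periods=(1, 2, 3, 5, 7)):
    """Conditional flow matching on raw waveforms: sample t ~ U(0, 1), build
    the straight path x_t = (1 - t) * x0 + t * x1 from noise x0 to data x1,
    and regress the estimator onto the target velocity (x1 - x0).
    `estimator(view, t, mel, p)` is a hypothetical period-conditioned network
    that returns a velocity field with the same shape as `view`."""
    x0 = torch.randn_like(x1)                          # noise endpoint
    t = torch.rand(x1.size(0), 1, device=x1.device)    # one time per sample
    xt = (1 - t) * x0 + t * x1
    target = x1 - x0                                   # velocity of the straight path
    loss = 0.0
    for p in periods:                                  # period-wise estimation
        v = estimator(reshape_by_period(xt, p), t, mel, p)
        loss = loss + torch.mean((v - reshape_by_period(target, p)) ** 2)
    return loss / len(periods)
```

Choosing periods that do not divide one another (for example, small primes) keeps the different 2-D views from grouping redundant sample lags, which is one way to read the abstract's remark about avoiding overlaps.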
Related papers
- Accelerating High-Fidelity Waveform Generation via Adversarial Flow Matching Optimization [37.35829410807451]
This paper introduces PeriodWave-Turbo, a high-fidelity and highly efficient waveform generation model obtained via adversarial flow matching optimization (a generic few-step sampling sketch follows this entry).
It only requires 1,000 steps of fine-tuning to achieve state-of-the-art performance across various objective metrics.
By scaling up the backbone of PeriodWave from 29M to 70M parameters for improved generalization, PeriodWave-Turbo achieves unprecedented performance.
arXiv Detail & Related papers (2024-08-15T08:34:00Z)
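Since generation with a flow-matching estimator amounts to integrating the learned vector field from noise toward data, acceleration means doing this integration in very few steps. The sketch below is a generic fixed-step Euler sampler, assuming a hypothetical `vector_field(x, t, mel)` estimator and an arbitrary step count; it is not the PeriodWave-Turbo sampler itself.

```python
import torch

@torch.no_grad()
def sample_euler(vector_field, mel, length, steps=4):
    """Generic few-step Euler ODE sampler for a flow-matching model: integrate
    dx/dt = v(x, t, mel) from t = 0 (noise) to t = 1 (waveform).
    `vector_field` is a hypothetical trained estimator."""
    x = torch.randn(mel.size(0), length, device=mel.device)   # start from noise
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((x.size(0), 1), i * dt, device=x.device)
        x = x + dt * vector_field(x, t, mel)                   # Euler update
    return x
```

The adversarial fine-tuning described above is what is claimed to keep quality high at such small step counts; the sampler itself stays this simple.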
- RFWave: Multi-band Rectified Flow for Audio Waveform Reconstruction [12.64898580131053]
We introduce RFWave, a cutting-edge multi-band Rectified Flow approach to reconstruct high-fidelity audio waveforms from Mel-spectrograms or discrete acoustic tokens.
RFWave uniquely generates complex spectrograms and operates at the frame level, processing all subbands simultaneously to boost efficiency.
Our empirical evaluations show that RFWave not only provides outstanding reconstruction quality but also offers vastly superior computational efficiency, enabling audio generation at speeds up to 160 times faster than real-time on a GPU.
arXiv Detail & Related papers (2024-03-08T03:16:47Z)
- DiffAR: Denoising Diffusion Autoregressive Model for Raw Speech Waveform Generation [25.968115316199246]
This work proposes a diffusion probabilistic end-to-end model for generating a raw speech waveform.
Our model is autoregressive, generating overlapping frames sequentially, where each frame is conditioned on a portion of the previously generated one (a sketch of this loop follows the entry).
Experiments show that the proposed model generates speech with superior quality compared with other state-of-the-art neural speech generation systems.
arXiv Detail & Related papers (2023-10-02T17:42:22Z)
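The DiffAR summary above describes sequential generation of overlapping frames, each conditioned on a portion of the previously generated one. The loop below sketches only that outer mechanism; `model`, the overlap length, and the linear cross-fade are illustrative assumptions, and the per-frame diffusion sampling is hidden behind the `model` call.

```python
import torch

@torch.no_grad()
def generate_overlapping_frames(model, mel_frames, overlap=240):
    """Hypothetical overlapping-frame autoregression: each frame is generated
    conditioned on the tail of the audio produced so far, and the shared
    region is cross-faded to avoid boundary artifacts."""
    fade = torch.linspace(0.0, 1.0, overlap)       # linear cross-fade ramp
    audio = torch.zeros(0)
    context = torch.zeros(overlap)                 # silent bootstrap context
    for mel in mel_frames:
        frame = model(context, mel)                # assumed to return one fixed-length frame
        if audio.numel() == 0:
            audio = frame
        else:
            audio[-overlap:] = audio[-overlap:] * (1 - fade) + frame[:overlap] * fade
            audio = torch.cat([audio, frame[overlap:]])
        context = audio[-overlap:]                 # condition the next frame on this tail
    return audio
```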
- WFTNet: Exploiting Global and Local Periodicity in Long-term Time Series Forecasting [61.64303388738395]
We propose a Wavelet-Fourier Transform Network (WFTNet) for long-term time series forecasting.
Tests on various time series datasets show WFTNet consistently outperforms other state-of-the-art baselines.
arXiv Detail & Related papers (2023-09-20T13:44:18Z)
- FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis [90.3069686272524]
This paper proposes FastDiff, a fast conditional diffusion model for high-quality speech synthesis.
FastDiff employs a stack of time-aware location-variable convolutions of diverse receptive field patterns to efficiently model long-term time dependencies.
Based on FastDiff, we design an end-to-end text-to-speech synthesizer, FastDiff-TTS, which generates high-fidelity speech waveforms.
arXiv Detail & Related papers (2022-04-21T07:49:09Z)
- A Study on Speech Enhancement Based on Diffusion Probabilistic Model [63.38586161802788]
We propose a diffusion probabilistic model-based speech enhancement model (DiffuSE) that aims to recover clean speech signals from noisy signals.
The experimental results show that DiffuSE yields performance that is comparable to related audio generative models on the standardized Voice Bank corpus task.
arXiv Detail & Related papers (2021-07-25T19:23:18Z)
- PeriodNet: A non-autoregressive waveform generation model with a structure separating periodic and aperiodic components [32.3009716052971]
We propose a non-autoregressive (non-AR) waveform generation model with a new model structure for modeling periodic and aperiodic components in speech waveforms.
Non-AR waveform generation models can generate speech waveforms in parallel and can serve as a speech vocoder when conditioned on acoustic features.
arXiv Detail & Related papers (2021-02-15T19:00:08Z)
- DiffWave: A Versatile Diffusion Model for Audio Synthesis [35.406438835268816]
DiffWave is a versatile diffusion probabilistic model for conditional and unconditional waveform generation.
It produces high-fidelity audio in different waveform generation tasks, including neural vocoding conditioned on mel spectrograms.
It significantly outperforms autoregressive and GAN-based waveform models in the challenging unconditional generation task.
arXiv Detail & Related papers (2020-09-21T11:20:38Z)
- WaveGrad: Estimating Gradients for Waveform Generation [55.405580817560754]
WaveGrad is a conditional model for waveform generation which estimates gradients of the data density.
It starts from a Gaussian white noise signal and iteratively refines the signal via a gradient-based sampler conditioned on the mel-spectrogram (a simplified version of this loop is sketched after this entry).
We find that it can generate high fidelity audio samples using as few as six iterations.
arXiv Detail & Related papers (2020-09-02T17:44:10Z)
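The refinement described above can be pictured as a denoising-diffusion loop over the raw waveform, conditioned on the mel-spectrogram. The sketch below is a simplified DDPM-style sampler, not WaveGrad's exact algorithm: the real model conditions on a continuous noise level and relies on a carefully tuned short schedule (which is how it reaches as few as six iterations), and `eps_model` is a hypothetical noise predictor.

```python
import torch

@torch.no_grad()
def refine_from_noise(eps_model, mel, length, betas):
    """Simplified DDPM-style refinement: start from Gaussian noise and
    repeatedly denoise, conditioned on the mel-spectrogram. `eps_model(y, mel,
    alpha_bar)` is a hypothetical noise predictor; `betas` is a 1-D schedule."""
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    y = torch.randn(mel.size(0), length, device=mel.device)    # white-noise start
    for n in reversed(range(len(betas))):
        a, ab = alphas[n], alpha_bars[n]
        eps = eps_model(y, mel, ab)                             # predicted noise
        y = (y - (1 - a) / torch.sqrt(1 - ab) * eps) / torch.sqrt(a)
        if n > 0:
            y = y + torch.sqrt(betas[n]) * torch.randn_like(y)  # inject noise except at the end
    return y
```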
- Real Time Speech Enhancement in the Waveform Domain [99.02180506016721]
We present a causal speech enhancement model working on the raw waveform that runs in real time on a laptop CPU.
The proposed model is based on an encoder-decoder architecture with skip connections (a minimal causal sketch follows this entry).
It is capable of removing various kinds of background noise including stationary and non-stationary noises.
arXiv Detail & Related papers (2020-06-23T09:19:13Z)
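The model named above is a causal encoder-decoder with skip connections operating directly on the raw waveform. The module below is a deliberately tiny sketch of that shape, not the paper's (Demucs-style) network: layer count, channel widths, kernel sizes, and the plain ReLU nonlinearity are illustrative assumptions, and the small look-ahead introduced by the transposed convolutions is ignored here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyCausalEnhancer(nn.Module):
    """Minimal causal encoder-decoder with skip connections on raw audio.
    Sizes are arbitrary; this only illustrates the overall structure."""
    def __init__(self, channels=(1, 16, 32)):
        super().__init__()
        self.encoders = nn.ModuleList(
            nn.Conv1d(cin, cout, kernel_size=8, stride=4)
            for cin, cout in zip(channels[:-1], channels[1:]))
        self.decoders = nn.ModuleList(
            nn.ConvTranspose1d(cout, cin, kernel_size=8, stride=4)
            for cin, cout in zip(channels[:-1], channels[1:]))

    def forward(self, x):                          # x: (batch, 1, time)
        length, skips = x.size(-1), []
        for enc in self.encoders:
            x = F.pad(x, (enc.kernel_size[0] - enc.stride[0], 0))  # causal: pad on the left only
            x = torch.relu(enc(x))
            skips.append(x)
        skips.pop()                                # the deepest encoding is x itself
        for dec in reversed(self.decoders):
            x = dec(x)
            if skips:                              # add the matching encoder output (skip connection)
                skip = skips.pop()
                x = torch.relu(x[..., :skip.size(-1)] + skip)
        return x[..., :length]                     # trim to the input length; last layer stays linear
```

Usage sketch: `TinyCausalEnhancer()(torch.randn(1, 1, 16000))` returns an estimate of the clean waveform with the same length as the noisy input.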
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.