CoMoSpeech: One-Step Speech and Singing Voice Synthesis via Consistency Model
- URL: http://arxiv.org/abs/2305.06908v4
- Date: Sun, 29 Oct 2023 14:12:08 GMT
- Title: CoMoSpeech: One-Step Speech and Singing Voice Synthesis via Consistency Model
- Authors: Zhen Ye, Wei Xue, Xu Tan, Jie Chen, Qifeng Liu, Yike Guo
- Abstract summary: We propose a "Co"nsistency "Mo"del-based "Speech" synthesis method, CoMoSpeech, which achieves speech synthesis in a single diffusion sampling step.
By generating audio recordings in a single sampling step, CoMoSpeech achieves an inference speed more than 150 times faster than real-time.
- Score: 41.21042900853639
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Denoising diffusion probabilistic models (DDPMs) have shown promising
performance for speech synthesis. However, a large number of iterative steps
are required to achieve high sample quality, which restricts the inference
speed. Maintaining sample quality while increasing sampling speed has become a
challenging task. In this paper, we propose a "Co"nsistency "Mo"del-based
"Speech" synthesis method, CoMoSpeech, which achieves speech synthesis through a
single diffusion sampling step while maintaining high audio quality. The
consistency constraint is applied to distill a consistency model from a
well-designed diffusion-based teacher model, which ultimately yields superior
performance in the distilled CoMoSpeech. Our experiments show that by
generating audio recordings in a single sampling step, CoMoSpeech achieves
an inference speed more than 150 times faster than real-time on a single NVIDIA
A100 GPU, which is comparable to FastSpeech2, making diffusion-sampling based
speech synthesis truly practical. Meanwhile, objective and subjective
evaluations on text-to-speech and singing voice synthesis show that the
proposed teacher models yield the best audio quality, and the one-step
sampling-based CoMoSpeech achieves the best inference speed with audio quality
better than or comparable to conventional multi-step diffusion baselines. Audio
samples are available at https://comospeech.github.io/.
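
The abstract describes the training recipe only at a high level: sample two adjacent noise levels on the teacher's probability-flow ODE trajectory and train the student so that both points map to the same clean output, which is what later permits one-step synthesis. The PyTorch sketch below illustrates that consistency-distillation idea under stated assumptions (an EDM-style noise schedule, a teacher network that predicts the clean mel-spectrogram, and hypothetical model and function names); it is a minimal sketch, not the authors' released implementation.

```python
# Minimal consistency-distillation sketch (PyTorch). All model names, the
# conditioning interface, and the EDM-style noise schedule are illustrative
# assumptions, not CoMoSpeech's released code.
import torch
import torch.nn.functional as F

def consistency_distillation_step(student, ema_student, teacher, x0, cond,
                                  sigmas, opt, ema_decay=0.95):
    """One training step: push the student to map adjacent points on the same
    teacher ODE trajectory to the same clean mel-spectrogram."""
    # Pick a random pair of adjacent noise levels per sample.
    n = torch.randint(0, len(sigmas) - 1, (x0.size(0),), device=x0.device)
    s_cur = sigmas[n].view(-1, 1, 1)
    s_next = sigmas[n + 1].view(-1, 1, 1)

    # Perturb the clean mel x0 to the larger noise level.
    x_next = x0 + s_next * torch.randn_like(x0)

    with torch.no_grad():
        # Teacher denoiser -> slope of the probability-flow ODE, then one
        # Euler step back to the smaller noise level.
        d = (x_next - teacher(x_next, s_next, cond)) / s_next
        x_cur = x_next + (s_cur - s_next) * d
        # Target from the EMA copy of the student (the "target network").
        target = ema_student(x_cur, s_cur, cond)

    # Consistency constraint: both points should map to the same output.
    # (The student is assumed to be parameterized so that f(x, sigma_min) = x.)
    loss = F.mse_loss(student(x_next, s_next, cond), target)
    opt.zero_grad()
    loss.backward()
    opt.step()

    # Exponential-moving-average update of the target network.
    with torch.no_grad():
        for p_ema, p in zip(ema_student.parameters(), student.parameters()):
            p_ema.mul_(ema_decay).add_(p, alpha=1.0 - ema_decay)
    return loss.item()

@torch.no_grad()
def one_step_synthesis(student, cond, sigma_max, shape, device="cuda"):
    """One-step sampling: a single student evaluation maps pure noise at
    sigma_max directly to a mel-spectrogram estimate."""
    x_T = sigma_max * torch.randn(shape, device=device)
    sigma = torch.full((shape[0], 1, 1), sigma_max, device=device)
    return student(x_T, sigma, cond)
```

At inference time the distilled student is called once, as in one_step_synthesis above, which is what yields the reported faster-than-real-time generation.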
Related papers
- DMDSpeech: Distilled Diffusion Model Surpassing The Teacher in Zero-shot Speech Synthesis via Direct Metric Optimization [12.310318928818546]
We propose a novel method of distilling TTS diffusion models with direct end-to-end evaluation metric optimization.
We show DMDSpeech consistently surpasses prior state-of-the-art models in both naturalness and speaker similarity.
This work highlights the potential of direct metric optimization in speech synthesis, allowing models to better align with human auditory preferences.
arXiv Detail & Related papers (2024-10-14T21:17:58Z) - FlashSpeech: Efficient Zero-Shot Speech Synthesis [37.883762387219676]
FlashSpeech is a large-scale zero-shot speech synthesis system that requires approximately 5% of the inference time of previous work.
We show that FlashSpeech can be about 20 times faster than other zero-shot speech synthesis systems while maintaining comparable performance in terms of voice quality and similarity.
arXiv Detail & Related papers (2024-04-23T02:57:46Z) - CoMoSVC: Consistency Model-based Singing Voice Conversion [40.08004069518143]
We propose CoMoSVC, a consistency model-based Singing Voice Conversion method.
CoMoSVC aims to achieve both high-quality generation and high-speed sampling.
Experiments on a single NVIDIA GeForce RTX 4090 GPU reveal that CoMoSVC has a significantly faster inference speed than the state-of-the-art (SOTA) diffusion-based SVC system.
arXiv Detail & Related papers (2024-01-03T15:47:17Z) - Boosting Fast and High-Quality Speech Synthesis with Linear Diffusion [85.54515118077825]
This paper proposes a linear diffusion model (LinDiff) based on an ordinary differential equation to simultaneously reach fast inference and high sample quality.
To reduce computational complexity, LinDiff employs a patch-based processing approach that partitions the input signal into small patches.
Our model can synthesize speech of a quality comparable to that of autoregressive models with faster synthesis speed.
arXiv Detail & Related papers (2023-06-09T07:02:43Z) - NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers [90.83782600932567]
We develop NaturalSpeech 2, a TTS system that leverages a neural audio codec with residual vector quantizers to obtain quantized latent vectors.
We scale NaturalSpeech 2 to large-scale datasets with 44K hours of speech and singing data and evaluate its voice quality on unseen speakers.
NaturalSpeech 2 outperforms previous TTS systems by a large margin in terms of prosody/timbre similarity, robustness, and voice quality in a zero-shot setting.
arXiv Detail & Related papers (2023-04-18T16:31:59Z) - ResGrad: Residual Denoising Diffusion Probabilistic Models for Text to Speech [37.29193613404699]
DDPMs are emerging in text-to-speech (TTS) synthesis because of their strong capability of generating high-fidelity samples.
Previous works have explored speeding up inference by minimizing the number of inference steps, but at the cost of sample quality.
We propose ResGrad, a lightweight diffusion model which learns to refine the output spectrogram of an existing TTS model.
arXiv Detail & Related papers (2022-12-30T02:31:35Z) - ProDiff: Progressive Fast Diffusion Model For High-Quality Text-to-Speech [63.780196620966905]
We propose ProDiff, a progressive fast diffusion model for high-quality text-to-speech.
ProDiff parameterizes the denoising model by directly predicting clean data to avoid distinct quality degradation in accelerating sampling.
Our evaluation demonstrates that ProDiff needs only 2 iterations to synthesize high-fidelity mel-spectrograms.
ProDiff enables a sampling speed 24x faster than real-time on a single NVIDIA 2080Ti GPU.
arXiv Detail & Related papers (2022-07-13T17:45:43Z) - FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis [90.3069686272524]
This paper proposes FastDiff, a fast conditional diffusion model for high-quality speech synthesis.
FastDiff employs a stack of time-aware location-variable convolutions of diverse receptive field patterns to efficiently model long-term time dependencies.
Based on FastDiff, we design an end-to-end text-to-speech synthesizer, FastDiff-TTS, which generates high-fidelity speech waveforms.
arXiv Detail & Related papers (2022-04-21T07:49:09Z) - FastSpeech 2: Fast and High-Quality End-to-End Text to Speech [189.05831125931053]
Non-autoregressive text to speech (TTS) models such as FastSpeech can synthesize speech significantly faster than previous autoregressive models with comparable quality.
FastSpeech has several disadvantages: 1) the teacher-student distillation pipeline is complicated and time-consuming, 2) the duration extracted from the teacher model is not accurate enough, and 3) the target mel-spectrograms distilled from the teacher model suffer from information loss.
We propose FastSpeech 2, which addresses these issues by 1) directly training the model with the ground-truth target instead of the simplified output from the teacher, and 2) introducing more variation information of speech (e.g., pitch, energy, and more accurate duration) as conditional inputs.
arXiv Detail & Related papers (2020-06-08T13:05:40Z)