CoMoSVC: Consistency Model-based Singing Voice Conversion
- URL: http://arxiv.org/abs/2401.01792v1
- Date: Wed, 3 Jan 2024 15:47:17 GMT
- Title: CoMoSVC: Consistency Model-based Singing Voice Conversion
- Authors: Yiwen Lu, Zhen Ye, Wei Xue, Xu Tan, Qifeng Liu, Yike Guo
- Abstract summary: We propose CoMoSVC, a consistency model-based Singing Voice Conversion method.
CoMoSVC aims to achieve both high-quality generation and high-speed sampling.
Experiments on a single NVIDIA RTX 4090 GPU reveal that CoMoSVC achieves a significantly faster inference speed than the state-of-the-art (SOTA) diffusion-based SVC system.
- Score: 40.08004069518143
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The diffusion-based Singing Voice Conversion (SVC) methods have achieved
remarkable performance, producing natural audio with high similarity to the
target timbre. However, the iterative sampling process results in slow
inference speed, and acceleration thus becomes crucial. In this paper, we
propose CoMoSVC, a consistency model-based SVC method, which aims to achieve
both high-quality generation and high-speed sampling. A diffusion-based teacher
model is first specially designed for SVC, and a student model is further
distilled under self-consistency properties to achieve one-step sampling.
Experiments on a single NVIDIA RTX 4090 GPU reveal that although CoMoSVC has a
significantly faster inference speed than the state-of-the-art (SOTA)
diffusion-based SVC system, it still achieves comparable or superior conversion
performance based on both subjective and objective metrics. Audio samples and
code are available at https://comosvc.github.io/.
Related papers
- LHQ-SVC: Lightweight and High Quality Singing Voice Conversion Modeling [7.487807225162913]
Singing Voice Conversion (SVC) has emerged as a significant subfield of Voice Conversion (VC).
Traditional SVC methods have limitations in terms of audio quality, data requirements, and computational complexity.
We propose LHQ-SVC, a lightweight, CPU-compatible model based on the SVC framework and diffusion model.
arXiv Detail & Related papers (2024-09-13T07:02:36Z) - FastVoiceGrad: One-step Diffusion-Based Voice Conversion with Adversarial Conditional Diffusion Distillation [28.847324588324152]
We propose FastVoiceGrad, a one-step diffusion-based VC that reduces the number of iterations from dozens to one.
FastVoiceGrad achieves conversion quality superior or comparable to that of previous multi-step diffusion-based VC while improving inference speed (a distillation sketch follows this entry).
arXiv Detail & Related papers (2024-09-03T19:19:48Z) - SPA-SVC: Self-supervised Pitch Augmentation for Singing Voice Conversion [12.454955437047573]
- SPA-SVC: Self-supervised Pitch Augmentation for Singing Voice Conversion [12.454955437047573]
We propose a Self-supervised Pitch Augmentation method for Singing Voice Conversion (SPA-SVC).
We introduce a cycle pitch shifting training strategy and a Structural Similarity Index (SSIM) loss into our SVC model, effectively enhancing its performance (an SSIM sketch follows this entry).
Experimental results on the public singing dataset M4Singer indicate that our proposed method significantly improves model performance.
arXiv Detail & Related papers (2024-06-09T08:34:01Z) - SF-V: Single Forward Video Generation Model [57.292575082410785]
- SF-V: Single Forward Video Generation Model [57.292575082410785]
We propose a novel approach to obtain single-step video generation models by leveraging adversarial training to fine-tune pre-trained models.
Experiments demonstrate that our method achieves competitive generation quality for synthesized videos with significantly reduced computational overhead.
arXiv Detail & Related papers (2024-06-06T17:58:27Z) - Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching [51.70360630470263]
Video-to-audio (V2A) generation aims to synthesize content-matching audio from silent video.
We propose Frieren, a V2A model based on rectified flow matching (sketched after this entry).
Experiments indicate that Frieren achieves state-of-the-art performance in both generation quality and temporal alignment.
arXiv Detail & Related papers (2024-06-01T06:40:22Z) - CoMoSpeech: One-Step Speech and Singing Voice Synthesis via Consistency
- CoMoSpeech: One-Step Speech and Singing Voice Synthesis via Consistency Model [41.21042900853639]
We propose a "Co"nsistency "Mo"del-based "Speech" synthesis method, CoMoSpeech, which achieve speech synthesis through a single diffusion sampling step.
By generating audio recordings in a single sampling step, CoMoSpeech achieves an inference speed more than 150 times faster than real time.
arXiv Detail & Related papers (2023-05-11T15:51:46Z) - ProDiff: Progressive Fast Diffusion Model For High-Quality
Text-to-Speech [63.780196620966905]
We propose ProDiff, a progressive fast diffusion model for high-quality text-to-speech.
ProDiff parameterizes the denoising model by directly predicting clean data, avoiding the pronounced quality degradation that occurs when sampling is accelerated (this parameterization is sketched after this entry).
Our evaluation demonstrates that ProDiff needs only 2 iterations to synthesize high-fidelity mel-spectrograms.
ProDiff enables a sampling speed 24x faster than real time on a single NVIDIA 2080Ti GPU.
arXiv Detail & Related papers (2022-07-13T17:45:43Z) - FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech
- FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis [90.3069686272524]
This paper proposes FastDiff, a fast conditional diffusion model for high-quality speech synthesis.
FastDiff employs a stack of time-aware location-variable convolutions of diverse receptive field patterns to efficiently model long-term time dependencies.
Based on FastDiff, we design an end-to-end text-to-speech synthesizer, FastDiff-TTS, which generates high-fidelity speech waveforms.
arXiv Detail & Related papers (2022-04-21T07:49:09Z) - DiffSVC: A Diffusion Probabilistic Model for Singing Voice Conversion [51.83469048737548]
We propose DiffSVC, an SVC system based on a denoising diffusion probabilistic model.
A denoising module is trained in DiffSVC, taking the destroyed (noised) mel-spectrogram and its corresponding step information as input to predict the added Gaussian noise (sketched below).
Experiments show that DiffSVC achieves conversion performance superior to current state-of-the-art SVC approaches in terms of naturalness and voice similarity.
arXiv Detail & Related papers (2021-05-28T14:26:40Z)