DiffSVC: A Diffusion Probabilistic Model for Singing Voice Conversion
- URL: http://arxiv.org/abs/2105.13871v1
- Date: Fri, 28 May 2021 14:26:40 GMT
- Title: DiffSVC: A Diffusion Probabilistic Model for Singing Voice Conversion
- Authors: Songxiang Liu, Yuewen Cao, Dan Su, Helen Meng
- Abstract summary: We propose DiffSVC, an SVC system based on a denoising diffusion probabilistic model.
A denoising module is trained in DiffSVC, which takes a corrupted mel spectrogram and its corresponding diffusion-step information as input to predict the added Gaussian noise.
Experiments show that DiffSVC achieves superior conversion performance in terms of naturalness and voice similarity, compared with current state-of-the-art SVC approaches.
- Score: 51.83469048737548
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Singing voice conversion (SVC) is a promising technique that can enrich
human-computer interaction by endowing a computer with the ability to produce
high-fidelity and expressive singing voice. In this paper, we propose DiffSVC,
an SVC system based on a denoising diffusion probabilistic model. DiffSVC uses
phonetic posteriorgrams (PPGs) as content features. A denoising module is
trained in DiffSVC, which takes the corrupted mel spectrogram produced by the
diffusion (forward) process and its corresponding step information as input to
predict the added Gaussian noise. We use PPGs, fundamental frequency features
and loudness features as auxiliary inputs to assist the denoising process.
Experiments show that DiffSVC achieves superior conversion performance in
terms of naturalness and voice similarity, compared with current
state-of-the-art SVC approaches.
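To make the training objective concrete, here is a minimal sketch of one DiffSVC-style training step. The noise-schedule values and the `denoiser` interface are illustrative assumptions, not the paper's exact configuration:

```python
# A minimal sketch of one DiffSVC-style training step (the noise schedule
# values and the `denoiser` interface are assumed for illustration).
import numpy as np

T = 100                               # number of diffusion steps (assumed)
betas = np.linspace(1e-4, 0.06, T)    # linear noise schedule (assumed)
alpha_bars = np.cumprod(1.0 - betas)  # cumulative \bar{alpha}_t

def training_step(mel, ppg, f0, loudness, denoiser):
    """Corrupt the mel spectrogram at a random step t, then regress
    the Gaussian noise that was added (the denoising objective)."""
    t = np.random.randint(T)
    eps = np.random.randn(*mel.shape)            # the added Gaussian noise
    a_bar = alpha_bars[t]
    noisy_mel = np.sqrt(a_bar) * mel + np.sqrt(1.0 - a_bar) * eps
    # PPG, F0 and loudness features condition the denoiser, as in the paper.
    eps_hat = denoiser(noisy_mel, t, ppg, f0, loudness)
    return np.mean((eps_hat - eps) ** 2)         # L2 loss on the noise
```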
Related papers
- VQalAttent: a Transparent Speech Generation Pipeline based on Transformer-learned VQ-VAE Latent Space [0.49109372384514843]
VQalAttent is a lightweight model designed to generate fake speech with tunable performance and interpretability.
Our results demonstrate VQalAttent's capacity to generate intelligible speech samples with limited computational resources.
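As background, the core discretization step of the VQ-VAE latent space that VQalAttent builds on can be sketched as nearest-codebook lookup; the codebook size and dimensions below are assumed for illustration:

```python
# Sketch of VQ-VAE nearest-codebook quantization (codebook size and
# dimensionality are assumed for illustration).
import numpy as np

def vector_quantize(z, codebook):
    """Map each latent vector in z (N, D) to its nearest entry of
    codebook (K, D), producing discrete tokens plus quantized latents."""
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (N, K) distances
    idx = d.argmin(axis=1)                                     # discrete tokens
    return codebook[idx], idx

codebook = np.random.randn(512, 64)   # K=512 codes of dimension 64 (assumed)
z = np.random.randn(10, 64)           # stand-in for encoder outputs
z_q, tokens = vector_quantize(z, codebook)
```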
arXiv Detail & Related papers (2024-11-22T00:21:39Z)
- FastVoiceGrad: One-step Diffusion-Based Voice Conversion with Adversarial Conditional Diffusion Distillation [28.847324588324152]
We propose FastVoiceGrad, a one-step diffusion-based VC that reduces the number of iterations from dozens to one.
FastVoiceGrad achieves performance superior or comparable to that of previous multi-step diffusion-based VC while improving inference speed.
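A minimal sketch of what one-step generation looks like, using the standard DDPM identity x0 = (x_T - sqrt(1 - a_bar_T) * eps) / sqrt(a_bar_T); this illustrates collapsing the iterations to one, not FastVoiceGrad's adversarial distillation itself:

```python
# One-step generation via the DDPM identity; `denoiser` stands in for a
# hypothetical distilled one-step noise predictor.
import numpy as np

def one_step_convert(denoiser, cond, shape, alpha_bar_T):
    x_T = np.random.randn(*shape)         # start from pure Gaussian noise
    eps_hat = denoiser(x_T, cond)         # single denoiser call
    return (x_T - np.sqrt(1.0 - alpha_bar_T) * eps_hat) / np.sqrt(alpha_bar_T)
```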
arXiv Detail & Related papers (2024-09-03T19:19:48Z)
- SPA-SVC: Self-supervised Pitch Augmentation for Singing Voice Conversion [12.454955437047573]
We propose a Self-supervised Pitch Augmentation method for Singing Voice Conversion (SPA-SVC).
We introduce a cycle pitch shifting training strategy and Structural Similarity Index (SSIM) loss into our SVC model, effectively enhancing its performance.
Experimental results on the public singing dataset M4Singer indicate that our proposed method significantly improves model performance.
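For reference, a simplified (global, unwindowed) form of the SSIM term such a loss builds on; real SSIM is computed over local windows, so this is a sketch only:

```python
# Simplified global SSIM between two spectrograms; the constants assume
# inputs normalized to [0, 1].
import numpy as np

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def ssim_loss(pred_mel, target_mel):
    return 1.0 - ssim(pred_mel, target_mel)   # higher SSIM -> lower loss
```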
arXiv Detail & Related papers (2024-06-09T08:34:01Z)
- SpecGrad: Diffusion Probabilistic Model based Neural Vocoder with Adaptive Noise Spectral Shaping [51.698273019061645]
SpecGrad adapts the diffusion noise so that its time-varying spectral envelope becomes close to the conditioning log-mel spectrogram.
The adaptation is performed in the time-frequency domain, keeping the computational cost almost the same as that of conventional DDPM-based neural vocoders.
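A rough sketch of time-frequency noise shaping in this spirit, assuming a per-bin, per-frame gain `envelope` derived elsewhere (e.g. from the conditioning log-mel); SpecGrad's actual filter design is more involved:

```python
# Filter white Gaussian noise in the STFT domain with a per-bin, per-frame
# gain `envelope` (assumed to be computed from the conditioning log-mel).
import numpy as np
from scipy.signal import stft, istft

def shape_noise(envelope, n_samples, fs=24000, nperseg=1024):
    noise = np.random.randn(n_samples)
    f, t, Z = stft(noise, fs=fs, nperseg=nperseg)
    Z_shaped = Z * envelope               # envelope shape: (len(f), len(t))
    _, shaped = istft(Z_shaped, fs=fs, nperseg=nperseg)
    return shaped[:n_samples]
```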
arXiv Detail & Related papers (2022-03-31T02:08:27Z)
- Towards High-fidelity Singing Voice Conversion with Acoustic Reference and Contrastive Predictive Coding [6.278338686038089]
Phonetic posteriorgram (PPG) based methods have been quite popular in non-parallel singing voice conversion systems.
Due to the lack of acoustic information in PPGs, the style and naturalness of the converted singing voices remain limited.
Our proposed model can significantly improve the naturalness of converted singing voices and their similarity to the target singer.
arXiv Detail & Related papers (2021-10-10T10:27:20Z)
- A Study on Speech Enhancement Based on Diffusion Probabilistic Model [63.38586161802788]
We propose DiffuSE, a diffusion probabilistic model-based speech enhancement model that aims to recover clean speech signals from noisy signals.
The experimental results show that DiffuSE yields performance that is comparable to related audio generative models on the standardized Voice Bank corpus task.
arXiv Detail & Related papers (2021-07-25T19:23:18Z)
- Diffusion-Based Representation Learning [65.55681678004038]
We augment the denoising score matching framework to enable representation learning without any supervised signal.
The introduced diffusion-based representation learning relies on a new formulation of the denoising score matching objective.
Using the same approach, we propose to learn an infinite-dimensional latent code that achieves improvements over state-of-the-art models on semi-supervised image classification.
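For orientation, the classical denoising score matching objective that such formulations build on can be sketched as follows (generic form, not the paper's new variant):

```python
# Generic denoising score matching: for Gaussian corruption of scale sigma,
# the target score of the perturbed sample is -eps / sigma.
import numpy as np

def dsm_loss(score_net, x, sigma):
    eps = np.random.randn(*x.shape)
    x_tilde = x + sigma * eps             # perturbed data point
    target = -eps / sigma                 # score of q(x_tilde | x)
    return np.mean((score_net(x_tilde, sigma) - target) ** 2)
```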
arXiv Detail & Related papers (2021-05-29T09:26:02Z)
- DiffSinger: Diffusion Acoustic Model for Singing Voice Synthesis [53.19363127760314]
DiffSinger is a parameterized Markov chain that iteratively converts noise into a mel-spectrogram conditioned on the music score.
Evaluations conducted on a Chinese singing dataset demonstrate that DiffSinger outperforms state-of-the-art SVS work by a notable margin.
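A minimal sketch of the iterative reverse process such a Markov chain implements (standard DDPM ancestral sampling; the conditioning and network are placeholders):

```python
# Standard DDPM-style ancestral sampling: iteratively denoise from Gaussian
# noise, conditioned on `cond` (e.g. a music score encoding); `denoiser`
# and the schedule are placeholders.
import numpy as np

def sample(denoiser, cond, shape, betas):
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = np.random.randn(*shape)
    for t in range(len(betas) - 1, -1, -1):
        eps_hat = denoiser(x, t, cond)
        coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
        x = (x - coef * eps_hat) / np.sqrt(alphas[t])   # posterior mean
        if t > 0:
            x += np.sqrt(betas[t]) * np.random.randn(*shape)  # add noise
    return x
```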
arXiv Detail & Related papers (2021-05-06T05:21:42Z)
- Pretraining Techniques for Sequence-to-Sequence Voice Conversion [57.65753150356411]
Sequence-to-sequence (seq2seq) voice conversion (VC) models are attractive owing to their ability to convert prosody.
We propose to transfer knowledge from other speech processing tasks where large-scale corpora are easily available, typically text-to-speech (TTS) and automatic speech recognition (ASR).
We argue that VC models with such pretrained ASR or TTS model parameters can generate effective hidden representations for high-fidelity, highly intelligible converted speech.
arXiv Detail & Related papers (2020-08-07T11:02:07Z)
- Audio-Visual Decision Fusion for WFST-based and seq2seq Models [3.2771898634434997]
Under noisy conditions, speech recognition systems suffer from high Word Error Rates (WER).
We propose novel methods to fuse information from audio and visual modalities at inference time.
We show that our methods give significant improvements over acoustic-only WER.
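A toy sketch of decision-level (late) fusion, combining per-hypothesis log-probabilities from the two recognizers with an assumed weight; the paper's WFST/seq2seq fusion is more elaborate:

```python
# Toy late fusion: weighted combination of per-hypothesis log-probabilities
# from the audio and visual recognizers; the weight `lam` is assumed.
import numpy as np

def fuse_scores(audio_logprobs, visual_logprobs, lam=0.7):
    fused = lam * np.asarray(audio_logprobs) \
            + (1.0 - lam) * np.asarray(visual_logprobs)
    return int(np.argmax(fused))          # index of the best fused hypothesis
```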
arXiv Detail & Related papers (2020-01-29T13:45:08Z)