SPA-SVC: Self-supervised Pitch Augmentation for Singing Voice Conversion
- URL: http://arxiv.org/abs/2406.05692v1
- Date: Sun, 9 Jun 2024 08:34:01 GMT
- Title: SPA-SVC: Self-supervised Pitch Augmentation for Singing Voice Conversion
- Authors: Bingsong Bai, Fengping Wang, Yingming Gao, Ya Li
- Abstract summary: We propose a Self-supervised Pitch Augmentation method for Singing Voice Conversion (SPA-SVC).
We introduce a cycle pitch shifting training strategy and Structural Similarity Index (SSIM) loss into our SVC model, effectively enhancing its performance.
Experimental results on the public singing dataset M4Singer indicate that our proposed method significantly improves model performance.
- Score: 12.454955437047573
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion-based singing voice conversion (SVC) models have shown better synthesis quality than traditional methods. However, in cross-domain SVC scenarios, where there is a significant disparity in pitch between the source and target voice domains, these models tend to generate audio with hoarseness, making high-quality vocal output difficult to achieve. In this paper, we therefore propose a Self-supervised Pitch Augmentation method for Singing Voice Conversion (SPA-SVC), which enhances voice quality in SVC tasks without requiring additional data or increasing model parameters. We introduce a cycle pitch shifting training strategy and a Structural Similarity Index (SSIM) loss into our SVC model, effectively enhancing its performance. Experimental results on the public singing dataset M4Singer indicate that our proposed method significantly improves model performance in both general SVC scenarios and, in particular, cross-domain SVC scenarios.
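As a rough illustration of the SSIM loss named above, here is a minimal PyTorch sketch that scores reconstructed mel-spectrograms against references; the uniform averaging window, window size, and stability constants are generic assumptions, not the paper's exact implementation.

```python
# Minimal SSIM loss on mel-spectrograms (PyTorch), using a uniform
# averaging window. Window size and constants c1/c2 are assumptions;
# SPA-SVC's exact formulation may differ.
import torch
import torch.nn.functional as F

def ssim_loss(mel_pred: torch.Tensor, mel_ref: torch.Tensor,
              window: int = 7, c1: float = 0.01 ** 2, c2: float = 0.03 ** 2) -> torch.Tensor:
    """mel_*: (batch, 1, n_mels, frames) tensors scaled to [0, 1]."""
    pad = window // 2
    mu_x = F.avg_pool2d(mel_pred, window, stride=1, padding=pad)
    mu_y = F.avg_pool2d(mel_ref, window, stride=1, padding=pad)
    var_x = F.avg_pool2d(mel_pred ** 2, window, stride=1, padding=pad) - mu_x ** 2
    var_y = F.avg_pool2d(mel_ref ** 2, window, stride=1, padding=pad) - mu_y ** 2
    cov = F.avg_pool2d(mel_pred * mel_ref, window, stride=1, padding=pad) - mu_x * mu_y
    ssim = ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
    return 1.0 - ssim.mean()  # minimizing 1 - SSIM maximizes structural similarity
```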
Related papers
- LHQ-SVC: Lightweight and High Quality Singing Voice Conversion Modeling [7.487807225162913]
Singing Voice Conversion (SVC) has emerged as a significant subfield of Voice Conversion (VC).
Traditional SVC methods have limitations in terms of audio quality, data requirements, and computational complexity.
We propose LHQ-SVC, a lightweight, CPU-compatible model based on the SVC framework and diffusion model.
arXiv Detail & Related papers (2024-09-13T07:02:36Z)
- Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer [59.57249127943914]
We present a multilingual Audio-Visual Speech Recognition model incorporating several enhancements to improve performance and audio noise robustness.
We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets.
Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching a WER of 0.8%.
arXiv Detail & Related papers (2024-03-14T01:16:32Z)
- CoMoSVC: Consistency Model-based Singing Voice Conversion [40.08004069518143]
We propose CoMoSVC, a consistency model-based Singing Voice Conversion method.
CoMoSVC aims to achieve both high-quality generation and high-speed sampling.
Experiments on a single NVIDIA RTX 4090 GPU show that CoMoSVC has a significantly faster inference speed than the state-of-the-art (SOTA) diffusion-based SVC system.
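The speed claim follows from the consistency-model property of mapping noise to data in a single network evaluation, rather than the many iterative steps a diffusion sampler needs; a hedged sketch, where `consistency_fn` and the sigma value are generic placeholders rather than CoMoSVC's actual components:

```python
# One-step consistency-model sampling: a single network call replaces the
# iterative denoising loop of a diffusion sampler. `consistency_fn` and
# sigma_max are generic placeholders, not CoMoSVC's actual components.
import torch

def one_step_sample(consistency_fn, cond: torch.Tensor,
                    shape: tuple, sigma_max: float = 80.0) -> torch.Tensor:
    x = torch.randn(shape) * sigma_max       # start from pure noise
    t = torch.full(shape[:1], sigma_max)     # boundary noise level
    return consistency_fn(x, t, cond)        # map noise directly to a mel-spectrogram
```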
arXiv Detail & Related papers (2024-01-03T15:47:17Z) - StyleSinger: Style Transfer for Out-of-Domain Singing Voice Synthesis [63.18764165357298]
Style transfer for out-of-domain singing voice synthesis (SVS) focuses on generating high-quality singing voices with unseen styles.
StyleSinger is the first singing voice synthesis model for zero-shot style transfer of out-of-domain reference singing voice samples.
Our zero-shot style transfer evaluations show that StyleSinger outperforms baseline models in both audio quality and similarity to the reference singing voice samples.
arXiv Detail & Related papers (2023-12-17T15:26:16Z) - Make-A-Voice: Unified Voice Synthesis With Discrete Representation [77.3998611565557]
Make-A-Voice is a unified framework for synthesizing and manipulating voice signals from discrete representations.
We show that Make-A-Voice exhibits superior audio quality and style similarity compared with competitive baseline models.
arXiv Detail & Related papers (2023-05-30T17:59:26Z) - Iteratively Improving Speech Recognition and Voice Conversion [10.514009693947227]
We first train an ASR model which is used to ensure content preservation while training a VC model.
In the next iteration, the VC model is used as a data augmentation method to further fine-tune the ASR model and generalize it to diverse speakers.
By iteratively leveraging the improved ASR model to train the VC model and vice versa, we experimentally show improvements in both models.
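A structural sketch of this alternating scheme follows; all three helper functions are toy placeholders standing in for the full training pipelines described in the paper:

```python
# Structural sketch of the iterative ASR/VC loop. The three helpers are toy
# placeholders for full training pipelines, not code from the paper.
from typing import List, Tuple

def train_asr(data: List[str]) -> str:
    return f"asr[{len(data)} utts]"           # placeholder ASR training

def train_vc(data: List[str], content_teacher: str) -> str:
    return f"vc[teacher={content_teacher}]"   # ASR supplies a content-preservation loss

def synthesize_variants(vc: str, data: List[str]) -> List[str]:
    return [f"{vc}:{utt}" for utt in data]    # VC used as data augmentation

def iterative_training(data: List[str], rounds: int = 3) -> Tuple[str, str]:
    asr, vc = train_asr(data), ""
    for _ in range(rounds):
        vc = train_vc(data, content_teacher=asr)                # ASR guides VC
        asr = train_asr(data + synthesize_variants(vc, data))   # VC augments ASR data
    return asr, vc
```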
arXiv Detail & Related papers (2023-05-24T11:45:42Z) - Robust One-Shot Singing Voice Conversion [28.707278256253385]
High-quality singing voice conversion (SVC) of unseen singers remains challenging due to the wide variety of musical expression in pitch, loudness, and pronunciation.
We present a one-shot SVC model that performs any-to-any conversion robustly, even on distorted singing voices.
Experimental results show that the proposed method outperforms state-of-the-art one-shot SVC baselines for both seen and unseen singers.
arXiv Detail & Related papers (2022-10-20T08:47:35Z) - SingAug: Data Augmentation for Singing Voice Synthesis with
Cycle-consistent Training Strategy [69.24683717901262]
Deep learning based singing voice synthesis (SVS) systems have been shown to flexibly generate singing voices of improved quality.
In this work, we explore different data augmentation methods to boost the training of SVS systems.
To further stabilize training, we introduce a cycle-consistent training strategy.
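One concrete form such augmentation can take is random pitch and tempo perturbation of the training audio; the librosa-based sketch below is purely illustrative, and the transform choices and ranges are assumptions rather than SingAug's actual recipe:

```python
# Illustrative pitch/tempo augmentation for singing data with librosa.
# Ranges and transform choices are assumptions, not SingAug's exact recipe;
# in practice the musical-score labels must be adjusted to match the shift.
import random
import librosa

def augment_waveform(y, sr):
    """Return a randomly pitch-shifted and time-stretched copy of y."""
    n_steps = random.uniform(-2.0, 2.0)   # semitones
    rate = random.uniform(0.9, 1.1)       # tempo factor
    y = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)
    return librosa.effects.time_stretch(y, rate=rate)

y, sr = librosa.load("sample.wav", sr=None)   # hypothetical input file
y_aug = augment_waveform(y, sr)
```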
arXiv Detail & Related papers (2022-03-31T12:50:10Z)
- DiffSVC: A Diffusion Probabilistic Model for Singing Voice Conversion [51.83469048737548]
We propose DiffSVC, an SVC system based on a denoising diffusion probabilistic model.
A denoising module is trained in DiffSVC; it takes the corrupted mel-spectrogram and its corresponding step information as input and predicts the added Gaussian noise.
Experiments show that DiffSVC achieves conversion performance superior to current state-of-the-art SVC approaches in terms of naturalness and voice similarity.
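The training objective described above is the standard denoising-diffusion recipe; a minimal sketch follows, where `denoiser`, the step count, and the beta schedule are generic placeholders rather than DiffSVC's actual configuration:

```python
# Generic DDPM-style training step: corrupt the mel-spectrogram at a random
# diffusion step, then regress the added Gaussian noise. `denoiser` is a
# placeholder network; T and the beta schedule are assumptions.
import torch
import torch.nn.functional as F

T = 100
betas = torch.linspace(1e-4, 0.06, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def diffusion_loss(denoiser, mel: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
    """mel: clean mel-spectrogram (B, n_mels, frames); cond: conditioning features."""
    b = mel.size(0)
    t = torch.randint(0, T, (b,))                  # random diffusion step per sample
    a_bar = alphas_cumprod[t].view(b, 1, 1)
    noise = torch.randn_like(mel)                  # the Gaussian noise to predict
    noisy = a_bar.sqrt() * mel + (1.0 - a_bar).sqrt() * noise   # corrupted input
    return F.mse_loss(denoiser(noisy, t, cond), noise)
```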
arXiv Detail & Related papers (2021-05-28T14:26:40Z)
- Pretraining Techniques for Sequence-to-Sequence Voice Conversion [57.65753150356411]
Sequence-to-sequence (seq2seq) voice conversion (VC) models are attractive owing to their ability to convert prosody.
We propose to transfer knowledge from other speech processing tasks where large-scale corpora are easily available, typically text-to-speech (TTS) and automatic speech recognition (ASR).
We argue that VC models with such pretrained ASR or TTS model parameters can generate effective hidden representations for high-fidelity, highly intelligible converted speech.
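In practice this transfer amounts to selectively loading pretrained parameters into the VC network; a hedged PyTorch sketch with toy modules and hypothetical checkpoint files:

```python
# Sketch of warm-starting a seq2seq VC model from pretrained ASR and TTS
# weights. The toy modules, checkpoint files, and key names are hypothetical;
# the real mapping depends on the actual architectures.
import torch
import torch.nn as nn

class Seq2SeqVC(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.LSTM(80, 256, batch_first=True)   # ASR-like encoder
        self.decoder = nn.LSTM(256, 80, batch_first=True)   # TTS-like decoder

vc = Seq2SeqVC()
asr_ckpt = torch.load("asr_pretrained.pt")   # hypothetical checkpoint files
tts_ckpt = torch.load("tts_pretrained.pt")
# strict=False tolerates parameters that have no counterpart in the VC model
vc.encoder.load_state_dict(asr_ckpt["encoder"], strict=False)
vc.decoder.load_state_dict(tts_ckpt["decoder"], strict=False)
```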
arXiv Detail & Related papers (2020-08-07T11:02:07Z)