Robust One-Shot Singing Voice Conversion
- URL: http://arxiv.org/abs/2210.11096v2
- Date: Fri, 6 Oct 2023 16:18:32 GMT
- Title: Robust One-Shot Singing Voice Conversion
- Authors: Naoya Takahashi, Mayank Kumar Singh, Yuki Mitsufuji
- Abstract summary: High-quality singing voice conversion (SVC) of unseen singers remains challenging due to the wide variety of musical expressions in pitch, loudness, and pronunciation.
We present a robust one-shot SVC that performs any-to-any SVC robustly even on distorted singing voices.
Experimental results show that the proposed method outperforms state-of-the-art one-shot SVC baselines for both seen and unseen singers.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent progress in deep generative models has improved the quality of voice
conversion in the speech domain. However, high-quality singing voice conversion
(SVC) of unseen singers remains challenging due to the wider variety of musical
expressions in pitch, loudness, and pronunciation. Moreover, singing voices are
often recorded with reverb and accompaniment music, which make SVC even more
challenging. In this work, we present a robust one-shot SVC (ROSVC) that
performs any-to-any SVC robustly even on such distorted singing voices. To this
end, we first propose a one-shot SVC model based on generative adversarial
networks that generalizes to unseen singers via partial domain conditioning and
learns to accurately recover the target pitch via pitch distribution matching
and AdaIN-skip conditioning. We then propose a two-stage training method called
Robustify that trains the one-shot SVC model in the first stage on clean data to
ensure high-quality conversion, and introduces enhancement modules to the
encoders of the model in the second stage to enhance the feature extraction
from distorted singing voices. To further improve the voice quality and pitch
reconstruction accuracy, we finally propose a hierarchical diffusion model for
singing voice neural vocoders. Experimental results show that the proposed
method outperforms state-of-the-art one-shot SVC baselines for both seen and
unseen singers and significantly improves the robustness against distortions.
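The AdaIN-skip conditioning the abstract mentions builds on adaptive instance normalization, which injects the target singer's identity by re-scaling normalized content features. The following is a minimal sketch of that general mechanism only; the shapes, names, and the choice of per-channel statistics are assumptions, not the authors' implementation:

```python
import numpy as np

def adain(content, style, eps=1e-5):
    """Adaptive instance normalization: whiten each channel of the content
    features over time, then re-scale and shift them with parameters derived
    from the reference-singer embedding."""
    # content: (channels, time); style: (2 * channels,) -> per-channel gamma, beta
    mean = content.mean(axis=-1, keepdims=True)
    std = content.std(axis=-1, keepdims=True) + eps
    normalized = (content - mean) / std
    gamma, beta = np.split(style, 2)
    return normalized * gamma[:, None] + beta[:, None]

content = np.random.default_rng(0).standard_normal((80, 100))  # mel-like features
style = np.concatenate([np.full(80, 2.0), np.full(80, 0.5)])   # gamma=2.0, beta=0.5
out = adain(content, style)
```

After this operation each channel of `out` carries the scale and offset dictated by the style vector, which is how a single reference utterance can steer the converted voice.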
Related papers
- SPA-SVC: Self-supervised Pitch Augmentation for Singing Voice Conversion [12.454955437047573]
We propose a Self-supervised Pitch Augmentation method for Singing Voice Conversion (SPA-SVC)
We introduce a cycle pitch shifting training strategy and Structural Similarity Index (SSIM) loss into our SVC model, effectively enhancing its performance.
Experimental results on the public singing dataset M4Singer indicate that our proposed method significantly improves model performance.
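The SSIM loss used by SPA-SVC compares the structure of two spectrograms rather than their pointwise difference. A minimal single-window sketch of the SSIM score is below (the standard formulation uses local sliding windows; the constants and global statistics here are simplifying assumptions, not the paper's exact loss):

```python
import numpy as np

def ssim_global(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Single-window SSIM between two spectrogram-like arrays scaled to
    [0, 1]; 1.0 means identical structure, lower means less similar."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

rng = np.random.default_rng(0)
mel = rng.uniform(size=(80, 200))                               # stand-in mel spectrogram
noisy = np.clip(mel + 0.1 * rng.standard_normal(mel.shape), 0, 1)
score_same = ssim_global(mel, mel)      # identical inputs -> 1.0
score_noisy = ssim_global(mel, noisy)   # degraded input -> lower score
```

As a training loss one would minimize `1 - ssim`, rewarding converted spectrograms that preserve the structure of the reference.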
arXiv Detail & Related papers (2024-06-09T08:34:01Z)
- StyleSinger: Style Transfer for Out-of-Domain Singing Voice Synthesis [63.18764165357298]
Style transfer for out-of-domain singing voice synthesis (SVS) focuses on generating high-quality singing voices with unseen styles.
StyleSinger is the first singing voice synthesis model for zero-shot style transfer of out-of-domain reference singing voice samples.
Our evaluations in zero-shot style transfer establish that StyleSinger outperforms baseline models in both audio quality and similarity to the reference singing voice samples.
arXiv Detail & Related papers (2023-12-17T15:26:16Z)
- Enhancing the vocal range of single-speaker singing voice synthesis with melody-unsupervised pre-training [82.94349771571642]
This work proposes a melody-unsupervised multi-speaker pre-training method to enhance the vocal range of a single-speaker SVS model.
It is the first to introduce a differentiable duration regulator to improve the rhythm naturalness of the synthesized voice.
Experimental results verify that the proposed SVS system outperforms the baseline on both sound quality and naturalness.
arXiv Detail & Related papers (2023-09-01T06:40:41Z)
- Towards Improving the Expressiveness of Singing Voice Synthesis with BERT Derived Semantic Information [51.02264447897833]
This paper presents an end-to-end high-quality singing voice synthesis (SVS) system that uses bidirectional encoder representation from Transformers (BERT) derived semantic embeddings.
The proposed SVS system can produce singing voice with higher-quality outperforming VISinger.
arXiv Detail & Related papers (2023-08-31T16:12:01Z)
- A Comparative Analysis Of Latent Regressor Losses For Singing Voice Conversion [15.691936529849539]
A singer identity embedding (SIE) network is trained on mel-spectrograms of singer recordings to produce singer-specific variance encodings.
We propose a pitch-matching mechanism between source and target singers to ensure these evaluations are not influenced by differences in pitch register.
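A pitch-matching mechanism of this kind can be sketched as transposing the source F0 contour so its central tendency lands in the target singer's register. This is a minimal illustration under assumed conventions (median-based matching, F0 in Hz, 0 marking unvoiced frames), not the paper's exact procedure:

```python
import numpy as np

def match_pitch_register(source_f0, target_f0):
    """Transpose the source F0 contour (Hz) by a whole number of cents so
    its median matches the target singer's median pitch; unvoiced frames
    (0 Hz) stay unvoiced."""
    shift = 12 * np.log2(np.median(target_f0[target_f0 > 0]) /
                         np.median(source_f0[source_f0 > 0]))
    shifted = np.where(source_f0 > 0, source_f0 * 2 ** (shift / 12), 0.0)
    return shifted, shift

src = np.array([0.0, 196.0, 200.0, 204.0, 0.0])   # low-register source contour
tgt = np.array([392.0, 400.0, 408.0])             # target roughly an octave higher
shifted, shift = match_pitch_register(src, tgt)
```

With a source median of 200 Hz and a target median of 400 Hz, the shift is exactly 12 semitones, so the voiced frames are doubled in frequency and the evaluation no longer penalizes the register gap itself.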
arXiv Detail & Related papers (2023-02-27T11:26:57Z)
- Learning the Beauty in Songs: Neural Singing Voice Beautifier [69.21263011242907]
We are interested in a novel task, singing voice beautifying (SVB).
Given the singing voice of an amateur singer, SVB aims to improve the intonation and vocal tone of the voice, while keeping the content and vocal timbre.
We introduce Neural Singing Voice Beautifier (NSVB), the first generative model to solve the SVB task.
arXiv Detail & Related papers (2022-02-27T03:10:12Z)
- Towards High-fidelity Singing Voice Conversion with Acoustic Reference and Contrastive Predictive Coding [6.278338686038089]
Phonetic posteriorgram (PPG) based methods have been quite popular in non-parallel singing voice conversion systems.
Due to the lack of acoustic information in PPGs, style and naturalness of the converted singing voices are still limited.
Our proposed model can significantly improve the naturalness of converted singing voices and the similarity with the target singer.
arXiv Detail & Related papers (2021-10-10T10:27:20Z)
- DiffSVC: A Diffusion Probabilistic Model for Singing Voice Conversion [51.83469048737548]
We propose DiffSVC, an SVC system based on denoising diffusion probabilistic model.
A denoising module is trained in DiffSVC, which takes destroyed mel spectrogram and its corresponding step information as input to predict the added Gaussian noise.
Experiments show that DiffSVC can achieve superior conversion performance in terms of naturalness and voice similarity to current state-of-the-art SVC approaches.
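The "destroyed mel spectrogram" in the DiffSVC summary refers to the standard forward diffusion step, where the denoiser is trained to predict the Gaussian noise that was added. A minimal sketch of that forward process follows (the schedule values and step count are generic DDPM assumptions, not DiffSVC's configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Standard DDPM linear beta schedule and cumulative signal coefficients.
T = 100
betas = np.linspace(1e-4, 0.02, T)
alphas_cum = np.cumprod(1.0 - betas)

def diffuse(mel, t):
    """Produce the 'destroyed' mel spectrogram at step t:
    x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps.
    Returns x_t and the Gaussian noise eps the denoising module learns to predict."""
    eps = rng.standard_normal(mel.shape)
    a = alphas_cum[t]
    return np.sqrt(a) * mel + np.sqrt(1.0 - a) * eps, eps

mel = rng.standard_normal((80, 200))   # stand-in for a clean mel spectrogram
noisy, eps = diffuse(mel, t=50)
# A denoiser eps_theta(noisy, t, conditioning) would then be trained with the
# MSE objective || eps_theta - eps ||^2; step information t enters as input.
```

Because the forward step is a known affine corruption, a perfect noise prediction lets the sampler invert it step by step back to a clean spectrogram.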
arXiv Detail & Related papers (2021-05-28T14:26:40Z)
- PPG-based singing voice conversion with adversarial representation learning [18.937609682084034]
Singing voice conversion aims to convert the voice of one singer to that of other singers while keeping the singing content and melody.
We build an end-to-end architecture, taking posteriorgrams as inputs and generating mel spectrograms.
Our methods can significantly improve the conversion performance in terms of naturalness, melody, and voice similarity.
arXiv Detail & Related papers (2020-10-28T08:03:27Z)
- Unsupervised Cross-Domain Singing Voice Conversion [105.1021715879586]
We present a wav-to-wav generative model for the task of singing voice conversion from any identity.
Our method combines an acoustic model trained for automatic speech recognition with melody-derived features to drive a waveform-based generator.
arXiv Detail & Related papers (2020-08-06T18:29:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information and is not responsible for any consequences.