R2-SVC: Towards Real-World Robust and Expressive Zero-shot Singing Voice Conversion
- URL: http://arxiv.org/abs/2510.20677v1
- Date: Thu, 23 Oct 2025 15:52:03 GMT
- Title: R2-SVC: Towards Real-World Robust and Expressive Zero-shot Singing Voice Conversion
- Authors: Junjie Zheng, Gongyu Chen, Chaofan Ding, Zihao Chen
- Abstract summary: R2-SVC is a robust and expressive singing voice conversion framework. We enrich speaker representation using domain-specific singing data and public singing corpora. R2-SVC achieves state-of-the-art results on multiple SVC benchmarks under both clean and noisy conditions.
- Score: 9.800248190122545
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In real-world singing voice conversion (SVC) applications, environmental noise and the demand for expressive output pose significant challenges. Conventional methods, however, are typically designed without accounting for real deployment scenarios, as both training and inference usually rely on clean data. This mismatch hinders practical use, given the inevitable presence of diverse noise sources and artifacts from music separation. To tackle these issues, we propose R2-SVC, a robust and expressive SVC framework. First, we introduce simulation-based robustness enhancement through random fundamental frequency ($F_0$) perturbations and music separation artifact simulations (e.g., reverberation, echo), substantially improving performance under noisy conditions. Second, we enrich speaker representation using domain-specific singing data: alongside clean vocals, we incorporate DNSMOS-filtered separated vocals and public singing corpora, enabling the model to preserve speaker timbre while capturing singing style nuances. Third, we integrate the Neural Source-Filter (NSF) model to explicitly represent harmonic and noise components, enhancing the naturalness and controllability of converted singing. R2-SVC achieves state-of-the-art results on multiple SVC benchmarks under both clean and noisy conditions.
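The first of the three R2-SVC components described above, simulation-based robustness enhancement, combines random F0 perturbation with simulated music-separation artifacts such as reverberation. A minimal NumPy sketch of both augmentations follows; the perturbation ranges, RT60 value, and function names are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def perturb_f0(f0, max_semitones=1.0, rng=None):
    """Randomly shift an F0 contour by a global offset plus slow jitter.

    f0: frame-level F0 values in Hz (0 marks unvoiced frames).
    The +/-1 semitone range and jitter size are hypothetical choices.
    """
    rng = np.random.default_rng() if rng is None else rng
    voiced = f0 > 0
    # Global shift in semitones, plus frame-level jitter.
    shift = rng.uniform(-max_semitones, max_semitones)
    jitter = rng.uniform(-0.25, 0.25, size=f0.shape)
    # Smooth the jitter with a short moving average so it varies slowly.
    kernel = np.ones(9) / 9.0
    jitter = np.convolve(jitter, kernel, mode="same")
    out = f0.copy()
    out[voiced] = f0[voiced] * 2.0 ** ((shift + jitter[voiced]) / 12.0)
    return out

def simulate_reverb(wav, sr=16000, rt60=0.4, rng=None):
    """Convolve a waveform with a synthetic exponentially decaying
    impulse response, mimicking reverberation left behind by music
    separation."""
    rng = np.random.default_rng() if rng is None else rng
    n = int(sr * rt60)
    t = np.arange(n) / sr
    # White noise shaped by an exponential decay reaching -60 dB at rt60.
    ir = rng.standard_normal(n) * 10.0 ** (-3.0 * t / rt60)
    ir /= np.max(np.abs(ir)) + 1e-8
    wet = np.convolve(wav, ir)[: len(wav)]
    wet /= np.max(np.abs(wet)) + 1e-8
    return wet
```

In a training pipeline, augmentations like these would be applied on the fly so the model sees a different perturbation of each utterance every epoch.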
Related papers
- YingMusic-SVC: Real-World Robust Zero-Shot Singing Voice Conversion with Flow-GRPO and Singing-Specific Inductive Biases [16.489839494462124]
Singing voice conversion aims to render the target singer's timbre while preserving melody and lyrics. Existing zero-shot SVC systems remain fragile in real songs due to harmony interference, F0 errors, and the lack of inductive biases for singing. We propose YingMusic-SVC, a robust zero-shot framework that unifies continuous pre-training, robust supervised fine-tuning, and Flow-GRPO reinforcement learning.
arXiv Detail & Related papers (2025-12-04T13:38:50Z)
- HQ-SVC: Towards High-Quality Zero-Shot Singing Voice Conversion in Low-Resource Scenarios [18.036712630643205]
HQ-SVC is an efficient framework for high-quality zero-shot singing voice conversion. HQ-SVC first jointly extracts content and speaker features using a decoupled model. It then enhances fidelity through pitch and volume modeling, preserving critical acoustic information.
arXiv Detail & Related papers (2025-11-11T17:33:30Z)
- High-Quality Sound Separation Across Diverse Categories via Visually-Guided Generative Modeling [65.02357548201188]
We propose DAVIS, a Diffusion-based Audio-VIsual Separation framework that solves the audio-visual sound source separation task through generative learning. Our framework operates by synthesizing the desired separated sound spectrograms directly from a noise distribution, conditioned concurrently on the mixed audio input and associated visual information.
arXiv Detail & Related papers (2025-09-26T08:46:00Z)
- SmoothSinger: A Conditional Diffusion Model for Singing Voice Synthesis with Multi-Resolution Architecture [3.7937714754535503]
SmoothSinger is a conditional diffusion model designed to synthesize high-quality, natural singing voices. It refines low-quality synthesized audio directly in a unified framework, mitigating the degradation associated with two-stage pipelines. Experiments on the Opencpop dataset, a large-scale Chinese singing corpus, demonstrate that SmoothSinger achieves state-of-the-art results.
arXiv Detail & Related papers (2025-06-26T17:07:45Z)
- Unleashing the Power of Natural Audio Featuring Multiple Sound Sources [54.38251699625379]
Universal sound separation aims to extract clean audio tracks corresponding to distinct events from mixed audio. We propose ClearSep, a framework that employs a data engine to decompose complex naturally mixed audio into multiple independent tracks. In experiments, ClearSep achieves state-of-the-art performance across multiple sound separation tasks.
arXiv Detail & Related papers (2025-04-24T17:58:21Z)
- SPA-SVC: Self-supervised Pitch Augmentation for Singing Voice Conversion [12.454955437047573]
We propose a Self-supervised Pitch Augmentation method for Singing Voice Conversion (SPA-SVC).
We introduce a cycle pitch shifting training strategy and a Structural Similarity Index (SSIM) loss into our SVC model, effectively enhancing its performance.
Experimental results on the public singing dataset M4Singer indicate that our proposed method significantly improves model performance.
arXiv Detail & Related papers (2024-06-09T08:34:01Z)
- StyleSinger: Style Transfer for Out-of-Domain Singing Voice Synthesis [63.18764165357298]
Style transfer for out-of-domain singing voice synthesis (SVS) focuses on generating high-quality singing voices with unseen styles. StyleSinger is the first singing voice synthesis model for zero-shot style transfer of out-of-domain reference singing voice samples. Our evaluations in zero-shot style transfer show that StyleSinger outperforms baseline models in both audio quality and similarity to the reference singing voice samples.
arXiv Detail & Related papers (2023-12-17T15:26:16Z)
- Wav2code: Restore Clean Speech Representations via Codebook Lookup for Noise-Robust ASR [35.710735895190844]
We propose a self-supervised framework named Wav2code to implement a feature-level SE with reduced distortions for noise-robust ASR.
During finetuning, we propose a Transformer-based code predictor to accurately predict clean codes by modeling the global dependency of input noisy representations.
Experiments on both synthetic and real noisy datasets demonstrate that Wav2code can solve the speech distortion and improve ASR performance under various noisy conditions.
arXiv Detail & Related papers (2023-04-11T04:46:12Z)
- Robust One-Shot Singing Voice Conversion [28.707278256253385]
High-quality singing voice conversion (SVC) of unseen singers remains challenging due to the wide variety of musical expressions in pitch, loudness, and pronunciation.
We present a robust one-shot SVC that performs any-to-any SVC robustly even on distorted singing voices.
Experimental results show that the proposed method outperforms state-of-the-art one-shot SVC baselines for both seen and unseen singers.
arXiv Detail & Related papers (2022-10-20T08:47:35Z)
- BigVGAN: A Universal Neural Vocoder with Large-Scale Training [49.16254684584935]
We present BigVGAN, a universal vocoder that generalizes well under various unseen conditions in a zero-shot setting.
We introduce periodic nonlinearities and anti-aliased representation into the generator, which brings the desired inductive bias for waveform.
We train our GAN vocoder at the largest scale to date, up to 112M parameters, which is unprecedented in the literature.
arXiv Detail & Related papers (2022-06-09T17:56:10Z)
- DiffSVC: A Diffusion Probabilistic Model for Singing Voice Conversion [51.83469048737548]
We propose DiffSVC, an SVC system based on a denoising diffusion probabilistic model.
A denoising module is trained in DiffSVC, which takes a destroyed mel spectrogram and its corresponding step information as input to predict the added Gaussian noise.
Experiments show that DiffSVC achieves superior conversion performance in terms of naturalness and voice similarity compared with current state-of-the-art SVC approaches.
arXiv Detail & Related papers (2021-05-28T14:26:40Z)
- DiffSinger: Diffusion Acoustic Model for Singing Voice Synthesis [53.19363127760314]
DiffSinger is a parameterized Markov chain which iteratively converts noise into a mel-spectrogram conditioned on the music score.
The evaluations conducted on the Chinese singing dataset demonstrate that DiffSinger outperforms state-of-the-art SVS work by a notable margin.
arXiv Detail & Related papers (2021-05-06T05:21:42Z)
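Returning to the R2-SVC abstract above, its third component integrates a Neural Source-Filter (NSF) model, whose defining trait is an explicit excitation built from harmonic sines plus noise. As illustration only (the hop size, harmonic count, and amplitudes below are assumptions, not values from the paper), a minimal harmonic-plus-noise source can be sketched in NumPy:

```python
import numpy as np

def nsf_source(f0, sr=16000, hop=160, n_harmonics=8, noise_std=0.003,
               rng=None):
    """Harmonic-plus-noise excitation in the style of neural
    source-filter models.

    f0: frame-level F0 in Hz (0 = unvoiced). The contour is upsampled
    to the sample rate, voiced regions become a sum of harmonic sines,
    and Gaussian noise is added everywhere.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Upsample frame-level F0 to per-sample values.
    f0_up = np.repeat(f0, hop).astype(np.float64)
    voiced = f0_up > 0
    # Running phase: integrate instantaneous frequency over time.
    phase = 2.0 * np.pi * np.cumsum(f0_up / sr)
    harm = np.zeros_like(f0_up)
    for k in range(1, n_harmonics + 1):
        # Skip harmonics above the Nyquist frequency.
        ok = voiced & (k * f0_up < sr / 2)
        harm[ok] += np.sin(k * phase[ok]) / n_harmonics
    noise = noise_std * rng.standard_normal(len(f0_up))
    return harm + noise
```

In a full NSF model, a learned filter network then shapes this excitation into the output waveform; the explicit harmonic/noise split is what gives the converted singing its controllability.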