Related papers: YingMusic-SVC: Real-World Robust Zero-Shot Singing Voice Conversion with Flow-GRPO and Singing-Specific Inductive Biases

URL: http://arxiv.org/abs/2512.04793v1
Date: Thu, 04 Dec 2025 13:38:50 GMT
Title: YingMusic-SVC: Real-World Robust Zero-Shot Singing Voice Conversion with Flow-GRPO and Singing-Specific Inductive Biases
Authors: Gongyu Chen, Xiaoyu Zhang, Zhenqiang Weng, Junjie Zheng, Da Shen, Chaofan Ding, Wei-Qiang Zhang, Zihao Chen
Abstract summary: Singing voice conversion aims to render the target singer's timbre while preserving melody and lyrics. Existing zero-shot SVC systems remain fragile in real songs due to harmony interference, F0 errors, and the lack of inductive biases for singing. We propose YingMusic-SVC, a robust zero-shot framework that unifies continuous pre-training, robust supervised fine-tuning, and Flow-GRPO reinforcement learning.
Score: 16.489839494462124
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Singing voice conversion (SVC) aims to render the target singer's timbre while preserving melody and lyrics. However, existing zero-shot SVC systems remain fragile in real songs due to harmony interference, F0 errors, and the lack of inductive biases for singing. We propose YingMusic-SVC, a robust zero-shot framework that unifies continuous pre-training, robust supervised fine-tuning, and Flow-GRPO reinforcement learning. Our model introduces a singing-trained RVC timbre shifter for timbre-content disentanglement, an F0-aware timbre adaptor for dynamic vocal expression, and an energy-balanced rectified flow matching loss to enhance high-frequency fidelity. Experiments on a graded multi-track benchmark show that YingMusic-SVC achieves consistent improvements over strong open-source baselines in timbre similarity, intelligibility, and perceptual naturalness, especially under accompanied and harmony-contaminated conditions, demonstrating its effectiveness for real-world SVC deployment.
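
The abstract names an energy-balanced rectified flow matching loss but does not give its formula here. As a rough illustration only, the sketch below shows a standard rectified flow training pair (a straight path from noise to data, with the velocity as the regression target) and adds a per-bin weight. The weighting rule (inverse mean energy per frequency bin), the function names, and the toy data are assumptions made for this example, not the paper's formulation.

```python
import random

def rectified_flow_pair(x1, rng):
    """Return (x_t, t, v_target) on the straight path from noise x0 to data x1."""
    t = rng.random()
    x0 = [[rng.gauss(0.0, 1.0) for _ in row] for row in x1]
    x_t = [[(1.0 - t) * a + t * b for a, b in zip(r0, r1)] for r0, r1 in zip(x0, x1)]
    v_target = [[b - a for a, b in zip(r0, r1)] for r0, r1 in zip(x0, x1)]
    return x_t, t, v_target

def bin_weights(x1, eps=1e-8):
    """Hypothetical weights: inverse mean energy per frequency bin, rescaled to mean 1."""
    inv = [1.0 / (sum(v * v for v in row) / len(row) + eps) for row in x1]
    mean_inv = sum(inv) / len(inv)
    return [w / mean_inv for w in inv]

def weighted_flow_loss(v_pred, v_target, weights):
    """Mean squared velocity error, with each bin's error scaled by its weight."""
    total, count = 0.0, 0
    for w, row_pred, row_target in zip(weights, v_pred, v_target):
        for p, q in zip(row_pred, row_target):
            total += w * (p - q) ** 2
            count += 1
    return total / count

rng = random.Random(0)
n_bins, n_frames = 8, 16
# Toy "mel" (rows = frequency bins); amplitude decays toward higher bins.
x1 = [[rng.gauss(0.0, 1.0) * (1.0 - 0.1 * b) for _ in range(n_frames)]
      for b in range(n_bins)]
x_t, t, v_target = rectified_flow_pair(x1, rng)
weights = bin_weights(x1)
v_pred = [[0.0] * n_frames for _ in range(n_bins)]  # stand-in for a network's output
loss = weighted_flow_loss(v_pred, v_target, weights)
```

With this kind of weighting, bins that carry little energy (typically the high frequencies of a mel spectrogram) contribute more to the loss than they would under a plain mean squared error. Whether YingMusic-SVC balances energy this way is not stated in the abstract.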

Related papers

R2-SVC: Towards Real-World Robust and Expressive Zero-shot Singing Voice Conversion [9.800248190122545]
R2-SVC is a robust and expressive singing voice conversion framework. We enrich speaker representation using domain-specific singing data and public singing corpora. R2-SVC achieves state-of-the-art results on multiple SVC benchmarks under both clean and noisy conditions.
arXiv Detail & Related papers (2025-10-23T15:52:03Z)
CoMelSinger: Discrete Token-Based Zero-Shot Singing Synthesis With Structured Melody Control and Guidance [6.797243060589937]
Singing Voice Synthesis (SVS) aims to generate expressive vocal performances from structured musical inputs such as lyrics and pitch sequences. We present CoMelSinger, a framework that enables structured and disentangled melody control within a discrete timbre modeling paradigm. We show that CoMelSinger achieves notable improvements in pitch accuracy, consistency, and zero-shot transferability over competitive baselines.
arXiv Detail & Related papers (2025-09-24T08:34:19Z)
DAFMSVC: One-Shot Singing Voice Conversion with Dual Attention Mechanism and Flow Matching [17.823734573531]
A key challenge in any-to-any Singing Voice Conversion is adapting unseen speaker timbres to source audio without quality degradation. We propose DAFMSVC, where the self-supervised learning (SSL) features from the source audio are replaced with the most similar SSL features from the target audio. It also incorporates a dual cross-attention mechanism for the adaptive fusion of speaker embeddings, melody, and linguistic content.
arXiv Detail & Related papers (2025-08-08T03:24:19Z)
Zero-Shot Voice Conversion via Content-Aware Timbre Ensemble and Conditional Flow Matching [7.151257248661491]
CTEFM-VC is a framework that integrates content-aware timbre ensemble modeling with conditional flow matching. Experiments show CTEFM-VC consistently achieves the best performance in all metrics assessing speaker similarity, speech naturalness, and intelligibility.
arXiv Detail & Related papers (2024-11-04T12:23:17Z)
TCSinger: Zero-Shot Singing Voice Synthesis with Style Transfer and Multi-Level Style Control [58.96445085236971]
Zero-shot singing voice synthesis (SVS) with style transfer and style control aims to generate high-quality singing voices with unseen timbres and styles. We introduce TCSinger, the first zero-shot SVS model for style transfer across cross-lingual speech and singing styles.
arXiv Detail & Related papers (2024-09-24T11:18:09Z)
StyleSinger: Style Transfer for Out-of-Domain Singing Voice Synthesis [63.18764165357298]
Style transfer for out-of-domain singing voice synthesis (SVS) focuses on generating high-quality singing voices with unseen styles. StyleSinger is the first singing voice synthesis model for zero-shot style transfer of out-of-domain reference singing voice samples. Our evaluations in zero-shot style transfer undeniably establish that StyleSinger outperforms baseline models in both audio quality and similarity to the reference singing voice samples.
arXiv Detail & Related papers (2023-12-17T15:26:16Z)
Enhancing the vocal range of single-speaker singing voice synthesis with melody-unsupervised pre-training [82.94349771571642]
This work proposes a melody-unsupervised multi-speaker pre-training method to enhance the vocal range of the single-speaker. It is the first to introduce a differentiable duration regulator to improve the rhythm naturalness of the synthesized voice. Experimental results verify that the proposed SVS system outperforms the baseline on both sound quality and naturalness.
arXiv Detail & Related papers (2023-09-01T06:40:41Z)
Robust One-Shot Singing Voice Conversion [28.707278256253385]
High-quality singing voice conversion (SVC) of unseen singers remains challenging due to the wide variety of musical expressions in pitch, loudness, and pronunciation. We present a robust one-shot SVC that performs any-to-any SVC robustly even on distorted singing voices. Experimental results show that the proposed method outperforms state-of-the-art one-shot SVC baselines for both seen and unseen singers.
arXiv Detail & Related papers (2022-10-20T08:47:35Z)
DiffSVC: A Diffusion Probabilistic Model for Singing Voice Conversion [51.83469048737548]
We propose DiffSVC, an SVC system based on a denoising diffusion probabilistic model. A denoising module is trained in DiffSVC, which takes the destroyed mel spectrogram and its corresponding step information as input to predict the added Gaussian noise. Experiments show that DiffSVC can achieve superior conversion performance in terms of naturalness and voice similarity to current state-of-the-art SVC approaches.
arXiv Detail & Related papers (2021-05-28T14:26:40Z)
DiffSinger: Diffusion Acoustic Model for Singing Voice Synthesis [53.19363127760314]
DiffSinger is a parameterized Markov chain which iteratively converts the noise into mel-spectrogram conditioned on the music score. The evaluations conducted on the Chinese singing dataset demonstrate that DiffSinger outperforms state-of-the-art SVS work with a notable margin.
arXiv Detail & Related papers (2021-05-06T05:21:42Z)
VAW-GAN for Singing Voice Conversion with Non-parallel Training Data [81.79070894458322]
We propose a singing voice conversion framework based on VAW-GAN. We train an encoder to disentangle singer identity and singing prosody (F0) from phonetic content. By conditioning on singer identity and F0, the decoder generates output spectral features with unseen target singer identity.
arXiv Detail & Related papers (2020-08-10T09:44:10Z)
