HQ-SVC: Towards High-Quality Zero-Shot Singing Voice Conversion in Low-Resource Scenarios
- URL: http://arxiv.org/abs/2511.08496v3
- Date: Sat, 15 Nov 2025 15:29:38 GMT
- Title: HQ-SVC: Towards High-Quality Zero-Shot Singing Voice Conversion in Low-Resource Scenarios
- Authors: Bingsong Bai, Yizhong Geng, Fengping Wang, Cong Wang, Puyuan Guo, Yingming Gao, Ya Li,
- Abstract summary: HQ-SVC is an efficient framework for high-quality zero-shot singing voice conversion. HQ-SVC first jointly extracts content and speaker features using a decoupled codec. It then enhances fidelity through pitch and volume modeling, preserving critical acoustic information.
- Score: 18.036712630643205
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Zero-shot singing voice conversion (SVC) transforms a source singer's timbre to an unseen target speaker's voice while preserving melodic content, without fine-tuning. Existing methods model speaker timbre and vocal content separately, losing essential acoustic information, which degrades output quality while requiring significant computational resources. To overcome these limitations, we propose HQ-SVC, an efficient framework for high-quality zero-shot SVC. HQ-SVC first jointly extracts content and speaker features using a decoupled codec. It then enhances fidelity through pitch and volume modeling, preserving critical acoustic information typically lost in separate modeling approaches, and progressively refines outputs via differentiable signal processing and diffusion techniques. Evaluations confirm that HQ-SVC significantly outperforms state-of-the-art zero-shot SVC methods in conversion quality and efficiency. Beyond voice conversion, HQ-SVC achieves superior voice naturalness compared to specialized audio super-resolution methods while natively supporting voice super-resolution tasks.
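The pitch and volume modeling the abstract mentions can be pictured with a minimal numpy sketch. The frame sizes, the RMS definition of volume, and the naive autocorrelation F0 tracker below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def volume_envelope(wav: np.ndarray, frame: int = 1024, hop: int = 256) -> np.ndarray:
    """Frame-level RMS volume, one common choice for a loudness condition."""
    n_frames = 1 + max(0, len(wav) - frame) // hop
    rms = np.empty(n_frames)
    for i in range(n_frames):
        seg = wav[i * hop : i * hop + frame]
        rms[i] = np.sqrt(np.mean(seg ** 2))
    return rms

def f0_autocorr(frame: np.ndarray, sr: int,
                fmin: float = 65.0, fmax: float = 1000.0) -> float:
    """Toy single-frame F0 estimate: pick the autocorrelation peak
    within the plausible pitch-period range."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sr / lag
```

Real SVC systems typically use more robust trackers, but a per-frame F0 contour plus a frame RMS envelope is the general shape of the acoustic conditioning the abstract describes.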
Related papers
- DAFMSVC: One-Shot Singing Voice Conversion with Dual Attention Mechanism and Flow Matching [17.823734573531]
The key challenge in any-to-any Singing Voice Conversion is adapting unseen speaker timbres to source audio without quality degradation. We propose DAFMSVC, where the self-supervised learning (SSL) features from the source audio are replaced with the most similar SSL features from the target audio. It also incorporates a dual cross-attention mechanism for the adaptive fusion of speaker embeddings, melody, and linguistic content.
arXiv Detail & Related papers (2025-08-08T03:24:19Z)
- LHQ-SVC: Lightweight and High Quality Singing Voice Conversion Modeling [7.487807225162913]
Singing Voice Conversion (SVC) has emerged as a significant subfield of Voice Conversion (VC). Traditional SVC methods have limitations in terms of audio quality, data requirements, and computational complexity. We propose LHQ-SVC, a lightweight, CPU-compatible model based on the SVC framework and diffusion model.
arXiv Detail & Related papers (2024-09-13T07:02:36Z)
- SPA-SVC: Self-supervised Pitch Augmentation for Singing Voice Conversion [12.454955437047573]
We propose a Self-supervised Pitch Augmentation method for Singing Voice Conversion (SPA-SVC).
We introduce a cycle pitch shifting training strategy and Structural Similarity Index (SSIM) loss into our SVC model, effectively enhancing its performance.
Experimental results on the public singing dataset M4Singer indicate that our proposed method significantly improves model performance.
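The SSIM loss mentioned above can be sketched as a structural-similarity score between predicted and target spectrograms. The constants and the global-statistics (patch-free) formulation below are simplifying assumptions, not SPA-SVC's exact loss.

```python
import numpy as np

def ssim(x: np.ndarray, y: np.ndarray,
         c1: float = 1e-4, c2: float = 9e-4) -> float:
    """Global SSIM between two (normalized) spectrogram arrays."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def ssim_loss(pred: np.ndarray, target: np.ndarray) -> float:
    """Training objective: 1 - SSIM, so identical inputs give zero loss."""
    return 1.0 - ssim(pred, target)
```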
arXiv Detail & Related papers (2024-06-09T08:34:01Z)
- Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt [50.25271407721519]
We propose Prompt-Singer, the first SVS method that enables control of singer gender, vocal range, and volume with natural language. We adopt a model architecture based on a decoder-only transformer with a multi-scale hierarchy, and design a range-melody decoupled pitch representation. Experiments show that our model achieves favorable controllability and audio quality.
arXiv Detail & Related papers (2024-03-18T13:39:05Z)
- StyleSinger: Style Transfer for Out-of-Domain Singing Voice Synthesis [63.18764165357298]
Style transfer for out-of-domain singing voice synthesis (SVS) focuses on generating high-quality singing voices with unseen styles. StyleSinger is the first singing voice synthesis model for zero-shot style transfer of out-of-domain reference singing voice samples. Our evaluations in zero-shot style transfer show that StyleSinger outperforms baseline models in both audio quality and similarity to the reference samples.
arXiv Detail & Related papers (2023-12-17T15:26:16Z)
- Robust One-Shot Singing Voice Conversion [28.707278256253385]
High-quality singing voice conversion (SVC) of unseen singers remains challenging due to the wide variety of musical expressions in pitch, loudness, and pronunciation.
We present a one-shot SVC method that performs any-to-any conversion robustly, even on distorted singing voices.
Experimental results show that the proposed method outperforms state-of-the-art one-shot SVC baselines for both seen and unseen singers.
arXiv Detail & Related papers (2022-10-20T08:47:35Z)
- VCVTS: Multi-speaker Video-to-Speech synthesis via cross-modal knowledge transfer from voice conversion [77.50171525265056]
This paper proposes a novel multi-speaker Video-to-Speech (VTS) system based on cross-modal knowledge transfer from voice conversion (VC).
The Lip2Ind network can substitute the content encoder of VC to form a multi-speaker VTS system to convert silent video to acoustic units for reconstructing accurate spoken content.
arXiv Detail & Related papers (2022-02-18T08:58:45Z)
- Voicy: Zero-Shot Non-Parallel Voice Conversion in Noisy Reverberant Environments [76.98764900754111]
Voice Conversion (VC) is a technique that aims to transform the non-linguistic information of a source utterance to change the perceived identity of the speaker.
We propose Voicy, a new VC framework particularly tailored for noisy speech.
Our method, inspired by the denoising auto-encoder framework, comprises four encoders (speaker, content, phonetic, and acoustic-ASR) and one decoder.
arXiv Detail & Related papers (2021-06-16T15:47:06Z)
- DiffSVC: A Diffusion Probabilistic Model for Singing Voice Conversion [51.83469048737548]
We propose DiffSVC, an SVC system based on denoising diffusion probabilistic model.
A denoising module is trained in DiffSVC, which takes the noise-corrupted mel spectrogram and its corresponding diffusion step as input to predict the added Gaussian noise.
Experiments show that DiffSVC achieves conversion performance superior to current state-of-the-art SVC approaches in terms of naturalness and voice similarity.
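The denoising-diffusion setup that DiffSVC builds on can be sketched by its forward (noising) process, which produces the corrupted spectrogram the denoiser learns to invert. The linear beta schedule and step count below are illustrative assumptions, not DiffSVC's actual hyperparameters.

```python
import numpy as np

def make_schedule(T: int = 100, beta_min: float = 1e-4,
                  beta_max: float = 0.06) -> np.ndarray:
    """Cumulative product of (1 - beta_t) for a linear noise schedule."""
    betas = np.linspace(beta_min, beta_max, T)
    return np.cumprod(1.0 - betas)

def q_sample(x0: np.ndarray, t: int, alpha_bar: np.ndarray, rng=None):
    """Forward noising: x_t = sqrt(ab_t) * x0 + sqrt(1 - ab_t) * eps.
    Returns the noised sample and the noise the denoiser must predict."""
    if rng is None:
        rng = np.random.default_rng(0)
    eps = rng.standard_normal(x0.shape)
    a = alpha_bar[t]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps, eps
```

Training pairs (x_t, eps) like these are what the denoising module consumes: given x_t and the step t, it regresses the Gaussian noise eps.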
arXiv Detail & Related papers (2021-05-28T14:26:40Z)
- NoiseVC: Towards High Quality Zero-Shot Voice Conversion [2.3224617218247126]
NoiseVC is an approach that disentangles content based on vector quantization (VQ) and Contrastive Predictive Coding (CPC).
We conduct several experiments and demonstrate that NoiseVC has a strong disentanglement ability with a small sacrifice of quality.
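The VQ step that NoiseVC's disentanglement relies on amounts to a nearest-codebook lookup over content frames. This toy numpy version (shapes and names assumed for illustration) shows that quantization step.

```python
import numpy as np

def vq_quantize(z: np.ndarray, codebook: np.ndarray):
    """Map each content frame to its nearest codebook vector (L2 distance).

    z        : (T, D) encoder output frames
    codebook : (K, D) learned code vectors
    Returns the quantized frames and their code indices.
    """
    # Pairwise squared distances between frames and codes: (T, K)
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d.argmin(axis=1)
    return codebook[idx], idx
```

Snapping frames onto a small discrete codebook is what discards fine speaker detail and keeps mostly content, at the quality cost the experiments above mention.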
arXiv Detail & Related papers (2021-04-13T10:12:38Z)
- VQVC+: One-Shot Voice Conversion by Vector Quantization and U-Net architecture [71.45920122349628]
Auto-encoder-based VC methods disentangle the speaker and the content in input speech without being given the speaker's identity.
We use the U-Net architecture within an auto-encoder-based VC system to improve audio quality.
arXiv Detail & Related papers (2020-06-07T14:01:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.