SaMoye: Zero-shot Singing Voice Conversion Model Based on Feature Disentanglement and Enhancement
- URL: http://arxiv.org/abs/2407.07728v4
- Date: Thu, 17 Oct 2024 09:54:06 GMT
- Title: SaMoye: Zero-shot Singing Voice Conversion Model Based on Feature Disentanglement and Enhancement
- Authors: Zihao Wang, Le Ma, Yongsheng Feng, Xin Pan, Yuhang Jin, Kejun Zhang,
- Abstract summary: Singing voice conversion (SVC) aims to convert a singer's voice to another singer's from a reference audio while keeping the original semantics.
We propose the first open-source high-quality zero-shot SVC model SaMoye that can convert singing to human and non-human timbre.
- Score: 14.890331617779546
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Singing voice conversion (SVC) aims to convert a singer's voice to another singer's from a reference audio while keeping the original semantics. However, existing SVC methods can hardly perform zero-shot due to incomplete feature disentanglement or dependence on the speaker look-up table. We propose the first open-source high-quality zero-shot SVC model SaMoye that can convert singing to human and non-human timbre. SaMoye disentangles the singing voice's features into content, timbre, and pitch features, where we combine multiple ASR models and compress the content features to reduce timbre leaks. Besides, we enhance the timbre features by unfreezing the speaker encoder and mixing the speaker embedding with top-3 similar speakers. We also establish an unparalleled large-scale dataset to guarantee zero-shot performance, which comprises more than 1,815 hours of pure singing voice and 6,367 speakers. We conduct objective and subjective experiments to find that SaMoye outperforms other models in zero-shot SVC tasks even under extreme conditions like converting singing to animals' timbre. The code and weight of SaMoye are available on https://github.com/CarlWangChina/SaMoye-SVC. The weights, code, dataset, and documents of SaMoye are publicly available on \url{https://github.com/CarlWangChina/SaMoye-SVC}.
Related papers
- TCSinger: Zero-Shot Singing Voice Synthesis with Style Transfer and Multi-Level Style Control [58.96445085236971]
Zero-shot singing voice synthesis (SVS) with style transfer and style control aims to generate high-quality singing voices with unseen timbres and styles.
We introduce TCSinger, the first zero-shot SVS model for style transfer across cross-lingual speech and singing styles.
We show that TCSinger outperforms all baseline models in quality synthesis, singer similarity, and style controllability across various tasks.
arXiv Detail & Related papers (2024-09-24T11:18:09Z) - Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt [50.25271407721519]
We propose Prompt-Singer, the first SVS method that enables attribute controlling on singer gender, vocal range and volume with natural language.
We adopt a model architecture based on a decoder-only transformer with a multi-scale hierarchy, and design a range-melody decoupled pitch representation.
Experiments show that our model achieves favorable controlling ability and audio quality.
arXiv Detail & Related papers (2024-03-18T13:39:05Z) - StyleSinger: Style Transfer for Out-of-Domain Singing Voice Synthesis [63.18764165357298]
Style transfer for out-of-domain singing voice synthesis (SVS) focuses on generating high-quality singing voices with unseen styles.
StyleSinger is the first singing voice synthesis model for zero-shot style transfer of out-of-domain reference singing voice samples.
Our evaluations in zero-shot style transfer undeniably establish that StyleSinger outperforms baseline models in both audio quality and similarity to the reference singing voice samples.
arXiv Detail & Related papers (2023-12-17T15:26:16Z) - Robust One-Shot Singing Voice Conversion [28.707278256253385]
High-quality singing voice conversion (SVC) of unseen singers remains challenging due to wide variety of musical expressions in pitch, loudness, and pronunciation.
We present a robust one-shot SVC that performs any-to-any SVC robustly even on distorted singing voices.
Experimental results show that the proposed method outperforms state-of-the-art one-shot SVC baselines for both seen and unseen singers.
arXiv Detail & Related papers (2022-10-20T08:47:35Z) - Learning the Beauty in Songs: Neural Singing Voice Beautifier [69.21263011242907]
We are interested in a novel task, singing voice beautifying (SVB)
Given the singing voice of an amateur singer, SVB aims to improve the intonation and vocal tone of the voice, while keeping the content and vocal timbre.
We introduce Neural Singing Voice Beautifier (NSVB), the first generative model to solve the SVB task.
arXiv Detail & Related papers (2022-02-27T03:10:12Z) - PPG-based singing voice conversion with adversarial representation
learning [18.937609682084034]
Singing voice conversion aims to convert the voice of one singer to that of other singers while keeping the singing content and melody.
We build an end-to-end architecture, taking posteriorgrams as inputs and generating mel spectrograms.
Our methods can significantly improve the conversion performance in terms of naturalness, melody, and voice similarity.
arXiv Detail & Related papers (2020-10-28T08:03:27Z) - DeepSinger: Singing Voice Synthesis with Data Mined From the Web [194.10598657846145]
DeepSinger is a multi-lingual singing voice synthesis system built from scratch using singing training data mined from music websites.
We evaluate DeepSinger on our mined singing dataset that consists of about 92 hours data from 89 singers on three languages.
arXiv Detail & Related papers (2020-07-09T07:00:48Z) - Adversarially Trained Multi-Singer Sequence-To-Sequence Singing
Synthesizer [11.598416444452619]
We design a multi-singer framework to leverage all the existing singing data of different singers.
We incorporate an adversarial task of singer classification to make encoder output less singer dependent.
The proposed synthesizer can generate higher quality singing voice than baseline.
arXiv Detail & Related papers (2020-06-18T07:20:11Z) - Jukebox: A Generative Model for Music [75.242747436901]
Jukebox is a model that generates music with singing in the raw audio domain.
We tackle the long context of raw audio using a multi-scale VQ-VAE to compress it to discrete codes.
We show that the combined model at scale can generate high-fidelity and diverse songs with coherence up to multiple minutes.
arXiv Detail & Related papers (2020-04-30T09:02:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.