Unsupervised Cross-Domain Singing Voice Conversion
- URL: http://arxiv.org/abs/2008.02830v1
- Date: Thu, 6 Aug 2020 18:29:11 GMT
- Title: Unsupervised Cross-Domain Singing Voice Conversion
- Authors: Adam Polyak, Lior Wolf, Yossi Adi, Yaniv Taigman
- Abstract summary: We present a wav-to-wav generative model for the task of singing voice conversion from any identity.
Our method uses an acoustic model trained for automatic speech recognition, together with features extracted from the melody, to drive a waveform-based generator.
- Score: 105.1021715879586
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a wav-to-wav generative model for the task of singing voice
conversion from any identity. Our method uses an acoustic model trained for
automatic speech recognition, together with features extracted from the melody,
to drive a waveform-based generator. The proposed generative architecture is
invariant to the speaker's identity and can be trained to generate target
singers from unlabeled training data, using either speech or singing sources.
The model is optimized in an end-to-end fashion without any manual supervision,
such as lyrics, musical notes, or parallel samples. The proposed approach is
fully convolutional and can generate audio in real time. Experiments show that
our method significantly outperforms the baseline methods while generating
convincingly better audio samples than alternative approaches.
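As a rough illustration of the conditioning scheme the abstract describes, the sketch below concatenates identity-invariant features from a pretrained ASR acoustic model with a melody (F0) contour and a target-singer embedding, and feeds them to a small fully-convolutional generator. This is a minimal PyTorch sketch, not the authors' code; all module names and dimensions are illustrative assumptions.

```python
# Minimal sketch of the described conditioning: ASR-derived linguistic
# features + melody (F0) features + target-singer embedding drive a
# fully-convolutional generator. Shapes and modules are assumptions.
import torch
import torch.nn as nn

class ConvGenerator(nn.Module):
    """Toy stand-in for the fully-convolutional waveform generator
    (upsampling from frame rate to sample rate is omitted for brevity)."""
    def __init__(self, cond_dim: int, n_singers: int, emb_dim: int = 64):
        super().__init__()
        self.singer_emb = nn.Embedding(n_singers, emb_dim)  # target-singer lookup
        self.net = nn.Sequential(
            nn.Conv1d(cond_dim + emb_dim, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(128, 1, kernel_size=3, padding=1),  # 1 output channel
        )

    def forward(self, linguistic, f0, singer_id):
        # linguistic: (B, T, D) frames from a pretrained ASR acoustic model
        # f0:         (B, T, 1) melody features (e.g. log-F0 per frame)
        # singer_id:  (B,) index of the target singer
        cond = torch.cat([linguistic, f0], dim=-1)          # (B, T, D+1)
        emb = self.singer_emb(singer_id)                    # (B, emb_dim)
        emb = emb[:, None, :].expand(-1, cond.size(1), -1)  # broadcast over time
        x = torch.cat([cond, emb], dim=-1).transpose(1, 2)  # (B, C, T)
        return self.net(x).squeeze(1)                       # (B, T) frame-rate output

# Usage with random stand-in features:
gen = ConvGenerator(cond_dim=257, n_singers=5)
ling = torch.randn(2, 100, 256)   # speaker-invariant ASR features
f0 = torch.randn(2, 100, 1)       # extracted melody contour
out = gen(ling, f0, torch.tensor([0, 3]))
```

Because the linguistic features come from a speaker-invariant ASR model and the melody is extracted separately, swapping the singer embedding is the only knob that changes identity, which is what makes unlabeled training data sufficient.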
Related papers
- Combining audio control and style transfer using latent diffusion [1.705371629600151]
In this paper, we aim to unify explicit control and style transfer within a single model.
Our model can generate audio matching a timbre target, while specifying structure either with explicit controls or through another audio example.
We show that our method can generate cover versions of complete musical pieces by transferring rhythmic and melodic content to the style of a target audio in a different genre.
arXiv Detail & Related papers (2024-07-31T23:27:27Z)
- Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt [50.25271407721519]
We propose Prompt-Singer, the first SVS method that enables control of singer gender, vocal range, and volume through natural language prompts.
We adopt a model architecture based on a decoder-only transformer with a multi-scale hierarchy, and design a range-melody decoupled pitch representation.
Experiments show that our model achieves favorable controlling ability and audio quality.
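The entry does not spell out the "range-melody decoupled pitch representation", but one plausible reading is splitting a log-F0 contour into a coarse range statistic and a range-invariant residual, so that a prompt can move the range while the melody stays fixed. The sketch below is a hedged NumPy illustration of that idea, not Prompt-Singer's actual representation; the function names and the voicing convention are assumptions.

```python
# Illustrative decoupling of vocal range from melody: a scalar range
# statistic plus a range-invariant log-F0 residual. Toy convention:
# unvoiced frames are encoded as 0 in both f0 and the residual.
import numpy as np

def decouple_pitch(f0_hz: np.ndarray):
    voiced = f0_hz > 0
    log_f0 = np.zeros_like(f0_hz)
    log_f0[voiced] = np.log(f0_hz[voiced])
    range_stat = log_f0[voiced].mean()                    # crude "vocal range" scalar
    melody = np.where(voiced, log_f0 - range_stat, 0.0)   # range-invariant contour
    return range_stat, melody

def recouple_pitch(range_stat: float, melody: np.ndarray):
    voiced = melody != 0.0                                 # toy voicing convention
    return np.where(voiced, np.exp(melody + range_stat), 0.0)

f0 = np.array([0.0, 220.0, 247.0, 262.0, 0.0])   # toy contour with unvoiced frames
rng, mel = decouple_pitch(f0)
shifted = recouple_pitch(rng + np.log(2), mel)    # same melody, one octave higher
```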
arXiv Detail & Related papers (2024-03-18T13:39:05Z)
- Bass Accompaniment Generation via Latent Diffusion [0.0]
We present a controllable system for generating single stems to accompany musical mixes of arbitrary length.
At the core of our method are audio autoencoders that efficiently compress audio waveform samples into invertible latent representations.
Our controllable conditional audio generation framework represents a significant step forward in creating generative AI tools to assist musicians in music production.
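The core component named here, an audio autoencoder with invertible latents, can be sketched as a 1-D convolutional encoder that compresses raw samples into a shorter latent sequence and a transposed-convolution decoder that inverts it. This is a minimal sketch under assumptions, not the paper's architecture; the diffusion model (not shown) would operate in this latent space.

```python
# Minimal waveform autoencoder sketch: 64x temporal compression into a
# compact latent sequence, plus a decoder that maps latents back to audio.
import torch
import torch.nn as nn

class WaveformAutoencoder(nn.Module):
    def __init__(self, latent_dim: int = 32, stride: int = 64):
        super().__init__()
        # one latent vector per `stride` waveform samples
        self.encoder = nn.Conv1d(1, latent_dim, kernel_size=stride, stride=stride)
        self.decoder = nn.ConvTranspose1d(latent_dim, 1, kernel_size=stride, stride=stride)

    def forward(self, wav):              # wav: (B, 1, T), T divisible by stride
        z = self.encoder(wav)            # (B, latent_dim, T/64) compact latent
        return self.decoder(z), z        # reconstruction and latent

ae = WaveformAutoencoder()
wav = torch.randn(1, 1, 64 * 100)        # audio covering ~100 latent frames
recon, z = ae(wav)                        # train with e.g. an L1 or spectral loss
```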
arXiv Detail & Related papers (2024-02-02T13:44:47Z)
- Singer Identity Representation Learning using Self-Supervised Techniques [0.0]
We propose a framework for training singer identity encoders to extract representations suitable for various singing-related tasks.
We explore different self-supervised learning techniques on a large collection of isolated vocal tracks.
We evaluate the quality of the resulting representations on singer similarity and identification tasks.
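One common self-supervised recipe for this kind of identity encoder is contrastive learning, where two excerpts of the same isolated vocal track form a positive pair and excerpts from other tracks act as negatives. The sketch below shows a simplified one-directional NT-Xent loss in PyTorch; it is an assumption about the setup, not the paper's code.

```python
# Simplified contrastive (NT-Xent-style) loss for singer identity:
# matching rows of the similarity matrix are same-track positive pairs.
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature: float = 0.1):
    # z1, z2: (B, D) embeddings of paired excerpts from the same tracks
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature     # (B, B) cosine similarities
    labels = torch.arange(z1.size(0))      # diagonal entries are positives
    return F.cross_entropy(logits, labels)

# Usage with stand-in encoder outputs:
z_a, z_b = torch.randn(8, 128), torch.randn(8, 128)
loss = nt_xent(z_a, z_b)    # pulls same-singer excerpts together
```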
arXiv Detail & Related papers (2024-01-10T10:41:38Z)
- Enhancing the vocal range of single-speaker singing voice synthesis with melody-unsupervised pre-training [82.94349771571642]
This work proposes a melody-unsupervised multi-speaker pre-training method to enhance the vocal range of a single-speaker SVS system.
It is the first to introduce a differentiable duration regulator to improve the rhythm naturalness of the synthesized voice.
Experimental results verify that the proposed SVS system outperforms the baseline on both sound quality and naturalness.
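The entry does not detail its differentiable duration regulator, so the sketch below shows one known differentiable alternative to hard length regulation, Gaussian upsampling (as used in Non-Attentive Tacotron): each output frame is a soft mixture over phoneme encodings, so gradients can flow through predicted durations. Treat it as a hedged stand-in, not this paper's regulator.

```python
# Gaussian upsampling: soft, differentiable expansion of N phoneme
# encodings into T frames, weighted by distance to each segment center.
import torch

def gaussian_upsample(h, durations, sigma: float = 1.0):
    # h: (B, N, D) phoneme encodings; durations: (B, N) predicted frame counts
    ends = torch.cumsum(durations, dim=1)            # (B, N) segment end times
    centers = ends - 0.5 * durations                 # (B, N) segment centers
    T = int(ends[:, -1].max().round().item())        # total output frames
    t = torch.arange(T).float()[None, :, None]       # (1, T, 1) frame indices
    # soft alignment weights between every frame and every phoneme
    w = torch.softmax(-((t - centers[:, None, :]) ** 2) / (2 * sigma**2), dim=2)
    return w @ h                                     # (B, T, D) upsampled frames

h = torch.randn(1, 4, 8)                      # 4 phonemes, 8-dim encodings
d = torch.tensor([[2.0, 3.0, 1.5, 2.5]])      # differentiable durations
frames = gaussian_upsample(h, d)              # (1, 9, 8)
```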
arXiv Detail & Related papers (2023-09-01T06:40:41Z)
- Zero-Shot Voice Conditioning for Denoising Diffusion TTS Models [95.97506031821217]
We present a novel way of conditioning a pretrained denoising diffusion speech model to produce speech in the voice of a novel person unseen during training.
The method requires a short (3 seconds) sample from the target person, and generation is steered at inference time, without any training steps.
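Steering a pretrained diffusion model at inference time is typically done guidance-style: at each reverse step, nudge the sample along the gradient of a similarity score against the short reference clip. The sketch below is schematic throughout (the denoiser and speaker encoder are trivial stand-ins, and the actual paper's conditioning differs in detail); it only illustrates the training-free steering idea.

```python
# Schematic guidance loop: denoise, then push the sample toward higher
# speaker similarity with a ~3 s reference. No training steps involved.
import torch
import torch.nn.functional as F

denoiser = lambda x, t: 0.9 * x                        # stand-in pretrained model
speaker_embed = lambda x: x.mean(dim=1, keepdim=True)  # stand-in speaker encoder

def guided_reverse_step(x, t, ref_emb, scale: float = 0.1):
    x = x.detach().requires_grad_(True)
    sim = F.cosine_similarity(speaker_embed(x), ref_emb, dim=1).sum()
    grad = torch.autograd.grad(sim, x)[0]            # direction of higher similarity
    return denoiser(x, t).detach() + scale * grad    # denoise, then steer

ref = speaker_embed(torch.randn(1, 48000))           # ~3 s reference at 16 kHz
x = torch.randn(1, 48000)                            # start from noise
for t in reversed(range(10)):
    x = guided_reverse_step(x, t, ref)
```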
arXiv Detail & Related papers (2022-06-05T19:45:29Z)
- StarGANv2-VC: A Diverse, Unsupervised, Non-parallel Framework for Natural-Sounding Voice Conversion [19.74933410443264]
We present an unsupervised many-to-many voice conversion (VC) method using a generative adversarial network (GAN) called StarGAN v2.
Our model is trained only with 20 English speakers.
It generalizes to a variety of voice conversion tasks, such as any-to-many, cross-lingual, and singing conversion.
arXiv Detail & Related papers (2021-07-21T23:44:17Z)
- VAW-GAN for Singing Voice Conversion with Non-parallel Training Data [81.79070894458322]
We propose a singing voice conversion framework based on VAW-GAN.
We train an encoder to disentangle singer identity and singing prosody (F0) from phonetic content.
By conditioning on singer identity and F0, the decoder generates output spectral features with unseen target singer identity.
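The disentangle-then-recombine structure this entry describes can be captured in a few lines: an encoder maps spectral frames to a singer-independent content code, and a decoder reconstructs spectra conditioned on a singer embedding plus the F0 contour. The sketch below is illustrative PyTorch under assumed dimensions, not the paper's VAW-GAN implementation (the variational and adversarial losses are omitted).

```python
# Content encoder + identity/F0-conditioned decoder: swapping singer_id
# at inference converts the voice while content and melody are kept.
import torch
import torch.nn as nn

class VCAutoencoder(nn.Module):
    def __init__(self, spec_dim=80, content_dim=16, n_singers=4, emb_dim=8):
        super().__init__()
        self.encoder = nn.Linear(spec_dim, content_dim)   # content only, identity removed
        self.singer_emb = nn.Embedding(n_singers, emb_dim)
        self.decoder = nn.Linear(content_dim + emb_dim + 1, spec_dim)

    def forward(self, spec, f0, singer_id):
        # spec: (B, T, 80) frames; f0: (B, T, 1); singer_id: (B,)
        content = self.encoder(spec)
        emb = self.singer_emb(singer_id)[:, None, :].expand(-1, spec.size(1), -1)
        return self.decoder(torch.cat([content, emb, f0], dim=-1))

model = VCAutoencoder()
spec, f0 = torch.randn(2, 50, 80), torch.randn(2, 50, 1)
converted = model(spec, f0, torch.tensor([1, 3]))  # change singer_id to convert
```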
arXiv Detail & Related papers (2020-08-10T09:44:10Z)
- Many-to-Many Voice Transformer Network [55.17770019619078]
This paper proposes a voice conversion (VC) method based on a sequence-to-sequence (S2S) learning framework.
It enables simultaneous conversion of the voice characteristics, pitch contour, and duration of input speech.
arXiv Detail & Related papers (2020-05-18T04:02:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.