A Comparative Analysis Of Latent Regressor Losses For Singing Voice
Conversion
- URL: http://arxiv.org/abs/2302.13678v1
- Date: Mon, 27 Feb 2023 11:26:57 GMT
- Title: A Comparative Analysis Of Latent Regressor Losses For Singing Voice
Conversion
- Authors: Brendan O'Connor, Simon Dixon
- Abstract summary: Singer identity embedding (SIE) network on mel-spectrograms of singer recordings to produce singer-specific variance encodings.
We propose a pitch-matching mechanism between source and target singers to ensure these evaluations are not influenced by differences in pitch register.
- Score: 15.691936529849539
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Previous research has shown that established techniques for spoken voice
conversion (VC) do not perform as well when applied to singing voice conversion
(SVC). We propose an alternative loss component in a loss function that is
otherwise well-established among VC tasks, which has been shown to improve our
model's SVC performance. We first trained a singer identity embedding (SIE)
network on mel-spectrograms of singer recordings to produce singer-specific
variance encodings using contrastive learning. We subsequently trained a
well-known autoencoder framework (AutoVC) conditioned on these SIEs, and
measured differences in SVC performance when using different latent regressor
loss components. We found that using this loss w.r.t. SIEs leads to better
performance than w.r.t. bottleneck embeddings, where converted audio is more
natural and specific towards target singers. The inclusion of this loss
component has the advantage of explicitly forcing the network to reconstruct
with timbral similarity, and also negates the effect of poor disentanglement in
AutoVC's bottleneck embeddings. We demonstrate peculiar diversity between
computational and human evaluations on singer-converted audio clips, which
highlights the necessity of both. We also propose a pitch-matching mechanism
between source and target singers to ensure these evaluations are not
influenced by differences in pitch register.
Related papers
- SPA-SVC: Self-supervised Pitch Augmentation for Singing Voice Conversion [12.454955437047573]
We propose a Self-supervised Pitch Augmentation method for Singing Voice Conversion (SPA-SVC)
We introduce a cycle pitch shifting training strategy and Structural Similarity Index (SSIM) loss into our SVC model, effectively enhancing its performance.
Experimental results on the public singing datasets M4Singer indicate that our proposed method significantly improves model performance.
arXiv Detail & Related papers (2024-06-09T08:34:01Z) - Robust One-Shot Singing Voice Conversion [28.707278256253385]
High-quality singing voice conversion (SVC) of unseen singers remains challenging due to wide variety of musical expressions in pitch, loudness, and pronunciation.
We present a robust one-shot SVC that performs any-to-any SVC robustly even on distorted singing voices.
Experimental results show that the proposed method outperforms state-of-the-art one-shot SVC baselines for both seen and unseen singers.
arXiv Detail & Related papers (2022-10-20T08:47:35Z) - Conditional Deep Hierarchical Variational Autoencoder for Voice
Conversion [5.538544897623972]
Variational autoencoder-based voice conversion (VAE-VC) has the advantage of requiring only pairs of speeches and speaker labels for training.
This paper investigates how an increasing model expressiveness has benefits and impacts on the VAE-VC.
arXiv Detail & Related papers (2021-12-06T05:54:11Z) - Voicy: Zero-Shot Non-Parallel Voice Conversion in Noisy Reverberant
Environments [76.98764900754111]
Voice Conversion (VC) is a technique that aims to transform the non-linguistic information of a source utterance to change the perceived identity of the speaker.
We propose Voicy, a new VC framework particularly tailored for noisy speech.
Our method, which is inspired by the de-noising auto-encoders framework, is comprised of four encoders (speaker, content, phonetic and acoustic-ASR) and one decoder.
arXiv Detail & Related papers (2021-06-16T15:47:06Z) - DiffSVC: A Diffusion Probabilistic Model for Singing Voice Conversion [51.83469048737548]
We propose DiffSVC, an SVC system based on denoising diffusion probabilistic model.
A denoising module is trained in DiffSVC, which takes destroyed mel spectrogram and its corresponding step information as input to predict the added Gaussian noise.
Experiments show that DiffSVC can achieve superior conversion performance in terms of naturalness and voice similarity to current state-of-the-art SVC approaches.
arXiv Detail & Related papers (2021-05-28T14:26:40Z) - DiffSinger: Diffusion Acoustic Model for Singing Voice Synthesis [53.19363127760314]
DiffSinger is a parameterized Markov chain which iteratively converts the noise into mel-spectrogram conditioned on the music score.
The evaluations conducted on the Chinese singing dataset demonstrate that DiffSinger outperforms state-of-the-art SVS work with a notable margin.
arXiv Detail & Related papers (2021-05-06T05:21:42Z) - PPG-based singing voice conversion with adversarial representation
learning [18.937609682084034]
Singing voice conversion aims to convert the voice of one singer to that of other singers while keeping the singing content and melody.
We build an end-to-end architecture, taking posteriorgrams as inputs and generating mel spectrograms.
Our methods can significantly improve the conversion performance in terms of naturalness, melody, and voice similarity.
arXiv Detail & Related papers (2020-10-28T08:03:27Z) - VAW-GAN for Singing Voice Conversion with Non-parallel Training Data [81.79070894458322]
We propose a singing voice conversion framework based on VAW-GAN.
We train an encoder to disentangle singer identity and singing prosody (F0) from phonetic content.
By conditioning on singer identity and F0, the decoder generates output spectral features with unseen target singer identity.
arXiv Detail & Related papers (2020-08-10T09:44:10Z) - Unsupervised Cross-Domain Singing Voice Conversion [105.1021715879586]
We present a wav-to-wav generative model for the task of singing voice conversion from any identity.
Our method utilizes both an acoustic model, trained for the task of automatic speech recognition, together with melody extracted features to drive a waveform-based generator.
arXiv Detail & Related papers (2020-08-06T18:29:11Z) - Audio Impairment Recognition Using a Correlation-Based Feature
Representation [85.08880949780894]
We propose a new representation of hand-crafted features that is based on the correlation of feature pairs.
We show superior performance in terms of compact feature dimensionality and improved computational speed in the test stage.
arXiv Detail & Related papers (2020-03-22T13:34:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.