Effects of Convolutional Autoencoder Bottleneck Width on StarGAN-based
Singing Technique Conversion
- URL: http://arxiv.org/abs/2308.10021v1
- Date: Sat, 19 Aug 2023 14:13:28 GMT
- Title: Effects of Convolutional Autoencoder Bottleneck Width on StarGAN-based
Singing Technique Conversion
- Authors: Tung-Cheng Su, Yung-Chuan Chang, Yi-Wen Liu
- Abstract summary: Singing technique conversion (STC) refers to the task of converting from one voice technique to another.
Previous STC studies, as well as singing voice conversion research in general, have utilized convolutional autoencoders (CAEs) for conversion.
We constructed a GAN-based multi-domain STC system which took advantage of the WORLD vocoder representation and the CAE architecture.
- Score: 2.2221991003992967
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Singing technique conversion (STC) refers to the task of converting from one
voice technique to another while leaving the original singer identity, melody,
and linguistic components intact. Previous STC studies, as well as singing
voice conversion research in general, have utilized convolutional autoencoders
(CAEs) for conversion, but how the bottleneck width of the CAE affects the
synthesis quality has not been thoroughly evaluated. To this end, we
constructed a GAN-based multi-domain STC system which took advantage of the
WORLD vocoder representation and the CAE architecture. We varied the bottleneck
width of the CAE, and evaluated the conversion results subjectively. The model
was trained on a Mandarin dataset which features four singers and four singing
techniques: the chest voice, the falsetto, the raspy voice, and the whistle
voice. The results show that a wider bottleneck corresponds to better
articulation clarity but does not necessarily lead to higher likeness to the
target technique. Among the four techniques, we also found that the whistle
voice is the easiest target for conversion, while the other three, when used
as the source, produce more convincing conversion results than the whistle does.
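To make the abstract's central variable concrete, here is a minimal sketch of a convolutional autoencoder whose bottleneck width is a constructor argument. PyTorch, the layer sizes, and the 60-dimensional spectral features are illustrative assumptions, not the authors' configuration:

```python
import torch
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    """1-D convolutional autoencoder over WORLD-style spectral frames.

    `bottleneck_channels` controls the width of the latent code,
    the quantity varied in the study. (Hypothetical layer sizes.)
    """

    def __init__(self, n_features: int = 60, bottleneck_channels: int = 8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(n_features, 128, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(128, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            # The bottleneck: fewer channels means a narrower code.
            nn.Conv1d(64, bottleneck_channels, kernel_size=5, padding=2),
        )
        self.decoder = nn.Sequential(
            nn.Conv1d(bottleneck_channels, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(128, n_features, kernel_size=5, padding=2),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_features, n_frames)
        return self.decoder(self.encoder(x))

# Compare shapes across bottleneck widths on a dummy batch.
x = torch.randn(4, 60, 128)
for width in (2, 8, 32):
    model = ConvAutoencoder(bottleneck_channels=width)
    print(width, model(x).shape)  # -> torch.Size([4, 60, 128])
```

Narrowing `bottleneck_channels` forces the code to discard information, which is the capacity/clarity trade-off the paper evaluates subjectively.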
Related papers
- Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt [50.25271407721519]
We propose Prompt-Singer, the first SVS method that enables control over singer gender, vocal range, and volume with natural language.
We adopt a model architecture based on a decoder-only transformer with a multi-scale hierarchy, and design a range-melody decoupled pitch representation.
Experiments show that our model achieves favorable controllability and audio quality.
arXiv Detail & Related papers (2024-03-18T13:39:05Z)
- PrimaDNN': A Characteristics-aware DNN Customization for Singing Technique Detection [5.399268560100004]
We propose PrimaDNN, a deep neural network model with a characteristics-oriented improvement.
In the results of J-POP singing technique detection, PrimaDNN achieved the best result of 44.9% in overall macro-F measure.
arXiv Detail & Related papers (2023-06-25T10:15:18Z)
- A Comparative Analysis Of Latent Regressor Losses For Singing Voice Conversion [15.691936529849539]
We train a singer identity embedding (SIE) network on mel-spectrograms of singer recordings to produce singer-specific variance encodings.
We propose a pitch-matching mechanism between source and target singers to ensure these evaluations are not influenced by differences in pitch register (a minimal version is sketched below).
arXiv Detail & Related papers (2023-02-27T11:26:57Z)
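A minimal sketch of the pitch-matching idea above, assuming it is realized as a median-F0 transposition in semitones (the paper's actual mechanism may differ):

```python
import numpy as np

def semitone_shift(f0_source: np.ndarray, f0_target: np.ndarray) -> float:
    """Semitone offset that aligns the source's median F0 with the target's.

    Unvoiced frames are conventionally marked with F0 = 0 and are excluded.
    """
    src = np.median(f0_source[f0_source > 0])
    tgt = np.median(f0_target[f0_target > 0])
    return 12.0 * np.log2(tgt / src)

# Example: a source centered near 220 Hz and a target near 330 Hz.
f0_src = np.array([0.0, 218.0, 220.0, 222.0])
f0_tgt = np.array([330.0, 328.0, 0.0, 332.0])
shift = semitone_shift(f0_src, f0_tgt)                      # ~ +7 semitones
f0_matched = np.where(f0_src > 0, f0_src * 2 ** (shift / 12.0), 0.0)
```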
- Towards High-fidelity Singing Voice Conversion with Acoustic Reference and Contrastive Predictive Coding [6.278338686038089]
Phonetic posteriorgram (PPG)-based methods have been quite popular in non-parallel singing voice conversion systems.
Due to the lack of acoustic information in PPGs, the style and naturalness of the converted singing voices are still limited.
Our proposed model can significantly improve the naturalness of converted singing voices and the similarity with the target singer (a CPC-style loss is sketched below).
arXiv Detail & Related papers (2021-10-10T10:27:20Z)
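The entry above pairs an acoustic reference with contrastive predictive coding (CPC). CPC-style objectives are typically trained with an InfoNCE loss, sketched here in a generic batch-as-negatives form (not the paper's exact formulation):

```python
import torch
import torch.nn.functional as F

def info_nce(pred: torch.Tensor, future: torch.Tensor,
             temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE loss: each predicted vector should match its own future frame.

    pred, future: (batch, dim). Other items in the batch act as negatives.
    """
    pred = F.normalize(pred, dim=-1)
    future = F.normalize(future, dim=-1)
    logits = pred @ future.t() / temperature   # (batch, batch) similarities
    labels = torch.arange(pred.size(0))        # positives on the diagonal
    return F.cross_entropy(logits, labels)

loss = info_nce(torch.randn(8, 256), torch.randn(8, 256))
```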
- StarGANv2-VC: A Diverse, Unsupervised, Non-parallel Framework for Natural-Sounding Voice Conversion [19.74933410443264]
We present an unsupervised many-to-many voice conversion (VC) method using a generative adversarial network (GAN) called StarGAN v2.
Our model is trained only with 20 English speakers.
It generalizes to a variety of voice conversion tasks, such as any-to-many, cross-lingual, and singing conversion (a toy multi-domain discriminator is sketched below).
arXiv Detail & Related papers (2021-07-21T23:44:17Z)
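StarGANv2-VC's many-to-many setup uses a single discriminator with one real/fake branch per domain. A much-simplified, hypothetical sketch of that multi-branch head (layer sizes are assumptions; the real model also has a style encoder and mapping network not shown here):

```python
import torch
import torch.nn as nn

class MultiDomainDiscriminator(nn.Module):
    """One shared feature extractor, one real/fake logit per domain."""

    def __init__(self, n_domains: int):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.heads = nn.Linear(64, n_domains)  # one logit per domain

    def forward(self, mel: torch.Tensor, domain: torch.Tensor) -> torch.Tensor:
        # mel: (batch, 1, n_mels, n_frames); domain: (batch,) long indices
        logits = self.heads(self.backbone(mel))            # (batch, n_domains)
        return logits.gather(1, domain.unsqueeze(1)).squeeze(1)

# 20 domains, matching the 20 training speakers mentioned above.
d = MultiDomainDiscriminator(n_domains=20)
out = d(torch.randn(4, 1, 80, 128), torch.tensor([0, 3, 7, 19]))  # shape (4,)
```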
- DiffSinger: Diffusion Acoustic Model for Singing Voice Synthesis [53.19363127760314]
DiffSinger is a parameterized Markov chain that iteratively converts noise into a mel-spectrogram conditioned on the music score (one reverse step is sketched below).
The evaluations conducted on the Chinese singing dataset demonstrate that DiffSinger outperforms state-of-the-art SVS work with a notable margin.
arXiv Detail & Related papers (2021-05-06T05:21:42Z)
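DiffSinger's sampler follows the standard denoising-diffusion recipe: starting from Gaussian noise, each reverse step removes a little of the noise predicted by a score-conditioned network. A minimal sketch of one such step (the schedule length, shapes, and the random stand-in for the network's output are illustrative):

```python
import torch

def ddpm_reverse_step(x_t, t, eps_pred, betas):
    """One reverse-diffusion step: estimate x_{t-1} from x_t.

    x_t: noisy mel-spectrogram at step t; eps_pred: the model's noise
    estimate, which in DiffSinger is conditioned on the music score.
    """
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    coef = betas[t] / torch.sqrt(1.0 - alpha_bar[t])
    mean = (x_t - coef * eps_pred) / torch.sqrt(alphas[t])
    if t > 0:
        mean = mean + torch.sqrt(betas[t]) * torch.randn_like(x_t)
    return mean

# Tiny numerical example with a 100-step linear noise schedule.
betas = torch.linspace(1e-4, 0.02, 100)
x = torch.randn(1, 80, 64)  # pure noise at the final step
x_prev = ddpm_reverse_step(x, 99, torch.randn_like(x), betas)
```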
- PPG-based singing voice conversion with adversarial representation learning [18.937609682084034]
Singing voice conversion aims to convert the voice of one singer to that of other singers while keeping the singing content and melody unchanged.
We build an end-to-end architecture, taking posteriorgrams as inputs and generating mel spectrograms.
Our methods can significantly improve the conversion performance in terms of naturalness, melody, and voice similarity.
arXiv Detail & Related papers (2020-10-28T08:03:27Z)
- Spectrum and Prosody Conversion for Cross-lingual Voice Conversion with CycleGAN [81.79070894458322]
Cross-lingual voice conversion aims to change the source speaker's voice to sound like that of the target speaker when the source and target speakers speak different languages.
Previous studies on cross-lingual voice conversion mainly focus on spectral conversion with a linear transformation for F0 transfer.
We propose the use of continuous wavelet transform (CWT) decomposition for F0 modeling. CWT provides a way to decompose a signal into different temporal scales that explain prosody in different time resolutions (see the sketch below).
arXiv Detail & Related papers (2020-08-11T07:29:55Z)
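The CWT-based F0 modeling above can be tried with off-the-shelf wavelet tools. A sketch using PyWavelets on a synthetic F0 contour (the scale set and the Mexican-hat wavelet are illustrative choices, not the paper's configuration):

```python
import numpy as np
import pywt

# A synthetic 2-second F0 contour at 200 frames/s: slow phrase-level
# drift plus a faster vibrato-like wobble.
t = np.linspace(0, 2, 400)
f0 = 220 + 20 * np.sin(2 * np.pi * 0.5 * t) + 5 * np.sin(2 * np.pi * 6 * t)

# Decompose into several temporal scales; small scales capture fast
# fluctuations (vibrato), large scales capture phrase-level prosody.
scales = np.array([2, 4, 8, 16, 32, 64])
coeffs, freqs = pywt.cwt(f0 - f0.mean(), scales, "mexh",
                         sampling_period=1 / 200)
print(coeffs.shape)  # (6, 400): one coefficient series per scale
```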
- VAW-GAN for Singing Voice Conversion with Non-parallel Training Data [81.79070894458322]
We propose a singing voice conversion framework based on VAW-GAN.
We train an encoder to disentangle singer identity and singing prosody (F0) from phonetic content.
By conditioning on singer identity and F0, the decoder generates output spectral features for the unseen target singer (a toy conditioned decoder is sketched below).
arXiv Detail & Related papers (2020-08-10T09:44:10Z)
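A minimal sketch of the conditioning scheme the VAW-GAN entry above describes: the decoder receives the content latent concatenated with a singer one-hot code and the F0 contour. The dimensions, layers, and plain-concatenation design are illustrative assumptions, not the paper's exact network:

```python
import torch
import torch.nn as nn

class ConditionedDecoder(nn.Module):
    """Decode spectral frames from content code + singer one-hot + F0."""

    def __init__(self, z_dim: int = 64, n_singers: int = 4,
                 n_features: int = 60):
        super().__init__()
        self.net = nn.Sequential(
            # +1 input channel carries the frame-level F0 value.
            nn.Conv1d(z_dim + n_singers + 1, 128, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(128, n_features, kernel_size=5, padding=2),
        )

    def forward(self, z, singer_onehot, f0):
        # z: (B, z_dim, T); singer_onehot: (B, n_singers); f0: (B, 1, T)
        T = z.size(-1)
        singer = singer_onehot.unsqueeze(-1).expand(-1, -1, T)
        return self.net(torch.cat([z, singer, f0], dim=1))

dec = ConditionedDecoder()
out = dec(torch.randn(2, 64, 100),
          torch.eye(4)[torch.tensor([0, 2])],  # singers 0 and 2
          torch.rand(2, 1, 100) * 300)         # F0 contour in Hz
print(out.shape)  # torch.Size([2, 60, 100])
```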
- Unsupervised Cross-Domain Singing Voice Conversion [105.1021715879586]
We present a wav-to-wav generative model for the task of singing voice conversion from any identity.
Our method utilizes an acoustic model trained for automatic speech recognition, together with extracted melody features, to drive a waveform-based generator.
arXiv Detail & Related papers (2020-08-06T18:29:11Z)
- F0-consistent many-to-many non-parallel voice conversion via conditional autoencoder [53.901873501494606]
We modified and improved autoencoder-based voice conversion to disentangle content, F0, and speaker identity at the same time.
We can control the F0 contour, generate speech with F0 consistent with the target speaker, and significantly improve quality and similarity.
arXiv Detail & Related papers (2020-04-15T22:00:06Z)