Speech-to-Singing Conversion based on Boundary Equilibrium GAN
- URL: http://arxiv.org/abs/2005.13835v3
- Date: Wed, 5 Aug 2020 13:21:32 GMT
- Title: Speech-to-Singing Conversion based on Boundary Equilibrium GAN
- Authors: Da-Yi Wu, Yi-Hsuan Yang
- Abstract summary: This paper investigates the use of generative adversarial network (GAN)-based models for converting the spectrogram of a speech signal into that of a singing one.
The proposed model generates singing voices with much higher naturalness than an existing non-adversarially trained baseline.
- Score: 42.739822506085694
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: This paper investigates the use of generative adversarial network (GAN)-based
models for converting the spectrogram of a speech signal into that of a singing
one, without reference to the phoneme sequence underlying the speech. This is
achieved by viewing speech-to-singing conversion as a style transfer problem.
Specifically, given a speech input, and optionally the F0 contour of the target
singing, the proposed model generates a singing signal as output, using a
progressive-growing encoder/decoder architecture and boundary equilibrium GAN
(BEGAN) loss functions. Our quantitative and qualitative analyses show that the
proposed model generates singing voices with much higher naturalness than an
existing non-adversarially trained baseline. For reproducibility, the code will
be made publicly available in a GitHub repository upon paper publication.
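As a rough illustration of the training objective named in the abstract, here is a minimal PyTorch-style sketch of one BEGAN update step, where the discriminator is an autoencoder scored by its reconstruction error. The function and hyperparameter names (gamma, lambda_k) follow the original BEGAN formulation and are illustrative assumptions, not the authors' released code.

```python
# Hedged sketch of a boundary equilibrium GAN (BEGAN) step for
# spectrogram-to-spectrogram training. The discriminator d is an
# autoencoder; its "loss" on an input is the reconstruction error.
import torch

def recon_loss(d, x):
    # L(v) = |v - D(v)|: L1 reconstruction error of the autoencoder discriminator.
    return (x - d(x)).abs().mean()

def began_step(d, g, speech_spec, singing_spec, k, gamma=0.5, lambda_k=1e-3):
    fake = g(speech_spec)                      # converted (fake singing) spectrogram
    loss_real = recon_loss(d, singing_spec)    # D should reconstruct real singing well...
    loss_fake = recon_loss(d, fake.detach())   # ...and fake singing poorly
    loss_d = loss_real - k * loss_fake         # discriminator objective (backprop separately)
    loss_g = recon_loss(d, fake)               # generator objective
    # Proportional control keeps E[L(G)] close to gamma * E[L(x)].
    k = min(max(k + lambda_k * (gamma * loss_real.item() - loss_fake.item()), 0.0), 1.0)
    # Convergence measure from the BEGAN paper, useful for monitoring training.
    m_global = loss_real.item() + abs(gamma * loss_real.item() - loss_fake.item())
    return loss_d, loss_g, k, m_global
```

The control variable k balances the two reconstruction terms so that the generator and discriminator stay near equilibrium, which is what makes m_global usable as a convergence measure.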
Related papers
- DiffVoice: Text-to-Speech with Latent Diffusion [18.150627638754923]
We present DiffVoice, a novel text-to-speech model based on latent diffusion.
Subjective evaluations on LJSpeech and LibriTTS datasets demonstrate that our method beats the best publicly available systems in naturalness.
arXiv Detail & Related papers (2023-04-23T21:05:33Z)
- SeqDiffuSeq: Text Diffusion with Encoder-Decoder Transformers [50.90457644954857]
In this work, we apply diffusion models to approach sequence-to-sequence text generation.
We propose SeqDiffuSeq, a text diffusion model for sequence-to-sequence generation.
Experimental results show strong performance on sequence-to-sequence generation in terms of both text quality and inference time.
arXiv Detail & Related papers (2022-12-20T15:16:24Z)
- Leveraging Symmetrical Convolutional Transformer Networks for Speech to Singing Voice Style Transfer [49.01417720472321]
We develop a novel neural network architecture, called SymNet, which models the alignment of the input speech with the target melody.
Experiments are performed on the NUS and NHSS datasets which consist of parallel data of speech and singing voice.
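For intuition, here is a minimal sketch of the soft-alignment idea this summary refers to, using plain scaled dot-product attention; SymNet's actual symmetrical convolutional-transformer layers are more elaborate, and all names here are illustrative.

```python
# Illustrative sketch (not SymNet itself): softly align speech frames to a
# target-melody time axis via scaled dot-product attention.
import torch
import torch.nn.functional as F

def align_speech_to_melody(speech_feats, melody_feats):
    # speech_feats: (T_speech, d) encoder states of the input speech
    # melody_feats: (T_melody, d) frame-level features of the target melody
    d = speech_feats.shape[-1]
    scores = melody_feats @ speech_feats.T / d ** 0.5   # (T_melody, T_speech)
    attn = F.softmax(scores, dim=-1)    # each melody frame attends over speech frames
    return attn @ speech_feats          # (T_melody, d): speech re-timed to the melody
```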
arXiv Detail & Related papers (2022-08-26T02:54:57Z)
- Learning and controlling the source-filter representation of speech with a variational autoencoder [23.05989605017053]
In speech processing, the source-filter model considers that speech signals are produced from a few independent and physically meaningful continuous latent factors.
We propose a method to accurately and independently control the source-filter speech factors within the latent subspaces.
Without requiring additional information such as text or human-labeled data, this results in a deep generative model of speech spectrograms.
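A hypothetical sketch of the latent-subspace control idea, assuming a trained VAE and a learned orthonormal basis U for the subspace tied to one factor (e.g., F0); the basis-learning step itself is omitted and all names are assumptions, not the paper's code.

```python
# Edit one speech factor by replacing only the component of the latent
# vector that lies inside that factor's subspace.
import numpy as np

def control_factor(z, U, target_coords):
    # z: (d,) latent vector from the VAE encoder
    # U: (d, k) orthonormal basis of the factor subspace
    # target_coords: (k,) desired coordinates within that subspace
    z_orth = z - U @ (U.T @ z)       # component carrying the other speech factors
    return z_orth + U @ target_coords

# Usage idea: decode control_factor(encoder(spec), U_f0, desired_coords)
# to obtain a spectrogram with the edited source factor.
```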
arXiv Detail & Related papers (2022-04-14T16:13:06Z)
- Timbre Transfer with Variational Auto Encoding and Cycle-Consistent Adversarial Networks [0.6445605125467573]
This research project investigates the application of deep learning to timbre transfer, where the timbre of a source audio can be converted to the timbre of a target audio with minimal loss in quality.
The adopted approach combines Variational Autoencoders with Generative Adversarial Networks to construct meaningful representations of the source audio and produce realistic generations of the target audio.
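A generic sketch of how the two objectives can be combined on the generator side, assuming a PyTorch setup; the weighting and exact terms are illustrative, not the project's actual recipe.

```python
# Generic VAE-GAN generator loss: reconstruction + KL prior + adversarial term.
import torch
import torch.nn.functional as F

def vae_gan_generator_loss(x, x_hat, mu, logvar, disc_fake_logits,
                           beta=1.0, lam=0.1):
    recon = (x - x_hat).abs().mean()                               # reconstruction
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())  # VAE prior term
    adv = F.binary_cross_entropy_with_logits(
        disc_fake_logits, torch.ones_like(disc_fake_logits))       # fool the discriminator
    return recon + beta * kl + lam * adv
```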
arXiv Detail & Related papers (2021-09-05T15:06:53Z)
- End-to-End Video-To-Speech Synthesis using Generative Adversarial Networks [54.43697805589634]
We propose a new end-to-end video-to-speech model based on Generative Adversarial Networks (GANs).
Our model consists of an encoder-decoder architecture that receives raw video as input and generates speech.
We show that this model is able to reconstruct speech with remarkable realism for constrained datasets such as GRID.
arXiv Detail & Related papers (2021-04-27T17:12:30Z)
- Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
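The two-stage design reads naturally as function composition; below is a minimal sketch with hypothetical interfaces (bne, seq2seq, and vocoder are placeholders, not the paper's code).

```python
# Two-stage any-to-many voice conversion pipeline, sketched as composition.
def convert(bne, seq2seq, vocoder, source_mel, target_speaker_emb):
    # 1) Strip speaker identity: bottleneck features approximate a
    #    speaker-independent linguistic representation.
    bn_feats = bne(source_mel)
    # 2) Re-synthesize with the target speaker via seq2seq modeling,
    #    which also permits duration and prosody changes.
    target_mel = seq2seq(bn_feats, target_speaker_emb)
    # 3) Mel-spectrogram to waveform.
    return vocoder(target_mel)
```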
arXiv Detail & Related papers (2020-09-06T13:01:06Z)
- Unsupervised Cross-Domain Singing Voice Conversion [105.1021715879586]
We present a wav-to-wav generative model for the task of singing voice conversion from any identity.
Our method combines an acoustic model trained for automatic speech recognition with melody-derived features to drive a waveform-based generator.
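A small sketch of that conditioning scheme, with hypothetical interfaces for the ASR model, melody extractor, and generator; shapes and names are illustrative assumptions.

```python
# Concatenate ASR-derived content features with melody features to
# condition a waveform generator.
import torch

def drive_generator(asr_model, melody_extractor, generator, wav):
    ling = asr_model(wav)             # speaker-agnostic content features, (T, d1)
    melody = melody_extractor(wav)    # e.g., F0/loudness contours, (T, d2)
    cond = torch.cat([ling, melody], dim=-1)  # frame-aligned conditioning
    return generator(cond)            # waveform in the target singer's voice
```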
arXiv Detail & Related papers (2020-08-06T18:29:11Z)
- Unsupervised Cross-Domain Speech-to-Speech Conversion with Time-Frequency Consistency [14.062850439230111]
We propose a condition that encourages spectrogram consistency during the adversarial training procedure.
Our experimental results on the Librispeech corpus show that the model trained with the TF-consistency condition provides perceptually better speech-to-speech conversion quality.
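In the spirit of that condition, here is a minimal PyTorch sketch of one way to penalize time-frequency inconsistency, by inverting a magnitude/phase pair and re-analyzing it; the parameter values and exact penalty are illustrative assumptions, not the paper's formulation.

```python
# A magnitude spectrogram is "consistent" if some waveform actually has
# that STFT magnitude; project via inverse STFT and re-analysis, then
# penalize the gap.
import torch

def tf_consistency_loss(mag, phase, n_fft=1024, hop=256):
    # mag, phase: (n_fft // 2 + 1, T) magnitude and phase of a one-sided STFT
    win = torch.hann_window(n_fft)
    spec = torch.polar(mag, phase)                           # complex spectrogram
    wav = torch.istft(spec, n_fft, hop_length=hop, window=win)
    re_mag = torch.stft(wav, n_fft, hop_length=hop, window=win,
                        return_complex=True).abs()
    return (mag - re_mag).abs().mean()
```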
arXiv Detail & Related papers (2020-05-15T22:27:07Z)