Anyone GAN Sing
- URL: http://arxiv.org/abs/2102.11058v1
- Date: Mon, 22 Feb 2021 14:30:58 GMT
- Title: Anyone GAN Sing
- Authors: Shreeviknesh Sankaran, Sukavanan Nanjundan, G. Paavai Anand
- Abstract summary: We present a method to synthesize the singing voice of a person using a Convolutional Long Short-term Memory (ConvLSTM) based GAN.
Our work is inspired by WGANSing by Chandna et al.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The problem of audio synthesis has been increasingly solved using deep neural
networks. With the introduction of Generative Adversarial Networks (GAN),
another efficient and effective path has opened up to solve this problem. In
this paper, we present a method to synthesize the singing voice of a person
using a Convolutional Long Short-term Memory (ConvLSTM) based GAN optimized
using the Wasserstein loss function. Our work is inspired by WGANSing by
Chandna et al. Our model takes consecutive frame-wise linguistic and frequency
features, along with a singer identity, as input and outputs vocoder features. We train the
model on a dataset of 48 English songs sung and spoken by 12 non-professional
singers. For inference, sequential blocks are concatenated using an overlap-add
procedure. We test the model using the Mel-Cepstral Distance metric and a
subjective listening test with 18 participants.
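To make the training objective, the overlap-add inference step, and the Mel-Cepstral Distance metric described above concrete, here is a minimal PyTorch sketch. All module names (ConvLSTMGenerator, Critic), feature dimensions, block sizes, and hyperparameters are illustrative assumptions, not the paper's released code; a plain LSTM stands in for the actual ConvLSTM layers.

```python
# Minimal sketch: Wasserstein-loss training of a recurrent generator that
# maps frame-wise linguistic/frequency features plus a singer identity to
# vocoder features, overlap-add stitching of consecutive output blocks,
# and Mel-Cepstral Distance. All names, dimensions, and hyperparameters
# below are illustrative assumptions, not the paper's released code.
import torch
import torch.nn as nn

class ConvLSTMGenerator(nn.Module):
    """Stand-in generator: a real implementation would use ConvLSTM layers;
    a plain LSTM keeps this sketch short and runnable."""
    def __init__(self, in_dim=46, voc_dim=64, n_singers=12, emb_dim=8):
        super().__init__()
        self.singer_emb = nn.Embedding(n_singers, emb_dim)
        self.rnn = nn.LSTM(in_dim + emb_dim, 128, batch_first=True)
        self.out = nn.Linear(128, voc_dim)

    def forward(self, feats, singer_id):
        # feats: (batch, frames, in_dim); singer_id: (batch,)
        emb = self.singer_emb(singer_id).unsqueeze(1).expand(-1, feats.size(1), -1)
        h, _ = self.rnn(torch.cat([feats, emb], dim=-1))
        return self.out(h)  # (batch, frames, voc_dim)

class Critic(nn.Module):
    """Scores a block of vocoder features; higher means 'more real'."""
    def __init__(self, voc_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(voc_dim, 128), nn.ReLU(),
                                 nn.Linear(128, 1))

    def forward(self, x):
        return self.net(x).mean(dim=(1, 2))  # one scalar per batch item

def wasserstein_losses(critic, real, fake):
    # Critic minimizes E[D(fake)] - E[D(real)]; generator minimizes -E[D(fake)].
    d_loss = critic(fake.detach()).mean() - critic(real).mean()
    g_loss = -critic(fake).mean()
    return d_loss, g_loss

def overlap_add(blocks, hop):
    """Stitch a list of (block_len, voc_dim) blocks generated at stride
    `hop` into one sequence, cross-fading overlaps with a triangular window."""
    block_len, dim = blocks[0].shape
    total = hop * (len(blocks) - 1) + block_len
    out = torch.zeros(total, dim)
    norm = torch.zeros(total, 1)
    ramp = torch.linspace(0.0, 1.0, block_len).unsqueeze(1)
    window = torch.minimum(ramp, 1.0 - ramp) + 1e-3  # triangular, never zero
    for i, block in enumerate(blocks):
        start = i * hop
        out[start:start + block_len] += block * window
        norm[start:start + block_len] += window
    return out / norm

def mel_cepstral_distance(mc_ref, mc_syn):
    """Frame-averaged Mel-Cepstral Distance in dB between two aligned
    (frames, order) mel-cepstral sequences (0th coefficient excluded)."""
    diff = mc_ref - mc_syn
    per_frame = torch.sqrt(2.0 * (diff ** 2).sum(dim=1))
    return (10.0 / torch.log(torch.tensor(10.0))) * per_frame.mean()
```

Note that a full Wasserstein GAN setup would also enforce the critic's 1-Lipschitz constraint (via weight clipping or a gradient penalty), which this sketch omits for brevity.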
Related papers
- Song Data Cleansing for End-to-End Neural Singer Diarization Using Neural Analysis and Synthesis Framework [7.12217278294376]
Our proposed model converts song data containing choral singing, which is commonly found in popular music.
We exploit the pre-trained NANSY++ to convert choral singing into clean, non-overlapped audio.
We experimentally evaluated the EEND model trained with the converted dataset on annotated popular duet songs.
arXiv Detail & Related papers (2024-06-24T04:48:29Z)
- VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment [101.2489492032816]
VALL-E R is a robust and efficient zero-shot Text-to-Speech system.
This research has the potential to be applied to meaningful projects, including the creation of speech for those affected by aphasia.
arXiv Detail & Related papers (2024-06-12T04:09:44Z)
- SpecDiff-GAN: A Spectrally-Shaped Noise Diffusion GAN for Speech and Music Synthesis [0.0]
We introduce SpecDiff-GAN, a neural vocoder based on HiFi-GAN.
We show the merits of our proposed model for speech and music synthesis on several datasets.
arXiv Detail & Related papers (2024-01-30T09:17:57Z)
- BigVGAN: A Universal Neural Vocoder with Large-Scale Training [49.16254684584935]
We present BigVGAN, a universal vocoder that generalizes well under various unseen conditions in a zero-shot setting.
We introduce periodic nonlinearities and anti-aliased representation into the generator, which brings the desired inductive bias for waveform synthesis.
We train our GAN vocoder at the largest scale up to 112M parameters, which is unprecedented in the literature.
arXiv Detail & Related papers (2022-06-09T17:56:10Z)
- Learning the Beauty in Songs: Neural Singing Voice Beautifier [69.21263011242907]
We are interested in a novel task, singing voice beautifying (SVB).
Given the singing voice of an amateur singer, SVB aims to improve the intonation and vocal tone of the voice, while keeping the content and vocal timbre.
We introduce Neural Singing Voice Beautifier (NSVB), the first generative model to solve the SVB task.
arXiv Detail & Related papers (2022-02-27T03:10:12Z)
- KaraSinger: Score-Free Singing Voice Synthesis with VQ-VAE using Mel-spectrograms [42.59716267275078]
We propose a novel neural network model called KaraSinger for a singing voice synthesis task named score-free SVS.
KaraSinger comprises a vector-quantized variational autoencoder (VQ-VAE) that compresses the Mel-spectrograms of singing audio to sequences of discrete codes, and a language model (LM) that learns to predict the discrete codes given the corresponding lyrics.
We validate the effectiveness of the proposed design choices using a proprietary collection of 550 English pop songs sung by multiple amateur singers.
arXiv Detail & Related papers (2021-10-08T10:00:23Z)
- DiffSinger: Diffusion Acoustic Model for Singing Voice Synthesis [53.19363127760314]
DiffSinger is a parameterized Markov chain that iteratively converts noise into a mel-spectrogram conditioned on the music score.
The evaluations conducted on the Chinese singing dataset demonstrate that DiffSinger outperforms state-of-the-art SVS work by a notable margin.
arXiv Detail & Related papers (2021-05-06T05:21:42Z)
- Unsupervised Cross-Domain Singing Voice Conversion [105.1021715879586]
We present a wav-to-wav generative model for the task of singing voice conversion from any identity.
Our method combines an acoustic model, trained for the task of automatic speech recognition, with extracted melody features to drive a waveform-based generator.
arXiv Detail & Related papers (2020-08-06T18:29:11Z)
- DeepSinger: Singing Voice Synthesis with Data Mined From the Web [194.10598657846145]
DeepSinger is a multi-lingual singing voice synthesis system built from scratch using singing training data mined from music websites.
We evaluate DeepSinger on our mined singing dataset, which consists of about 92 hours of data from 89 singers in three languages.
arXiv Detail & Related papers (2020-07-09T07:00:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.