Unconditional Audio Generation with Generative Adversarial Networks and
Cycle Regularization
- URL: http://arxiv.org/abs/2005.08526v1
- Date: Mon, 18 May 2020 08:35:16 GMT
- Title: Unconditional Audio Generation with Generative Adversarial Networks and
Cycle Regularization
- Authors: Jen-Yu Liu, Yu-Hua Chen, Yin-Cheng Yeh, Yi-Hsuan Yang
- Abstract summary: We present a generative adversarial network (GAN)-based model for unconditional generation of the mel-spectrograms of singing voices.
We employ a hierarchical architecture in the generator to induce some structure in the temporal dimension.
We evaluate the performance of the new model not only for generating singing voices, but also for generating speech voices.
- Score: 48.55126268721948
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In a recent paper, we presented a generative adversarial network
(GAN)-based model for unconditional generation of the mel-spectrograms of
singing voices. As the generator of the model is designed to take a
variable-length sequence of noise vectors as input, it can generate
mel-spectrograms of variable length. However, our previous listening test showed
that the quality of the generated audio leaves room for improvement. The
present paper extends and expands that previous work in the following aspects.
First, we employ a hierarchical architecture in the generator to induce some
structure in the temporal dimension. Second, we introduce a cycle
regularization mechanism to the generator to avoid mode collapse. Third, we
evaluate the performance of the new model not only for generating singing
voices, but also for generating speech voices. Evaluation results show that the
new model outperforms the prior one both objectively and subjectively. We also
employ the model to unconditionally generate sequences of piano and violin
music and find the result promising. Audio examples, as well as the code for
implementing our model, will be publicly available online upon paper
publication.
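The two additions above can be made concrete with a short sketch. The following is a minimal PyTorch illustration, not the authors' released code: the GRU layers, feature sizes, and the 4x temporal upsampling factor are assumptions chosen for brevity. It shows a generator that turns a variable-length sequence of noise vectors into a mel-spectrogram through a coarse-to-fine hierarchy, together with an encoder whose cycle loss discourages mode collapse by requiring the input noise to stay recoverable from the generated output.

```python
import torch
import torch.nn as nn

NOISE_DIM, MEL_BINS = 20, 80                 # illustrative sizes

class HierarchicalGenerator(nn.Module):
    """Coarse-to-fine generator over a variable-length noise sequence."""
    def __init__(self):
        super().__init__()
        self.coarse = nn.GRU(NOISE_DIM, 128, batch_first=True)
        self.upsample = nn.Upsample(scale_factor=4)   # 4x more frames
        self.fine = nn.GRU(128, 128, batch_first=True)
        self.to_mel = nn.Linear(128, MEL_BINS)

    def forward(self, z):                    # z: (batch, T, NOISE_DIM)
        h, _ = self.coarse(z)                # coarse temporal structure
        h = self.upsample(h.transpose(1, 2)).transpose(1, 2)
        h, _ = self.fine(h)                  # refine at the frame level
        return self.to_mel(h)                # (batch, 4*T, MEL_BINS)

class NoiseEncoder(nn.Module):
    """Maps a generated mel back to the noise sequence for the cycle loss."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(MEL_BINS, 128, batch_first=True)
        self.pool = nn.AvgPool1d(4)          # undo the 4x upsampling
        self.to_z = nn.Linear(128, NOISE_DIM)

    def forward(self, mel):
        h, _ = self.rnn(mel)
        h = self.pool(h.transpose(1, 2)).transpose(1, 2)
        return self.to_z(h)

G, E = HierarchicalGenerator(), NoiseEncoder()
z = torch.randn(2, 16, NOISE_DIM)            # any T works: variable length
mel = G(z)                                   # (2, 64, 80)
# Cycle regularization: distinct noise inputs must stay recoverable from
# the output, so the generator cannot collapse many z onto one mel.
cycle_loss = torch.mean(torch.abs(E(mel) - z))
```

Because both stages are recurrent and the upsampling factor is fixed, any number of input noise vectors yields a valid output length, which is what enables variable-length generation.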
Related papers
- Generative Pre-training for Speech with Flow Matching [81.59952572752248]
We pre-trained a generative model, named SpeechFlow, on 60k hours of untranscribed speech with Flow Matching and masked conditions.
Experiment results show the pre-trained generative model can be fine-tuned with task-specific data to match or surpass existing expert models on speech enhancement, separation, and synthesis.
arXiv Detail & Related papers (2023-10-25T03:40:50Z)
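As a rough illustration of the flow-matching objective with masked conditioning mentioned above, here is a self-contained sketch; the toy velocity network, the straight-line probability path, and the frame-level masking are illustrative assumptions rather than SpeechFlow's actual design.

```python
import torch
import torch.nn as nn

class TinyVelocityNet(nn.Module):
    """Toy stand-in for a real speech model (purely illustrative)."""
    def __init__(self, mel_bins=80):
        super().__init__()
        self.net = nn.Linear(2 * mel_bins + 1, mel_bins)

    def forward(self, xt, t, cond):
        # Broadcast the scalar time over all frames and append as a feature.
        t = t.view(-1, 1, 1).expand(xt.size(0), xt.size(1), 1)
        return self.net(torch.cat([xt, cond, t], dim=-1))

def flow_matching_loss(model, x1, mask_ratio=0.8):
    """x1: clean speech features, shape (batch, frames, mel_bins)."""
    x0 = torch.randn_like(x1)                 # noise endpoint of the path
    t = torch.rand(x1.size(0))                # one random time per example
    tb = t.view(-1, 1, 1)
    xt = (1 - tb) * x1 + tb * x0              # straight line from data to noise
    target = x0 - x1                          # constant velocity on that line
    keep = (torch.rand(x1.shape[:2] + (1,)) > mask_ratio).float()
    cond = x1 * keep                          # model sees only unmasked frames
    pred = model(xt, t, cond)                 # regress the velocity field
    return torch.mean((pred - target) ** 2)

loss = flow_matching_loss(TinyVelocityNet(), torch.randn(4, 100, 80))
```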
- DiffAR: Denoising Diffusion Autoregressive Model for Raw Speech Waveform Generation [25.968115316199246]
This work proposes a diffusion probabilistic end-to-end model for generating a raw speech waveform.
Our model is autoregressive, generating overlapping frames sequentially, where each frame is conditioned on a portion of the previously generated one.
Experiments show that the proposed model generates speech with superior quality compared with other state-of-the-art neural speech generation systems.
arXiv Detail & Related papers (2023-10-02T17:42:22Z)
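The overlapping-frame autoregression in the DiffAR summary above can be sketched as follows; sample_frame is a hypothetical placeholder for a full diffusion sampler, and the frame and overlap lengths are assumptions.

```python
import numpy as np

FRAME, OVERLAP = 1000, 200                   # illustrative sizes (samples)

def sample_frame(context, rng):
    """Hypothetical placeholder for a diffusion sampler conditioned on
    the tail of the previously generated frame."""
    return rng.standard_normal(FRAME) * 0.1 + np.tanh(context.mean())

def generate(n_frames, seed=0):
    rng = np.random.default_rng(seed)
    audio = list(rng.standard_normal(FRAME) * 0.1)   # bootstrap first frame
    fade = np.linspace(0.0, 1.0, OVERLAP)            # cross-fade window
    for _ in range(n_frames - 1):
        context = np.asarray(audio[-OVERLAP:])       # condition on the tail
        frame = sample_frame(context, rng)
        head, rest = frame[:OVERLAP], frame[OVERLAP:]
        # Overlap-add: blend the new head into the previous tail.
        blended = (1 - fade) * context + fade * head
        audio[-OVERLAP:] = blended.tolist()
        audio.extend(rest.tolist())
    return np.asarray(audio)

waveform = generate(n_frames=5)              # 5*FRAME - 4*OVERLAP samples
```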
- From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion [84.138804145918]
Deep generative models can generate high-fidelity audio conditioned on various types of representations.
However, these models are prone to generating audible artifacts when the conditioning is flawed or imperfect.
We propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality from low-bitrate discrete representations.
arXiv Detail & Related papers (2023-08-02T22:14:29Z)
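To make the multi-band idea above concrete, the sketch below splits a signal into disjoint frequency bands that sum back to the original; the FFT-mask filtering and the band edges are illustrative assumptions.

```python
import numpy as np

def split_bands(x, sr, edges):
    """Split x into frequency bands via FFT masking; bands sum back to x."""
    spec = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
    return [np.fft.irfft(spec * ((freqs >= lo) & (freqs < hi)), n=len(x))
            for lo, hi in zip(edges[:-1], edges[1:])]

sr = 16000
x = np.random.default_rng(0).standard_normal(sr)
bands = split_bands(x, sr, edges=(0, 1000, 4000, sr // 2 + 1))
print(np.max(np.abs(sum(bands) - x)))        # ~1e-12: bands tile the spectrum
# In a multi-band diffusion framework, each band would be generated by its
# own model (conditioned on the low-bitrate discrete tokens) and the band
# outputs summed at the end.
```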
- SeqDiffuSeq: Text Diffusion with Encoder-Decoder Transformers [50.90457644954857]
In this work, we apply diffusion models to sequence-to-sequence text generation.
We propose SeqDiffuSeq, a text diffusion model for sequence-to-sequence generation.
Experimental results show good performance on sequence-to-sequence generation in terms of both text quality and inference time.
arXiv Detail & Related papers (2022-12-20T15:16:24Z)
- Period VITS: Variational Inference with Explicit Pitch Modeling for End-to-end Emotional Speech Synthesis [19.422230767803246]
We propose Period VITS, a novel end-to-end text-to-speech model that incorporates an explicit periodicity generator.
In the proposed method, we introduce a frame pitch predictor that predicts prosodic features, such as pitch and voicing flags, from the input text.
From these features, the proposed periodicity generator produces a sample-level sinusoidal source that enables the waveform decoder to accurately reproduce the pitch.
arXiv Detail & Related papers (2022-10-28T07:52:30Z)
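The periodicity generator described above can be illustrated with a sample-level sinusoidal source driven by frame-level pitch and voicing flags; the hop size and noise level below are assumptions, not Period VITS's exact configuration.

```python
import numpy as np

def sinusoidal_source(f0, voiced, sr=22050, hop=256, noise_std=0.03):
    """f0, voiced: per-frame pitch (Hz) and voicing flags (0.0 or 1.0)."""
    f0_up = np.repeat(f0, hop)                       # frame rate -> sample rate
    v_up = np.repeat(voiced, hop)
    phase = 2 * np.pi * np.cumsum(f0_up / sr)        # integrate frequency
    noise = np.random.default_rng(0).standard_normal(len(f0_up)) * noise_std
    # Voiced regions carry a sine at the target pitch; unvoiced carry noise.
    return v_up * np.sin(phase) + (1 - v_up) * noise

f0 = np.array([220.0, 220.0, 246.9, 0.0])            # A3, A3, B3, unvoiced
excitation = sinusoidal_source(f0, voiced=(f0 > 0).astype(float))
# Conditioning a waveform decoder on this source lets it reproduce the
# exact pitch instead of inferring periodicity on its own.
```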
- A Generative Model for Raw Audio Using Transformer Architectures [4.594159253008448]
This paper proposes a novel way of doing audio synthesis at the waveform level using Transformer architectures.
We propose a deep neural network for generating waveforms, similar to WaveNet (Oord et al., 2016).
Our approach outperforms a widely used WaveNet architecture by up to 9% on a similar dataset when predicting the next step.
arXiv Detail & Related papers (2021-06-30T13:05:31Z)
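A minimal sketch of next-step prediction on quantized audio with a causal Transformer, in the spirit of the entry above; the 8-bit quantization, model sizes, and context length are illustrative assumptions.

```python
import torch
import torch.nn as nn

Q, D, CTX = 256, 128, 512          # quantization levels (e.g. 8-bit mu-law),
                                   # model width, maximum context length

class NextSampleTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(Q, D)
        self.pos = nn.Parameter(torch.zeros(CTX, D))
        layer = nn.TransformerEncoderLayer(D, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D, Q)

    def forward(self, x):                     # x: (batch, T) ints in [0, Q)
        T = x.size(1)
        h = self.embed(x) + self.pos[:T]
        causal = nn.Transformer.generate_square_subsequent_mask(T)
        h = self.encoder(h, mask=causal)      # each step sees only the past
        return self.head(h)                   # logits for the next sample

model = NextSampleTransformer()
x = torch.randint(0, Q, (2, 64))
logits = model(x)                             # (2, 64, Q)
loss = nn.functional.cross_entropy(           # predict sample t+1 from <= t
    logits[:, :-1].reshape(-1, Q), x[:, 1:].reshape(-1))
```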
- CRASH: Raw Audio Score-based Generative Modeling for Controllable High-resolution Drum Sound Synthesis [0.0]
We propose a novel score-based generative model for unconditional raw audio synthesis.
Our proposed method closes the gap with GAN-based methods on raw audio, while offering more flexible generation capabilities.
arXiv Detail & Related papers (2021-06-14T13:48:03Z)
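Score-based models of this kind are trained with denoising score matching; the sketch below shows that objective on toy data, using an illustrative score network that is not the paper's architecture.

```python
import torch
import torch.nn as nn

class ToyScoreNet(nn.Module):
    """Illustrative score network over flat audio chunks."""
    def __init__(self, n=1024):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n + 1, 256), nn.ReLU(),
                                 nn.Linear(256, n))

    def forward(self, x, sigma):              # x: (batch, n), sigma: (batch, 1)
        return self.net(torch.cat([x, sigma], dim=1))

def dsm_loss(score_net, x, sigma_min=0.01, sigma_max=1.0):
    """Train the network to estimate the score of noise-perturbed audio."""
    sigma = sigma_min * (sigma_max / sigma_min) ** torch.rand(x.size(0), 1)
    noise = torch.randn_like(x)
    x_noisy = x + sigma * noise
    target = -noise / sigma                   # score of the Gaussian kernel
    pred = score_net(x_noisy, sigma)
    return torch.mean(sigma ** 2 * (pred - target) ** 2)  # sigma^2 weighting

loss = dsm_loss(ToyScoreNet(), torch.randn(8, 1024))
# Sampling then follows the reverse SDE or Langevin dynamics with the
# learned score, which is where the flexible generation controls come from.
```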
- Unsupervised Cross-Domain Singing Voice Conversion [105.1021715879586]
We present a wav-to-wav generative model for the task of singing voice conversion from any identity.
Our method combines an acoustic model, trained for automatic speech recognition, with extracted melody features to drive a waveform-based generator.
arXiv Detail & Related papers (2020-08-06T18:29:11Z)
- Learning Noise-Aware Encoder-Decoder from Noisy Labels by Alternating Back-Propagation for Saliency Detection [54.98042023365694]
We propose a noise-aware encoder-decoder framework to disentangle a clean saliency predictor from noisy training examples.
The proposed model consists of two sub-models parameterized by neural networks.
arXiv Detail & Related papers (2020-07-23T18:47:36Z)
- Speech-to-Singing Conversion based on Boundary Equilibrium GAN [42.739822506085694]
This paper investigates the use of generative adversarial network (GAN)-based models for converting the spectrogram of a speech signal into that of a singing one.
The proposed model generates singing voices with much higher naturalness than an existing non-adversarially trained baseline.
arXiv Detail & Related papers (2020-05-28T08:18:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.