Parallel waveform synthesis based on generative adversarial networks with voicing-aware conditional discriminators
- URL: http://arxiv.org/abs/2010.14151v2
- Date: Mon, 26 Apr 2021 08:37:30 GMT
- Title: Parallel waveform synthesis based on generative adversarial networks with voicing-aware conditional discriminators
- Authors: Ryuichi Yamamoto, Eunwoo Song, Min-Jae Hwang, Jae-Min Kim
- Abstract summary: This paper proposes voicing-aware conditional discriminators for Parallel WaveGAN-based waveform synthesis systems.
We adopt a projection-based conditioning method that can significantly improve the discriminator's performance.
Subjective test results demonstrate the superiority of the proposed method over the conventional Parallel WaveGAN and WaveNet systems.
- Score: 25.794915063815665
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper proposes voicing-aware conditional discriminators for Parallel
WaveGAN-based waveform synthesis systems. In this framework, we adopt a
projection-based conditioning method that can significantly improve the
discriminator's performance. Furthermore, the conventional discriminator is
separated into two waveform discriminators for modeling voiced and unvoiced
speech. As the two discriminators learn the distinctive characteristics of the
harmonic and noise components, respectively, the adversarial training process
becomes more efficient, allowing the generator to produce more realistic speech
waveforms. Subjective test results demonstrate the superiority of the proposed
method over the conventional Parallel WaveGAN and WaveNet systems. In
particular, our speaker-independently trained model within a FastSpeech 2-based
text-to-speech framework achieves mean opinion scores of 4.20, 4.18, 4.21, and
4.31 for four Japanese speakers, respectively.
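Since the abstract describes the architecture only at a high level, the following is a minimal PyTorch sketch of its two ideas: projection-based conditioning (an unconditional score plus an inner product between the discriminator's hidden features and the projected condition) and a pair of waveform discriminators split by voicing. All class names, layer sizes, and the `vuv` mask input are illustrative assumptions, not the authors' implementation.

```python
import torch.nn as nn


class ProjectionDiscriminator(nn.Module):
    """Waveform discriminator with projection-based conditioning."""

    def __init__(self, aux_channels=80, channels=64, n_layers=5):
        super().__init__()
        layers, in_ch = [], 1
        for _ in range(n_layers):
            layers += [nn.Conv1d(in_ch, channels, kernel_size=5, padding=2),
                       nn.LeakyReLU(0.2)]
            in_ch = channels
        self.backbone = nn.Sequential(*layers)
        self.head = nn.Conv1d(channels, 1, kernel_size=3, padding=1)
        # Projection path: embeds the conditioning features (e.g., a
        # mel-spectrogram upsampled to waveform resolution) into the
        # discriminator's feature space.
        self.proj = nn.Conv1d(aux_channels, channels, kernel_size=1)

    def forward(self, x, c):
        # x: (B, 1, T) waveform; c: (B, aux_channels, T) conditioning.
        h = self.backbone(x)
        # Unconditional score plus the inner product with the projected
        # condition, i.e., the projection-conditioning term.
        return self.head(h) + (h * self.proj(c)).sum(dim=1, keepdim=True)


class VoicingAwareDiscriminators(nn.Module):
    """Two projection discriminators, one per voicing class."""

    def __init__(self, aux_channels=80):
        super().__init__()
        self.d_voiced = ProjectionDiscriminator(aux_channels)
        self.d_unvoiced = ProjectionDiscriminator(aux_channels)

    def forward(self, x, c, vuv):
        # vuv: (B, 1, T) mask, 1.0 on voiced samples and 0.0 on unvoiced
        # ones (e.g., derived from an F0 extractor at training time).
        score_v = self.d_voiced(x, c) * vuv
        score_uv = self.d_unvoiced(x, c) * (1.0 - vuv)
        return score_v, score_uv
```

In training, the adversarial loss would be accumulated from both score maps so that each discriminator only ever judges segments of its own voicing class, while the generator itself remains a standard Parallel WaveGAN.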
Related papers
- DiffAR: Denoising Diffusion Autoregressive Model for Raw Speech Waveform Generation [25.968115316199246]
This work proposes a diffusion probabilistic end-to-end model for generating a raw speech waveform.
Our model is autoregressive, generating overlapping frames sequentially, where each frame is conditioned on a portion of the previously generated one.
Experiments show that the proposed model generates speech with superior quality compared with other state-of-the-art neural speech generation systems.
arXiv Detail & Related papers (2023-10-02T17:42:22Z)
- High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
The non-autoregressive framework enhances controllability, and the duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z)
- Diffusion Conditional Expectation Model for Efficient and Robust Target Speech Extraction [73.43534824551236]
We propose an efficient generative approach named Diffusion Conditional Expectation Model (DCEM) for Target Speech Extraction (TSE).
It can handle multi- and single-speaker scenarios in both noisy and clean conditions.
Our method outperforms conventional methods in terms of both intrusive and non-intrusive metrics.
arXiv Detail & Related papers (2023-09-25T04:58:38Z)
- Wave-U-Net Discriminator: Fast and Lightweight Discriminator for Generative Adversarial Network-Based Speech Synthesis [38.27153023145183]
In speech synthesis, a generative adversarial network (GAN) is used to train a generator (speech synthesizer) and a discriminator in a min-max game.
An ensemble of discriminators is commonly used in recent neural vocoders (e.g., HiFi-GAN) and end-to-end text-to-speech (TTS) systems.
This study proposes a Wave-U-Net discriminator, which is a single but expressive discriminator with Wave-U-Net architecture (a rough sketch appears after this list).
arXiv Detail & Related papers (2023-03-24T10:46:40Z)
- TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech shows a significant improvement in inference latency, enabling a speedup of up to 21.4x over the autoregressive technique.
arXiv Detail & Related papers (2022-05-25T06:34:14Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- Facetron: Multi-speaker Face-to-Speech Model based on Cross-modal Latent Representations [22.14238843571225]
We propose an effective method to synthesize speaker-specific speech waveforms by conditioning on videos of an individual's face.
The linguistic features are extracted from lip movements using a lip-reading model, and the speaker characteristic features are predicted from face images.
We show the superiority of our proposed model over conventional methods in terms of both objective and subjective evaluation results.
arXiv Detail & Related papers (2021-07-26T07:36:02Z)
- End-to-End Video-To-Speech Synthesis using Generative Adversarial Networks [54.43697805589634]
We propose a new end-to-end video-to-speech model based on Generative Adversarial Networks (GANs).
Our model consists of an encoder-decoder architecture that receives raw video as input and generates speech.
We show that this model is able to reconstruct speech with remarkable realism for constrained datasets such as GRID.
arXiv Detail & Related papers (2021-04-27T17:12:30Z)
- Parametric Representation for Singing Voice Synthesis: a Comparative Evaluation [10.37199090634032]
The goal of this paper is twofold. First, a comparative subjective evaluation is performed across four existing techniques suitable for statistical parametric synthesis.
Second, artifacts occurring in high-pitched voices are discussed, and possible approaches to overcome them are suggested.
arXiv Detail & Related papers (2020-06-07T13:06:30Z)
- End-to-End Adversarial Text-to-Speech [33.01223309795122]
We learn to synthesise speech from normalised text or phonemes in an end-to-end manner.
Our proposed generator is feed-forward and thus efficient for both training and inference.
It learns to produce high fidelity audio through a combination of adversarial feedback and prediction losses.
arXiv Detail & Related papers (2020-06-05T17:41:05Z)
- Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from a multi-speaker mixture.
arXiv Detail & Related papers (2020-01-02T11:12:50Z)
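As a counterpoint to the two-discriminator design above, the Wave-U-Net entry in this list replaces an ensemble of discriminators with a single encoder-decoder network that scores every waveform sample at the input resolution. A minimal PyTorch sketch, with depth, kernel sizes, and channel counts chosen for illustration rather than taken from that paper:

```python
import torch
import torch.nn as nn


class WaveUNetDiscriminator(nn.Module):
    """Single U-Net-shaped discriminator over raw waveforms (illustrative)."""

    def __init__(self):
        super().__init__()
        # Encoder: strided 1-D convolutions halve the time resolution.
        self.enc1 = nn.Sequential(nn.Conv1d(1, 16, 15, stride=2, padding=7),
                                  nn.LeakyReLU(0.2))
        self.enc2 = nn.Sequential(nn.Conv1d(16, 32, 15, stride=2, padding=7),
                                  nn.LeakyReLU(0.2))
        self.enc3 = nn.Sequential(nn.Conv1d(32, 64, 15, stride=2, padding=7),
                                  nn.LeakyReLU(0.2))
        # Decoder: transposed convolutions restore the resolution; skip
        # connections concatenate the matching encoder features.
        self.dec1 = nn.Sequential(
            nn.ConvTranspose1d(64, 32, 16, stride=2, padding=7),
            nn.LeakyReLU(0.2))
        self.dec2 = nn.Sequential(
            nn.ConvTranspose1d(64, 16, 16, stride=2, padding=7),
            nn.LeakyReLU(0.2))
        self.head = nn.ConvTranspose1d(32, 1, 16, stride=2, padding=7)

    def forward(self, x):
        # x: (B, 1, T) raw waveform, with T divisible by 8.
        e1 = self.enc1(x)                           # (B, 16, T/2)
        e2 = self.enc2(e1)                          # (B, 32, T/4)
        e3 = self.enc3(e2)                          # (B, 64, T/8)
        d1 = torch.cat([self.dec1(e3), e2], dim=1)  # (B, 64, T/4)
        d2 = torch.cat([self.dec2(d1), e1], dim=1)  # (B, 32, T/2)
        return self.head(d2)                        # (B, 1, T) sample scores
```

Because the output keeps the input's time resolution, the network returns a real/fake judgment per sample rather than a single scalar, which is what lets one discriminator stand in for an ensemble.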