Parallel waveform synthesis based on generative adversarial networks with voicing-aware conditional discriminators
- URL: http://arxiv.org/abs/2010.14151v2
- Date: Mon, 26 Apr 2021 08:37:30 GMT
- Title: Parallel waveform synthesis based on generative adversarial networks with voicing-aware conditional discriminators
- Authors: Ryuichi Yamamoto, Eunwoo Song, Min-Jae Hwang, Jae-Min Kim
- Abstract summary: This paper proposes voicing-aware conditional discriminators for Parallel WaveGAN-based waveform synthesis systems.
We adopt a projection-based conditioning method that can significantly improve the discriminator's performance.
Subjective test results demonstrate the superiority of the proposed method over the conventional Parallel WaveGAN and WaveNet systems.
- Score: 25.794915063815665
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper proposes voicing-aware conditional discriminators for Parallel
WaveGAN-based waveform synthesis systems. In this framework, we adopt a
projection-based conditioning method that can significantly improve the
discriminator's performance. Furthermore, the conventional discriminator is
separated into two waveform discriminators for modeling voiced and unvoiced
speech. As the two discriminators learn the distinctive characteristics of the
harmonic and noise components, respectively, the adversarial training process
becomes more efficient, allowing the generator to produce more realistic speech
waveforms. Subjective test results demonstrate the superiority of the proposed
method over the conventional Parallel WaveGAN and WaveNet systems. In
particular, our speaker-independently trained model within a FastSpeech 2-based
text-to-speech framework achieves mean opinion scores of 4.20, 4.18, 4.21, and
4.31 for four Japanese speakers, respectively.
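Since the abstract describes the architecture only at a high level, the following is a minimal PyTorch sketch of its two ideas: projection-based conditioning (an unconditional score plus an inner product between the discriminator's hidden features and the projected condition) and a pair of waveform discriminators split by voicing. All class names, layer sizes, and the `vuv` mask input are illustrative assumptions, not the authors' implementation.

```python
import torch.nn as nn


class ProjectionDiscriminator(nn.Module):
    """Waveform discriminator with projection-based conditioning."""

    def __init__(self, aux_channels=80, channels=64, n_layers=5):
        super().__init__()
        layers, in_ch = [], 1
        for _ in range(n_layers):
            layers += [nn.Conv1d(in_ch, channels, kernel_size=5, padding=2),
                       nn.LeakyReLU(0.2)]
            in_ch = channels
        self.backbone = nn.Sequential(*layers)
        self.head = nn.Conv1d(channels, 1, kernel_size=3, padding=1)
        # Projection path: embeds the conditioning features (e.g., a
        # mel-spectrogram upsampled to waveform resolution) into the
        # discriminator's feature space.
        self.proj = nn.Conv1d(aux_channels, channels, kernel_size=1)

    def forward(self, x, c):
        # x: (B, 1, T) waveform; c: (B, aux_channels, T) conditioning.
        h = self.backbone(x)
        # Unconditional score plus the inner product with the projected
        # condition, i.e., the projection-conditioning term.
        return self.head(h) + (h * self.proj(c)).sum(dim=1, keepdim=True)


class VoicingAwareDiscriminators(nn.Module):
    """Two projection discriminators, one per voicing class."""

    def __init__(self, aux_channels=80):
        super().__init__()
        self.d_voiced = ProjectionDiscriminator(aux_channels)
        self.d_unvoiced = ProjectionDiscriminator(aux_channels)

    def forward(self, x, c, vuv):
        # vuv: (B, 1, T) mask, 1.0 on voiced samples and 0.0 on unvoiced
        # ones (e.g., derived from an F0 extractor at training time).
        score_v = self.d_voiced(x, c) * vuv
        score_uv = self.d_unvoiced(x, c) * (1.0 - vuv)
        return score_v, score_uv
```

In training, the adversarial loss would be accumulated from both score maps so that each discriminator only ever judges segments of its own voicing class, while the generator itself remains a standard Parallel WaveGAN.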
Related papers
- DiffAR: Denoising Diffusion Autoregressive Model for Raw Speech Waveform Generation [25.968115316199246]
This work proposes a diffusion probabilistic end-to-end model for generating a raw speech waveform.
Our model is autoregressive, generating overlapping frames sequentially, where each frame is conditioned on a portion of the previously generated one.
Experiments show that the proposed model generates speech with superior quality compared with other state-of-the-art neural speech generation systems.
arXiv Detail & Related papers (2023-10-02T17:42:22Z)
- High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
The non-autoregressive framework enhances controllability, and the duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z)
- Diffusion Conditional Expectation Model for Efficient and Robust Target Speech Extraction [73.43534824551236]
We propose an efficient generative approach named Diffusion Conditional Expectation Model (DCEM) for Target Speech Extraction (TSE).
It can handle multi- and single-speaker scenarios in both noisy and clean conditions.
Our method outperforms conventional methods in terms of both intrusive and non-intrusive metrics.
arXiv Detail & Related papers (2023-09-25T04:58:38Z)
- Wave-U-Net Discriminator: Fast and Lightweight Discriminator for Generative Adversarial Network-Based Speech Synthesis [38.27153023145183]
In speech synthesis, a generative adversarial network (GAN) is used to train a generator (speech synthesizer) and a discriminator in a min-max game.
An ensemble of discriminators is commonly used in recent neural vocoders (e.g., HiFi-GAN) and end-to-end text-to-speech (TTS) systems.
This study proposes a Wave-U-Net discriminator, which is a single but expressive discriminator with Wave-U-Net architecture (a rough sketch appears after this list).
arXiv Detail & Related papers (2023-03-24T10:46:40Z)
- TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech shows a significant improvement in inference latency, enabling a speedup of up to 21.4x over the autoregressive technique.
arXiv Detail & Related papers (2022-05-25T06:34:14Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- Facetron: Multi-speaker Face-to-Speech Model based on Cross-modal Latent Representations [22.14238843571225]
We propose an effective method to synthesize speaker-specific speech waveforms by conditioning on videos of an individual's face.
The linguistic features are extracted from lip movements using a lip-reading model, and the speaker characteristic features are predicted from face images.
We show the superiority of our proposed model over conventional methods in terms of both objective and subjective evaluation results.
arXiv Detail & Related papers (2021-07-26T07:36:02Z)
- End-to-End Video-To-Speech Synthesis using Generative Adversarial Networks [54.43697805589634]
We propose a new end-to-end video-to-speech model based on Generative Adversarial Networks (GANs).
Our model consists of an encoder-decoder architecture that receives raw video as input and generates speech.
We show that this model is able to reconstruct speech with remarkable realism for constrained datasets such as GRID.
arXiv Detail & Related papers (2021-04-27T17:12:30Z)
- Parametric Representation for Singing Voice Synthesis: a Comparative Evaluation [10.37199090634032]
The goal of this paper is twofold. First, a comparative subjective evaluation is performed across four existing techniques suitable for statistical parametric synthesis.
Second, artifacts occurring in high-pitched voices are discussed, and possible approaches to overcome them are suggested.
arXiv Detail & Related papers (2020-06-07T13:06:30Z)
- End-to-End Adversarial Text-to-Speech [33.01223309795122]
We learn to synthesise speech from normalised text or phonemes in an end-to-end manner.
Our proposed generator is feed-forward and thus efficient for both training and inference.
It learns to produce high fidelity audio through a combination of adversarial feedback and prediction losses.
arXiv Detail & Related papers (2020-06-05T17:41:05Z)
- Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from a multi-speaker mixture.
arXiv Detail & Related papers (2020-01-02T11:12:50Z)
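As a counterpoint to the two-discriminator design above, the Wave-U-Net entry in this list replaces an ensemble of discriminators with a single encoder-decoder network that scores every waveform sample at the input resolution. A minimal PyTorch sketch, with depth, kernel sizes, and channel counts chosen for illustration rather than taken from that paper:

```python
import torch
import torch.nn as nn


class WaveUNetDiscriminator(nn.Module):
    """Single U-Net-shaped discriminator over raw waveforms (illustrative)."""

    def __init__(self):
        super().__init__()
        # Encoder: strided 1-D convolutions halve the time resolution.
        self.enc1 = nn.Sequential(nn.Conv1d(1, 16, 15, stride=2, padding=7),
                                  nn.LeakyReLU(0.2))
        self.enc2 = nn.Sequential(nn.Conv1d(16, 32, 15, stride=2, padding=7),
                                  nn.LeakyReLU(0.2))
        self.enc3 = nn.Sequential(nn.Conv1d(32, 64, 15, stride=2, padding=7),
                                  nn.LeakyReLU(0.2))
        # Decoder: transposed convolutions restore the resolution; skip
        # connections concatenate the matching encoder features.
        self.dec1 = nn.Sequential(
            nn.ConvTranspose1d(64, 32, 16, stride=2, padding=7),
            nn.LeakyReLU(0.2))
        self.dec2 = nn.Sequential(
            nn.ConvTranspose1d(64, 16, 16, stride=2, padding=7),
            nn.LeakyReLU(0.2))
        self.head = nn.ConvTranspose1d(32, 1, 16, stride=2, padding=7)

    def forward(self, x):
        # x: (B, 1, T) raw waveform, with T divisible by 8.
        e1 = self.enc1(x)                           # (B, 16, T/2)
        e2 = self.enc2(e1)                          # (B, 32, T/4)
        e3 = self.enc3(e2)                          # (B, 64, T/8)
        d1 = torch.cat([self.dec1(e3), e2], dim=1)  # (B, 64, T/4)
        d2 = torch.cat([self.dec2(d1), e1], dim=1)  # (B, 32, T/2)
        return self.head(d2)                        # (B, 1, T) sample scores
```

Because the output keeps the input's time resolution, the network returns a real/fake judgment per sample rather than a single scalar, which is what lets one discriminator stand in for an ensemble.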