Wave-U-Net Discriminator: Fast and Lightweight Discriminator for
Generative Adversarial Network-Based Speech Synthesis
- URL: http://arxiv.org/abs/2303.13909v1
- Date: Fri, 24 Mar 2023 10:46:40 GMT
- Title: Wave-U-Net Discriminator: Fast and Lightweight Discriminator for
Generative Adversarial Network-Based Speech Synthesis
- Authors: Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, Shogo Seki
- Abstract summary: In speech synthesis, a generative adversarial network (GAN) is used to train a generator (speech synthesizer) and a discriminator in a min-max game.
An ensemble of discriminators is commonly used in recent neural vocoders (e.g., HiFi-GAN) and end-to-end text-to-speech (TTS) systems (e.g., VITS).
This study proposes a Wave-U-Net discriminator, which is a single but expressive discriminator with Wave-U-Net architecture.
- Score: 38.27153023145183
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In speech synthesis, a generative adversarial network (GAN), training a
generator (speech synthesizer) and a discriminator in a min-max game, is widely
used to improve speech quality. An ensemble of discriminators is commonly used
in recent neural vocoders (e.g., HiFi-GAN) and end-to-end text-to-speech (TTS)
systems (e.g., VITS) to scrutinize waveforms from multiple perspectives. Such
discriminators allow synthesized speech to closely approach real speech;
however, their model size and computation time grow with the number of
discriminators. Instead, this study proposes a Wave-U-Net discriminator: a
single but expressive discriminator built on the Wave-U-Net architecture. This
discriminator is unique in that it
can assess a waveform in a sample-wise manner with the same resolution as the
input signal, while extracting multilevel features via an encoder and decoder
with skip connections. This architecture provides a generator with sufficiently
rich information for the synthesized speech to be closely matched to the real
speech. During the experiments, the proposed ideas were applied to a
representative neural vocoder (HiFi-GAN) and an end-to-end TTS system (VITS).
The results demonstrate that the proposed models achieve comparable speech
quality with a discriminator that is 2.31 times faster and 14.5 times smaller
when used in HiFi-GAN, and 1.90 times faster and 9.62 times smaller when used
in VITS. Audio samples are available at
https://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/waveunetd/.
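The architecture described in the abstract maps naturally onto a 1-D U-Net over raw waveforms. Below is a minimal PyTorch sketch of such a discriminator, written only from the abstract's description: the channel widths, four-level depth, kernel sizes, and all names (e.g., WaveUNetDiscriminator) are illustrative assumptions, not the authors' configuration. A strided-convolution encoder extracts multilevel features, a transposed-convolution decoder restores the input resolution through skip connections, and a final 1x1 convolution emits one logit per input sample.

```python
# Minimal sketch of a Wave-U-Net-style discriminator (assumed hyperparameters).
import torch
import torch.nn as nn


def down(in_ch, out_ch):
    # Strided conv halves the temporal resolution (assumed kernel size 15).
    return nn.Sequential(
        nn.Conv1d(in_ch, out_ch, kernel_size=15, stride=2, padding=7),
        nn.LeakyReLU(0.2),
    )


def up(in_ch, out_ch):
    # Transposed conv exactly doubles the temporal resolution.
    return nn.Sequential(
        nn.ConvTranspose1d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
        nn.LeakyReLU(0.2),
    )


class WaveUNetDiscriminator(nn.Module):
    """Single discriminator that scores a waveform sample by sample."""

    def __init__(self):
        super().__init__()
        self.enc1 = down(1, 16)    # T    -> T/2
        self.enc2 = down(16, 32)   # T/2  -> T/4
        self.enc3 = down(32, 64)   # T/4  -> T/8
        self.enc4 = down(64, 128)  # T/8  -> T/16 (bottleneck)
        self.dec4 = up(128, 64)        # T/16 -> T/8
        self.dec3 = up(64 + 64, 32)    # input includes skip from enc3
        self.dec2 = up(32 + 32, 16)    # input includes skip from enc2
        self.dec1 = up(16 + 16, 16)    # input includes skip from enc1; back to T
        self.head = nn.Conv1d(16, 1, kernel_size=1)  # one logit per sample

    def forward(self, x):
        # x: (batch, 1, T) with T divisible by 16 so skip shapes align.
        e1 = self.enc1(x)
        e2 = self.enc2(e1)
        e3 = self.enc3(e2)
        e4 = self.enc4(e3)
        d4 = self.dec4(e4)
        d3 = self.dec3(torch.cat([d4, e3], dim=1))
        d2 = self.dec2(torch.cat([d3, e2], dim=1))
        d1 = self.dec1(torch.cat([d2, e1], dim=1))
        return self.head(d1)  # (batch, 1, T): same resolution as the input


if __name__ == "__main__":
    wav = torch.randn(2, 1, 8192)        # dummy batch of waveforms
    logits = WaveUNetDiscriminator()(wav)
    assert logits.shape == wav.shape     # sample-wise output at input resolution
```

Because the output carries one logit per waveform sample, the adversarial loss can supervise the generator at full input resolution; this is the sample-wise, multilevel-feature property the abstract highlights as the source of the discriminator's expressiveness.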
Related papers
- Multi-modal Adversarial Training for Zero-Shot Voice Cloning [9.823246184635103]
We propose a Transformer encoder-decoder architecture to conditionally discriminate between real and generated speech features.
We introduce our novel adversarial training technique by applying it to a FastSpeech2 acoustic model and training on Libriheavy, a large multi-speaker dataset.
Our model achieves improvements over the baseline in terms of speech quality and speaker similarity.
arXiv Detail & Related papers (2024-08-28T16:30:41Z)
- Fill in the Gap! Combining Self-supervised Representation Learning with Neural Audio Synthesis for Speech Inpainting [14.402357651227003]
We investigate the use of a speech SSL model for speech inpainting, that is, reconstructing a missing portion of a speech signal from its surrounding context.
To that end, we combine an SSL encoder, namely HuBERT, with a neural vocoder, namely HiFiGAN, playing the role of a decoder.
arXiv Detail & Related papers (2024-05-30T14:41:39Z)
- Cross-Speaker Encoding Network for Multi-Talker Speech Recognition [74.97576062152709]
The Cross-Speaker Encoding (CSE) network addresses limitations of SIMO models by aggregating cross-speaker representations.
The CSE network is integrated with SOT to leverage the advantages of both SIMO and SISO.
arXiv Detail & Related papers (2024-01-08T16:37:45Z)
- NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers [90.83782600932567]
We develop NaturalSpeech 2, a TTS system that leverages a neural audio codec with residual vector quantizers to obtain quantized latent vectors.
We scale NaturalSpeech 2 to large-scale datasets with 44K hours of speech and singing data and evaluate its voice quality on unseen speakers.
NaturalSpeech 2 outperforms previous TTS systems by a large margin in terms of prosody/timbre similarity, robustness, and voice quality in a zero-shot setting.
arXiv Detail & Related papers (2023-04-18T16:31:59Z)
- Self-Supervised Learning for Speech Enhancement through Synthesis [5.924928860260821]
We propose a denoising vocoder (DeVo) approach, where a vocoder accepts noisy representations and learns to directly synthesize clean speech.
We demonstrate a causal version capable of running on streaming audio with 10ms latency and minimal performance degradation.
arXiv Detail & Related papers (2022-11-04T16:06:56Z)
- Avocodo: Generative Adversarial Network for Artifact-free Vocoder [5.956832212419584]
We propose a GAN-based neural vocoder, called Avocodo, that allows the synthesis of high-fidelity speech with reduced artifacts.
Avocodo outperforms conventional GAN-based neural vocoders in both speech and singing voice synthesis tasks and can synthesize artifact-free speech.
arXiv Detail & Related papers (2022-06-27T15:54:41Z)
- Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings [53.11450530896623]
This paper presents a streaming speaker-attributed automatic speech recognition (SA-ASR) model that can recognize "who spoke what."
Our model is based on token-level serialized output training (t-SOT) which was recently proposed to transcribe multi-talker speech in a streaming fashion.
The proposed model achieves substantially better accuracy than a prior streaming model and shows comparable or sometimes even superior results to the state-of-the-art offline SA-ASR model.
arXiv Detail & Related papers (2022-03-30T21:42:00Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- NeuralDPS: Neural Deterministic Plus Stochastic Model with Multiband Excitation for Noise-Controllable Waveform Generation [67.96138567288197]
We propose a novel neural vocoder named NeuralDPS that retains high speech quality while achieving high synthesis efficiency and noise controllability.
It generates waveforms at least 280 times faster than the WaveNet vocoder.
Its synthesis is also 28% faster than WaveGAN's on a single CPU core.
arXiv Detail & Related papers (2022-03-05T08:15:29Z)
- StyleMelGAN: An Efficient High-Fidelity Adversarial Vocoder with Temporal Adaptive Normalization [9.866072912049031]
StyleMelGAN is a lightweight neural vocoder allowing synthesis of high-fidelity speech with low computational complexity.
StyleMelGAN employs temporal adaptive normalization to style a low-dimensional noise vector with the acoustic features of the target speech.
The highly parallelizable speech generation is several times faster than real-time on CPUs and GPUs.
arXiv Detail & Related papers (2020-11-03T08:28:47Z)