PhaseAug: A Differentiable Augmentation for Speech Synthesis to Simulate One-to-Many Mapping
- URL: http://arxiv.org/abs/2211.04610v1
- Date: Tue, 8 Nov 2022 23:37:05 GMT
- Title: PhaseAug: A Differentiable Augmentation for Speech Synthesis to Simulate One-to-Many Mapping
- Authors: Junhyeok Lee, Seungu Han, Hyunjae Cho, Wonbin Jung
- Abstract summary: We present PhaseAug, the first differentiable augmentation for speech synthesis that rotates the phase of each frequency bin to simulate one-to-many mapping.
- Score: 0.3277163122167433
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Previous generative adversarial network (GAN)-based neural vocoders are trained to reconstruct the exact ground-truth waveform from the paired mel-spectrogram and do not consider the one-to-many nature of speech synthesis: a mel-spectrogram discards phase, so a single mel-spectrogram corresponds to many valid waveforms. This conventional training overfits both the discriminators and the generator, leading to periodicity artifacts in the generated audio. In this work, we present PhaseAug, the first differentiable augmentation for speech synthesis, which rotates the phase of each frequency bin to simulate one-to-many mapping. With our proposed method, we outperform baselines without any architecture modification. Code and audio samples will be available at https://github.com/mindslab-ai/phaseaug.
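As a concrete illustration of the idea in the abstract, below is a minimal, hedged PyTorch sketch of per-bin phase rotation: take the STFT, rotate every frequency bin by a random angle, and resynthesize. The function name, the sampling scheme for the angles, and the STFT parameters are illustrative assumptions, not the paper's exact recipe; consult the linked repository for the official implementation.

```python
# Minimal sketch of phase-bin rotation (illustrative; not the official PhaseAug code).
import torch

def phase_rotate(wav: torch.Tensor, n_fft: int = 1024, hop: int = 256) -> torch.Tensor:
    """Rotate each STFT frequency bin by a random angle; every op is differentiable."""
    window = torch.hann_window(n_fft, device=wav.device)
    spec = torch.stft(wav, n_fft, hop_length=hop, window=window, return_complex=True)
    # One random angle per frequency bin (assumed sampling scheme), shared across frames.
    phi = 2 * torch.pi * torch.rand(spec.shape[-2], 1, device=wav.device)
    rotated = spec * torch.exp(1j * phi)  # multiplying by e^{j*phi} rotates the phase
    return torch.istft(rotated, n_fft, hop_length=hop, window=window, length=wav.shape[-1])

# Applied identically to paired real and generated waveforms before the discriminator,
# this exposes the model to many valid waveforms per mel-spectrogram.
x = torch.randn(1, 22050)   # dummy one-second waveform at 22.05 kHz
x_aug = phase_rotate(x)
```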
Related papers
- A Closer Look at Neural Codec Resynthesis: Bridging the Gap between Codec and Waveform Generation [65.05719674893999]
We study two different strategies based on token prediction and regression, and introduce a new method based on Schrödinger Bridge.
We examine how different design choices affect machine and human perception.
arXiv Detail & Related papers (2024-10-29T18:29:39Z)
- DiffAR: Denoising Diffusion Autoregressive Model for Raw Speech Waveform Generation [25.968115316199246]
This work proposes a diffusion probabilistic end-to-end model for generating a raw speech waveform.
Our model is autoregressive, generating overlapping frames sequentially, where each frame is conditioned on a portion of the previously generated one (a minimal sketch of this overlap scheme appears after this list).
Experiments show that the proposed model generates speech with superior quality compared with other state-of-the-art neural speech generation systems.
arXiv Detail & Related papers (2023-10-02T17:42:22Z)
- Discretization and Re-synthesis: an alternative method to solve the Cocktail Party Problem [65.25725367771075]
This study demonstrates, for the first time, that the synthesis-based approach can also perform well on this problem.
Specifically, we propose a novel speech separation/enhancement model based on the recognition of discrete symbols.
After predicting the discrete symbol sequence, each target's speech can be re-synthesized by feeding the symbols to the synthesis model (see the toy data-flow sketch after this list).
arXiv Detail & Related papers (2021-12-17T08:35:40Z)
- A Generative Model for Raw Audio Using Transformer Architectures [4.594159253008448]
This paper proposes a novel way of doing audio synthesis at the waveform level using Transformer architectures.
We propose a deep neural network for generating waveforms, similar to WaveNet (van den Oord et al., 2016).
Our approach outperforms a widely used wavenet architecture by up to 9% on a similar dataset for predicting the next step.
arXiv Detail & Related papers (2021-06-30T13:05:31Z)
- WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis [80.60577805727624]
WaveGrad 2 is a non-autoregressive generative model for text-to-speech synthesis.
It can generate high fidelity audio, approaching the performance of a state-of-the-art neural TTS system.
arXiv Detail & Related papers (2021-06-17T17:09:21Z)
- End-to-End Video-To-Speech Synthesis using Generative Adversarial Networks [54.43697805589634]
We propose a new end-to-end video-to-speech model based on Generative Adversarial Networks (GANs).
Our model consists of an encoder-decoder architecture that receives raw video as input and generates speech.
We show that this model is able to reconstruct speech with remarkable realism for constrained datasets such as GRID.
arXiv Detail & Related papers (2021-04-27T17:12:30Z)
- Wave-Tacotron: Spectrogram-free end-to-end text-to-speech synthesis [25.234945748885348]
We describe a sequence-to-sequence neural network which directly generates speech waveforms from text inputs.
The architecture extends the Tacotron model by incorporating a normalizing flow into the autoregressive decoder loop.
Experiments show that the proposed model generates speech with quality approaching a state-of-the-art neural TTS system.
arXiv Detail & Related papers (2020-11-06T19:30:07Z)
- Conditioning Trick for Training Stable GANs [70.15099665710336]
We propose a conditioning trick, called difference departure from normality, applied to the generator network in response to instability issues during GAN training.
We force the generator's outputs to approach the departure-from-normality of real samples, computed in the spectral domain via the Schur decomposition (the classical measure is sketched after this list).
arXiv Detail & Related papers (2020-10-12T16:50:22Z)
- Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from multi-speaker mixture.
arXiv Detail & Related papers (2020-01-02T11:12:50Z)
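As referenced in the DiffAR entry above, here is a hedged sketch of the overlapping-frame autoregression that its summary describes. `sample_frame` is a hypothetical stand-in for the paper's conditional diffusion sampler, and the frame/overlap sizes are made up for illustration.

```python
# Illustrative overlapping-frame autoregression; sample_frame is a hypothetical
# stand-in for a diffusion sampler whose output frame starts with `context`.
import torch

def generate(sample_frame, n_frames: int, frame_len: int = 2048, overlap: int = 512):
    context = torch.zeros(overlap)                # conditioning tail; silence to start
    chunks = []
    for _ in range(n_frames):
        frame = sample_frame(context, frame_len)  # frame's head overlaps the previous tail
        chunks.append(frame[overlap:])            # emit only the newly generated part
        context = frame[-overlap:]                # next frame conditions on this tail
    return torch.cat(chunks)
```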
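The Discretization-and-Re-synthesis entry describes a two-stage pipeline; the toy sketch below only makes the data flow concrete. Both `recognizer` and `synthesizer` are hypothetical placeholders, not the paper's models.

```python
# Toy data flow for separation-by-resynthesis; both callables are placeholders.
def separate(mixture, recognizer, synthesizer):
    symbol_seqs = recognizer(mixture)  # one discrete-symbol sequence per target speaker
    return [synthesizer(seq) for seq in symbol_seqs]  # re-synthesize each target waveform
```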
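Finally, the "departure from normality" named in the Conditioning Trick entry is, in its classical (Henrici) form, computable from the Schur decomposition A = Q(D + N)Q*: with D holding the eigenvalues and N strictly upper triangular, dep_F(A) = ||N||_F = sqrt(||A||_F^2 - sum_i |lambda_i|^2). The sketch below computes that standard quantity; exactly how the paper applies it to the generator is beyond what the summary states.

```python
# Henrici's departure from normality: dep_F(A) = ||N||_F, where A = Q (D + N) Q*.
import torch

def departure_from_normality(A: torch.Tensor) -> torch.Tensor:
    eigvals = torch.linalg.eigvals(A)          # the diagonal D of the Schur factor
    frob_sq = (A.abs() ** 2).sum()             # ||A||_F^2
    gap = torch.clamp(frob_sq - (eigvals.abs() ** 2).sum(), min=0.0)  # guard rounding
    return torch.sqrt(gap)                     # = ||N||_F; zero iff A is normal

A = torch.randn(8, 8)
print(departure_from_normality(A))             # a symmetric A would give ~0
```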
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.