Non-parallel Emotion Conversion using a Deep-Generative Hybrid Network
and an Adversarial Pair Discriminator
- URL: http://arxiv.org/abs/2007.12932v2
- Date: Mon, 10 Aug 2020 19:20:44 GMT
- Authors: Ravi Shankar and Jacob Sager and Archana Venkataraman
- Abstract summary: We introduce a novel method for emotion conversion in speech that does not require parallel training data.
Unlike the conventional cycle-GAN, our discriminator classifies whether a pair of input real and generated samples corresponds to the desired emotion conversion.
We show that our model generalizes to new speakers by modifying speech produced by Wavenet.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce a novel method for emotion conversion in speech that does not
require parallel training data. Our approach loosely relies on a cycle-GAN
schema to minimize the reconstruction error from converting back and forth
between emotion pairs. However, unlike the conventional cycle-GAN, our
discriminator classifies whether a pair of input real and generated samples
corresponds to the desired emotion conversion (e.g., A to B) or to its inverse
(B to A). We will show that this setup, which we refer to as a variational
cycle-GAN (VC-GAN), is equivalent to minimizing the empirical KL divergence
between the source features and their cyclic counterpart. In addition, our
generator combines a trainable deep network with a fixed generative block to
implement a smooth and invertible transformation on the input features, in our
case, the fundamental frequency (F0) contour. This hybrid architecture
regularizes our adversarial training procedure. We use crowd sourcing to
evaluate both the emotional saliency and the quality of synthesized speech.
Finally, we show that our model generalizes to new speakers by modifying speech
produced by Wavenet.
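
The two training signals described in the abstract, a cycle reconstruction error and a discriminator that judges the direction of a (real, generated) pair, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the toy generators simply scale the F0 contour (a stand-in for the paper's deep network plus fixed generative block), the discriminator is a hand-set logistic function, and all function names and parameter values are hypothetical.

```python
import math

# Toy stand-ins (hypothetical): a smooth, invertible scaling of the F0 contour.
def generator_ab(f0, log_scale=0.2):
    """Convert an emotion-A contour toward emotion B by raising F0."""
    return [v * math.exp(log_scale) for v in f0]

def generator_ba(f0, log_scale=0.2):
    """Inverse conversion (B to A): undo the scaling."""
    return [v * math.exp(-log_scale) for v in f0]

def pair_discriminator(first, second, weight=5.0):
    """Probability that the pair (first, second) is an A-to-B conversion.

    Assumed toy rule: a logistic function of the mean log-F0 ratio.
    """
    mean_log_ratio = sum(math.log(s / f) for f, s in zip(first, second)) / len(first)
    return 1.0 / (1.0 + math.exp(-weight * mean_log_ratio))

def cycle_loss(f0):
    """Mean absolute error after converting A to B and back to A."""
    reconstructed = generator_ba(generator_ab(f0))
    return sum(abs(r - v) for r, v in zip(reconstructed, f0)) / len(f0)

def discriminator_loss(real_a, real_b):
    """Binary cross-entropy over the two pair directions."""
    p_ab = pair_discriminator(real_a, generator_ab(real_a))  # label: A to B
    p_ba = pair_discriminator(real_b, generator_ba(real_b))  # label: B to A
    return -(math.log(p_ab) + math.log(1.0 - p_ba)) / 2.0

# Made-up F0 values in Hz, not taken from the paper.
real_a = [110.0, 120.0, 130.0, 125.0]
real_b = [150.0, 165.0, 170.0, 160.0]

print("cycle loss:", cycle_loss(real_a))
print("discriminator loss:", discriminator_loss(real_a, real_b))
```

In this sketch the discriminator never sees a single sample in isolation; it only scores ordered pairs, which is the property the abstract contrasts with a conventional cycle-GAN discriminator.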
Related papers
- Anisotropic multiresolution analyses for deep fake detection [4.903718320156974]
Generative Adversarial Networks (GANs) have paved the path towards entirely new media generation capabilities.
They can also be misused and abused to fabricate elaborate lies, capable of stirring up the public debate.
Previous studies have tackled deep fake detection by using classical machine learning techniques, such as k-nearest neighbours and eigenfaces.
We argue that, since GANs primarily utilize isotropic convolutions to generate their output, they leave clear traces, their fingerprint, in the coefficient distribution on sub-bands extracted by anisotropic transformations.
(2022-10-26T17:26:09Z)
- Recurrence Boosts Diversity! Revisiting Recurrent Latent Variable in Transformer-Based Variational AutoEncoder for Diverse Text Generation [85.5379146125199]
Variational Auto-Encoder (VAE) has been widely adopted in text generation.
We propose TRACE, a Transformer-based recurrent VAE structure.
(2022-10-22T10:25:35Z)
- CSformer: Bridging Convolution and Transformer for Compressive Sensing [65.22377493627687]
This paper proposes a hybrid framework that combines the detailed spatial information captured by a CNN with the global context provided by a transformer for enhanced representation learning.
The proposed approach is an end-to-end compressive image sensing method, composed of adaptive sampling and recovery.
The experimental results demonstrate the effectiveness of the dedicated transformer-based architecture for compressive sensing.
(2021-12-31T04:37:11Z)
- Discretization and Re-synthesis: an alternative method to solve the Cocktail Party Problem [65.25725367771075]
This study demonstrates, for the first time, that the synthesis-based approach can also perform well on this problem.
Specifically, we propose a novel speech separation/enhancement model based on the recognition of discrete symbols.
Once the discrete symbol sequence has been predicted, each target speech signal can be re-synthesized by feeding those symbols to the synthesis model.
(2021-12-17T08:35:40Z)
- Axial Residual Networks for CycleGAN-based Voice Conversion [0.0]
We propose a novel architecture and improved training objectives for non-parallel voice conversion.
Our proposed CycleGAN-based model performs a shape-preserving transformation directly on a high frequency-resolution magnitude spectrogram.
We demonstrate via experiments that our proposed model outperforms Scyclone and performs comparably to or better than CycleGAN-VC2, even without employing a neural vocoder.
(2021-02-16T10:55:35Z)
- Class-Conditional Defense GAN Against End-to-End Speech Attacks [82.21746840893658]
We propose a novel approach against end-to-end adversarial attacks developed to fool advanced speech-to-text systems such as DeepSpeech and Lingvo.
Unlike conventional defense approaches, the proposed approach does not directly employ low-level transformations such as autoencoding a given input signal.
Our defense-GAN considerably outperforms conventional defense algorithms in terms of word error rate and sentence level recognition accuracy.
(2020-10-22T00:02:02Z)
- Spectrum and Prosody Conversion for Cross-lingual Voice Conversion with CycleGAN [81.79070894458322]
Cross-lingual voice conversion aims to change a source speaker's voice to sound like that of a target speaker when the source and target speakers speak different languages.
Previous studies on cross-lingual voice conversion mainly focus on spectral conversion with a linear transformation for F0 transfer.
We propose the use of continuous wavelet transform (CWT) decomposition for F0 modeling. CWT provides a way to decompose a signal into different temporal scales that explain prosody in different time resolutions.
(2020-08-11T07:29:55Z)
- Multi-speaker Emotion Conversion via Latent Variable Regularization and a Chained Encoder-Decoder-Predictor Network [18.275646344620387]
We propose a novel method for emotion conversion in speech based on a chained encoder-decoder-predictor neural network architecture.
We show that our method outperforms the existing state-of-the-art approaches on both the saliency of emotion conversion and the quality of resynthesized speech.
(2020-07-25T13:59:22Z)
- End-to-End Whisper to Natural Speech Conversion using Modified Transformer Network [0.8399688944263843]
We introduce whisper-to-natural-speech conversion using a sequence-to-sequence approach.
We investigate different features like Mel frequency cepstral coefficients and smoothed spectral features.
The proposed networks are trained end-to-end using a supervised approach for feature-to-feature transformation.
(2020-04-20T14:47:46Z)
- Transforming Spectrum and Prosody for Emotional Voice Conversion with Non-Parallel Training Data [91.92456020841438]
Many studies require parallel speech data between different emotional patterns, which is not practical in real life.
We propose a CycleGAN network to find an optimal pseudo pair from non-parallel training data.
We also study the use of continuous wavelet transform (CWT) to decompose F0 into ten temporal scales, which describe speech prosody at different time resolutions.
(2020-02-01T12:36:55Z)