Multi-speaker Emotion Conversion via Latent Variable Regularization and
a Chained Encoder-Decoder-Predictor Network
- URL: http://arxiv.org/abs/2007.12937v2
- Date: Mon, 10 Aug 2020 19:16:40 GMT
- Title: Multi-speaker Emotion Conversion via Latent Variable Regularization and
a Chained Encoder-Decoder-Predictor Network
- Authors: Ravi Shankar and Hsi-Wei Hsieh and Nicolas Charon and Archana
Venkataraman
- Abstract summary: We propose a novel method for emotion conversion in speech based on a chained encoder-decoder-predictor neural network architecture.
We show that our method outperforms the existing state-of-the-art approaches on both the saliency of emotion conversion and the quality of the resynthesized speech.
- Score: 18.275646344620387
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a novel method for emotion conversion in speech based on a chained
encoder-decoder-predictor neural network architecture. The encoder constructs a
latent embedding of the fundamental frequency (F0) contour and the spectrum,
which we regularize using the Large Deformation Diffeomorphic Metric Mapping (LDDMM)
registration framework. The decoder uses this embedding to predict the modified
F0 contour in a target emotional class. Finally, the predictor uses the
original spectrum and the modified F0 contour to generate a corresponding
target spectrum. Our joint objective function simultaneously optimizes the
parameters of three model blocks. We show that our method outperforms the
existing state-of-the-art approaches on both the saliency of emotion
conversion and the quality of resynthesized speech. In addition, the LDDMM
regularization allows our model to convert phrases that were not present in
training, thus providing evidence for out-of-sample generalization.
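A minimal PyTorch sketch of the chained structure described above may help make the data flow concrete. All layer choices, the four-class emotion conditioning, and the simple first-difference smoothness penalty standing in for the LDDMM regularizer are illustrative assumptions, not the authors' implementation.
```python
# Illustrative sketch of the chained encoder-decoder-predictor idea.
# Layer sizes, conditioning, and the smoothness penalty standing in for
# the LDDMM regularizer are assumptions, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChainedEmotionConverter(nn.Module):
    def __init__(self, n_mels=80, latent_dim=64, n_emotions=4):
        super().__init__()
        # Encoder: joint latent embedding of the F0 contour and the spectrum.
        self.encoder = nn.GRU(n_mels + 1, latent_dim, batch_first=True)
        # Decoder: predicts the modified F0 contour for the target emotion.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + n_emotions, latent_dim),
            nn.ReLU(),
            nn.Linear(latent_dim, 1),
        )
        # Predictor: original spectrum + modified F0 -> target spectrum.
        self.predictor = nn.GRU(n_mels + 1, n_mels, batch_first=True)

    def forward(self, spectrum, f0, emotion):
        # spectrum: (B, T, n_mels); f0: (B, T, 1); emotion: (B, n_emotions).
        z, _ = self.encoder(torch.cat([spectrum, f0], dim=-1))
        cond = emotion.unsqueeze(1).expand(-1, z.size(1), -1)
        f0_mod = self.decoder(torch.cat([z, cond], dim=-1))
        spec_mod, _ = self.predictor(torch.cat([spectrum, f0_mod], dim=-1))
        return f0_mod, spec_mod

def joint_loss(model, spectrum, f0, emotion, f0_tgt, spec_tgt, lam=0.1):
    # One objective over all three blocks; the first-difference penalty on
    # the predicted contour is only a stand-in for LDDMM regularization.
    f0_mod, spec_mod = model(spectrum, f0, emotion)
    recon = F.l1_loss(f0_mod, f0_tgt) + F.l1_loss(spec_mod, spec_tgt)
    smooth = (f0_mod[:, 1:] - f0_mod[:, :-1]).pow(2).mean()
    return recon + lam * smooth
```
Because the three blocks share one objective, gradients from the spectrum predictor also shape the F0 embedding, which is the motivation for chaining the blocks rather than training each separately.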
Related papers
- Autoregressive Speech Synthesis without Vector Quantization [135.4776759536272]
We present MELLE, a novel continuous-valued token-based language modeling approach for text-to-speech synthesis (TTS).
MELLE autoregressively generates continuous mel-spectrogram frames directly from the text condition.
arXiv Detail & Related papers (2024-07-11T14:36:53Z)
- Adaptive re-calibration of channel-wise features for Adversarial Audio Classification [0.0]
We propose a recalibration of features using attentional feature fusion for synthetic speech detection.
We compare its performance against different detection methods, including End2End models and ResNet-based models.
We also demonstrate that combining linear frequency cepstral coefficients (LFCC) and mel-frequency cepstral coefficients (MFCC) with the attentional feature fusion technique creates better input feature representations, as sketched below.
arXiv Detail & Related papers (2022-10-21T04:21:56Z)
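A minimal sketch of the LFCC/MFCC fusion idea from the entry above, using torchaudio's built-in LFCC and MFCC transforms. The sigmoid gate is a generic attention-style fusion standing in for the paper's attentional feature fusion block; coefficient counts and the usage example are assumptions.
```python
# Sketch: fuse LFCC and MFCC features with a learned attention-style gate.
# The gate is a generic stand-in for the paper's attentional feature fusion.
import torch
import torch.nn as nn
import torchaudio

class AttentiveFusion(nn.Module):
    def __init__(self, n_coeffs=40, sample_rate=16000):
        super().__init__()
        self.lfcc = torchaudio.transforms.LFCC(sample_rate=sample_rate, n_lfcc=n_coeffs)
        self.mfcc = torchaudio.transforms.MFCC(sample_rate=sample_rate, n_mfcc=n_coeffs)
        # Per-coefficient, per-frame gate computed from both feature maps.
        self.gate = nn.Sequential(nn.Conv1d(2 * n_coeffs, n_coeffs, 1), nn.Sigmoid())

    def forward(self, waveform):
        a = self.lfcc(waveform)                    # (B, n_coeffs, frames)
        b = self.mfcc(waveform)                    # (B, n_coeffs, frames)
        w = self.gate(torch.cat([a, b], dim=1))    # attention weights in [0, 1]
        return w * a + (1.0 - w) * b               # weighted combination

fused = AttentiveFusion()(torch.randn(1, 16000))   # one second of dummy audio
print(fused.shape)                                 # (1, n_coeffs, frames)
```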
- A Deep-Bayesian Framework for Adaptive Speech Duration Modification [20.99099283004413]
We use a Bayesian framework to define a latent attention map that links frames of the input and target utterances.
We train a masked convolutional encoder-decoder network to produce this attention map via a version of the mean absolute error loss function.
We show that our technique results in generated speech whose quality is on par with that of state-of-the-art vocoders; see the sketch below.
arXiv Detail & Related papers (2021-07-11T05:53:07Z)
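One plausible reading of the attention-map idea in the entry above, sketched in PyTorch: projections of the input and target utterances are scored pairwise, a softmax turns the scores into an attention map linking frames, and training minimizes the mean absolute error of the attention-weighted reconstruction. The paper's masking scheme and exact encoder-decoder architecture are not reproduced here.
```python
# Speculative sketch: a frame-linking attention map trained with an MAE loss.
import torch
import torch.nn as nn

class FrameAligner(nn.Module):
    def __init__(self, n_feats=80, hidden=64):
        super().__init__()
        self.src_proj = nn.Conv1d(n_feats, hidden, kernel_size=3, padding=1)
        self.tgt_proj = nn.Conv1d(n_feats, hidden, kernel_size=3, padding=1)

    def forward(self, src, tgt):
        # src: (B, F, T_src), tgt: (B, F, T_tgt) spectral features.
        qs, qt = self.src_proj(src), self.tgt_proj(tgt)
        attn = torch.softmax(qt.transpose(1, 2) @ qs, dim=-1)  # (B, T_tgt, T_src)
        recon = (attn @ src.transpose(1, 2)).transpose(1, 2)   # (B, F, T_tgt)
        return attn, recon

model = FrameAligner()
src, tgt = torch.randn(2, 80, 120), torch.randn(2, 80, 150)
attn, recon = model(src, tgt)
loss = torch.mean(torch.abs(recon - tgt))   # the mean-absolute-error objective
```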
- UNETR: Transformers for 3D Medical Image Segmentation [8.59571749685388]
We introduce a novel architecture, dubbed UNEt TRansformers (UNETR), that utilizes a pure transformer as the encoder to learn sequence representations of the input volume.
We have extensively validated the performance of our proposed model across different imaging modalities.
arXiv Detail & Related papers (2021-03-18T20:17:15Z)
- Autoencoding Variational Autoencoder [56.05008520271406]
We study the implications of this behaviour for the learned representations, as well as the consequences of fixing it by introducing a notion of self-consistency.
We show that encoders trained with our self-consistency approach lead to representations that are robust (insensitive) to perturbations in the input introduced by adversarial attacks.
arXiv Detail & Related papers (2020-12-07T14:16:14Z)
- Cross-Thought for Sentence Encoder Pre-training [89.32270059777025]
Cross-Thought is a novel approach to pre-training a sequence encoder.
We train a Transformer-based sequence encoder over a large set of short sequences.
Experiments on question answering and textual entailment tasks demonstrate that our pre-trained encoder can outperform state-of-the-art encoders.
arXiv Detail & Related papers (2020-10-07T21:02:41Z)
- Non-parallel Emotion Conversion using a Deep-Generative Hybrid Network and an Adversarial Pair Discriminator [16.18921154013272]
We introduce a novel method for emotion conversion in speech that does not require parallel training data.
Unlike the conventional cycle-GAN, our discriminator classifies whether a pair consisting of a real input sample and a generated sample corresponds to the desired emotion conversion, as sketched below.
We show that our model generalizes to new speakers by modifying speech produced by WaveNet.
arXiv Detail & Related papers (2020-07-25T13:50:00Z)
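A sketch of the pair-discriminator idea from the entry above: rather than scoring a single sample as a conventional cycle-GAN discriminator does, the network scores a (real input, generated output) pair together with the target emotion label. Channel sizes and pooling are assumptions.
```python
# Sketch of a pair discriminator conditioned on the target emotion.
import torch
import torch.nn as nn

class PairDiscriminator(nn.Module):
    def __init__(self, n_feats=80, n_emotions=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(2 * n_feats + n_emotions, 128, 5, stride=2, padding=2),
            nn.LeakyReLU(0.2),
            nn.Conv1d(128, 128, 5, stride=2, padding=2),
            nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool1d(1),   # pool over time to one summary vector
        )
        self.out = nn.Linear(128, 1)

    def forward(self, real, generated, emotion):
        # real, generated: (B, n_feats, T); emotion: (B, n_emotions) one-hot.
        cond = emotion.unsqueeze(-1).expand(-1, -1, real.size(-1))
        h = self.net(torch.cat([real, generated, cond], dim=1)).squeeze(-1)
        # Probability that the pair realizes the desired emotion conversion.
        return torch.sigmoid(self.out(h))
```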
- MetaSDF: Meta-learning Signed Distance Functions [85.81290552559817]
Generalizing across shapes with neural implicit representations amounts to learning priors over the respective function space.
We formalize learning of a shape space as a meta-learning problem and leverage gradient-based meta-learning algorithms to solve this task.
arXiv Detail & Related papers (2020-06-17T05:14:53Z)
- End-to-End Whisper to Natural Speech Conversion using Modified Transformer Network [0.8399688944263843]
We introduce whisper-to-natural-speech conversion using a sequence-to-sequence approach.
We investigate different features, such as mel-frequency cepstral coefficients and smoothed spectral features.
The proposed networks are trained end-to-end using a supervised approach for feature-to-feature transformation.
arXiv Detail & Related papers (2020-04-20T14:47:46Z)
- On the Encoder-Decoder Incompatibility in Variational Text Modeling and Beyond [82.18770740564642]
Variational autoencoders (VAEs) combine latent variables with amortized variational inference.
We observe an encoder-decoder incompatibility that leads to poor parameterizations of the data manifold.
We propose Coupled-VAE, which couples a VAE model with a deterministic autoencoder with the same structure.
arXiv Detail & Related papers (2020-04-20T10:34:10Z)
- Transforming Spectrum and Prosody for Emotional Voice Conversion with Non-Parallel Training Data [91.92456020841438]
Many studies require parallel speech data between different emotional patterns, which is not practical in real life.
We propose a CycleGAN network to find an optimal pseudo pair from non-parallel training data.
We also study the use of the continuous wavelet transform (CWT) to decompose F0 into ten temporal scales that describe speech prosody at different time resolutions, as sketched below.
arXiv Detail & Related papers (2020-02-01T12:36:55Z)
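A short sketch of the CWT decomposition mentioned in the last entry, using PyWavelets. The Mexican-hat mother wavelet and dyadic scales are common choices in CWT prosody modeling, but both are assumptions here.
```python
# Sketch: decompose an F0 contour into ten temporal scales with a CWT.
import numpy as np
import pywt

f0 = np.random.rand(400)              # stand-in for an interpolated F0 contour
scales = 2.0 ** np.arange(1, 11)      # ten dyadic scales
coeffs, freqs = pywt.cwt(f0, scales, "mexh")
print(coeffs.shape)                   # (10, 400): one row per temporal scale
```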
This list is automatically generated from the titles and abstracts of the papers on this site.