CycleTransGAN-EVC: A CycleGAN-based Emotional Voice Conversion Model with Transformer
- URL: http://arxiv.org/abs/2111.15159v1
- Date: Tue, 30 Nov 2021 06:33:57 GMT
- Title: CycleTransGAN-EVC: A CycleGAN-based Emotional Voice Conversion Model with Transformer
- Authors: Changzeng Fu, Chaoran Liu, Carlos Toshinori Ishi, Hiroshi Ishiguro
- Abstract summary: We propose a CycleGAN-based model with a transformer and investigate its ability on the emotional voice conversion task.
In the training procedure, we adopt curriculum learning to gradually increase the frame length, so that the model progresses from short segments to the entire utterance.
The results show that the proposed model converts emotion with higher strength and quality than the baselines.
- Score: 11.543807097834785
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this study, we explore the transformer's ability to capture
intra-relations among frames by enlarging the receptive field of models.
Concretely, we propose a CycleGAN-based model with a transformer and
investigate its ability on the emotional voice conversion task. In the
training procedure, we adopt curriculum learning to gradually increase the
frame length, so that the model progresses from short segments to the
entire utterance (a sketch of such a schedule follows the abstract). The
proposed method was evaluated on a Japanese emotional speech dataset and
compared to several baselines (ACVAE, CycleGAN) with objective and
subjective evaluations. The results show that the proposed model converts
emotion with higher strength and quality than the baselines.
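The curriculum over frame lengths can be made concrete with a small scheduling helper. The sketch below is illustrative only: the stage boundaries and segment lengths are assumptions, since the abstract states just that the length grows gradually from short segments to the whole utterance.

```python
import random

def frame_length_for_epoch(epoch, schedule=((0, 64), (20, 128), (40, 256), (60, None))):
    """Return the maximum frame length for this epoch.

    `schedule` maps epoch thresholds to frame lengths; None means "use the
    entire utterance". The thresholds are made up for illustration -- the
    paper only states that the length grows gradually.
    """
    length = schedule[0][1]
    for start_epoch, frames in schedule:
        if epoch >= start_epoch:
            length = frames
    return length

def crop_utterance(features, max_frames):
    """Randomly crop a (T, n_mels) feature sequence to at most max_frames."""
    if max_frames is None or len(features) <= max_frames:
        return features
    start = random.randrange(len(features) - max_frames + 1)
    return features[start:start + max_frames]

# Example: a 300-frame utterance is cropped early in training and passed
# through whole once the curriculum reaches its final stage.
utterance = [[0.0] * 80 for _ in range(300)]  # dummy mel frames
for epoch in (0, 25, 45, 70):
    segment = crop_utterance(utterance, frame_length_for_epoch(epoch))
    print(epoch, len(segment))
```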
Related papers
- Joint Fine-tuning and Conversion of Pretrained Speech and Language Models towards Linear Complexity [11.302828987873497]
We present a Cross-Architecture Layerwise Distillation (CALD) approach that jointly converts a transformer model into a linear-time substitute and fine-tunes it for a target task.
We show that CALD can effectively recover the performance of the original model and that the guiding strategy contributes to this result.
arXiv Detail & Related papers (2024-10-09T13:06:43Z)
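The layerwise guiding idea in CALD can be sketched as matching each linear-time student layer's hidden states to the corresponding frozen teacher layer. The toy layers, the one-to-one layer mapping, and the MSE objective below are assumptions; the summary only says conversion and fine-tuning are guided jointly.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins: a quadratic-attention teacher layer and a
# linear-time student layer. Only the distillation wiring matters here.
teacher_layers = nn.ModuleList([nn.Linear(256, 256) for _ in range(4)])
student_layers = nn.ModuleList([nn.Linear(256, 256) for _ in range(4)])

def layerwise_distill_loss(x):
    """Sum of MSE losses between matched teacher/student hidden states."""
    t, s, loss = x, x, 0.0
    for t_layer, s_layer in zip(teacher_layers, student_layers):
        with torch.no_grad():          # teacher is frozen
            t = t_layer(t)
        s = s_layer(s)
        loss = loss + nn.functional.mse_loss(s, t)
    return loss

loss = layerwise_distill_loss(torch.randn(8, 16, 256))  # (batch, time, dim)
loss.backward()  # gradients flow only into the student layers
```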
- CSLP-AE: A Contrastive Split-Latent Permutation Autoencoder Framework for Zero-Shot Electroencephalography Signal Conversion [49.1574468325115]
A key aim in EEG analysis is to extract the underlying neural activation (content) while accounting for individual subject variability (style).
Inspired by recent advancements in voice conversion technologies, we propose a novel contrastive split-latent permutation autoencoder (CSLP-AE) framework that directly optimizes for EEG conversion.
arXiv Detail & Related papers (2023-11-13T22:46:43Z)
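The split-latent permutation idea can be sketched as an autoencoder with separate content and style codes, where content codes are swapped between samples that share content (e.g., two subjects performing the same task). Everything concrete below -- shapes, linear encoders, the MSE reconstruction -- is an illustrative assumption, not the CSLP-AE architecture.

```python
import torch
import torch.nn as nn

class SplitLatentAE(nn.Module):
    """Toy autoencoder with separate content and style latents."""
    def __init__(self, dim=64, z=16):
        super().__init__()
        self.enc_content = nn.Linear(dim, z)
        self.enc_style = nn.Linear(dim, z)
        self.dec = nn.Linear(2 * z, dim)

    def forward(self, x):
        return self.enc_content(x), self.enc_style(x)

    def decode(self, c, s):
        return self.dec(torch.cat([c, s], dim=-1))

model = SplitLatentAE()
x_a, x_b = torch.randn(4, 64), torch.randn(4, 64)  # same content, two subjects
c_a, s_a = model(x_a)
c_b, s_b = model(x_b)
# Permutation: reconstruct each sample from the *other* sample's content
# code, forcing content codes to be interchangeable across subjects.
recon_a = model.decode(c_b, s_a)
loss = nn.functional.mse_loss(recon_a, x_a)
```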
- Knowledge Distillation from BERT Transformer to Speech Transformer for Intent Classification [66.62686601948455]
We exploit the transformer distillation method, which is specifically designed for knowledge distillation from a transformer-based language model to a transformer-based speech model.
We achieve intent classification accuracies of 99.10% and 88.79% on the Fluent Speech corpus and the ATIS database, respectively.
arXiv Detail & Related papers (2021-08-05T13:08:13Z)
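A standard form of such distillation combines a temperature-scaled KL term on the teacher's soft labels with the usual cross-entropy on hard labels. This is a generic sketch, not the paper's exact objective; the temperature, mixing weight, and class count are placeholders.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Soft-label KL (at temperature T) mixed with hard-label cross-entropy.
    T and alpha are illustrative defaults, not values from the paper."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Example: 31 intent classes, teacher logits from a BERT-style model.
loss = distillation_loss(torch.randn(8, 31), torch.randn(8, 31),
                         torch.randint(0, 31, (8,)))
```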
- An Improved StarGAN for Emotional Voice Conversion: Enhancing Voice Quality and Data Augmentation [8.017817904347964]
We propose a novel StarGAN framework along with a two-stage training process that separates emotional features from those independent of emotion.
The proposed model achieves favourable results in both the objective and subjective evaluations in terms of distortion.
In data augmentation experiments for end-to-end speech emotion recognition, the proposed StarGAN model achieves an increase of 2% in Micro-F1 and 5% in Macro-F1.
arXiv Detail & Related papers (2021-07-18T04:28:47Z)
- Axial Residual Networks for CycleGAN-based Voice Conversion [0.0]
We propose a novel architecture and improved training objectives for non-parallel voice conversion.
Our proposed CycleGAN-based model performs a shape-preserving transformation directly on a high frequency-resolution magnitude spectrogram.
We demonstrate via experiments that our proposed model outperforms Scyclone and performs comparably to or better than CycleGAN-VC2, even without employing a neural vocoder.
arXiv Detail & Related papers (2021-02-16T10:55:35Z)
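A shape-preserving transformation on a magnitude spectrogram means the generator returns an output with exactly the input's (frequency, time) shape, so no resolution is lost before waveform reconstruction. The residual convolution block below illustrates that property with a generic stand-in, not the paper's axial residual design.

```python
import torch
import torch.nn as nn

class ShapePreservingBlock(nn.Module):
    """Residual conv block whose output spectrogram has the same
    (channels, freq, time) shape as its input -- a generic stand-in
    for the paper's axial residual blocks."""
    def __init__(self, channels=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)  # residual connection keeps the shape

spec = torch.randn(1, 1, 513, 128)  # (batch, ch, freq bins, frames)
out = ShapePreservingBlock()(spec)
assert out.shape == spec.shape
```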
- Parameter Efficient Multimodal Transformers for Video Representation Learning [108.8517364784009]
This work focuses on reducing the parameters of multimodal Transformers in the context of audio-visual video representation learning.
We show that our approach reduces parameters by up to 80%, allowing us to train our model end-to-end from scratch.
To demonstrate our approach, we pretrain our model on 30-second clips from Kinetics-700 and transfer it to audio-visual classification tasks.
arXiv Detail & Related papers (2020-12-08T00:16:13Z)
- Pretraining Techniques for Sequence-to-Sequence Voice Conversion [57.65753150356411]
Sequence-to-sequence (seq2seq) voice conversion (VC) models are attractive owing to their ability to convert prosody.
We propose to transfer knowledge from other speech processing tasks where large-scale corpora are easily available, typically text-to-speech (TTS) and automatic speech recognition (ASR).
We argue that VC models with such pretrained ASR or TTS model parameters can generate effective hidden representations for high-fidelity, highly intelligible converted speech.
arXiv Detail & Related papers (2020-08-07T11:02:07Z)
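Transferring pretrained ASR/TTS parameters into a seq2seq VC model amounts to a partial state-dict load: initialize the modules whose keys match and train the rest from scratch. The module names and shapes below are hypothetical; only the strict=False loading mechanics are the point.

```python
import torch
import torch.nn as nn

class Seq2SeqVC(nn.Module):
    """Toy encoder-decoder VC model (names are illustrative only)."""
    def __init__(self, dim=80, hid=256):
        super().__init__()
        self.encoder = nn.GRU(dim, hid, batch_first=True)
        self.decoder = nn.GRU(hid, dim, batch_first=True)

vc = Seq2SeqVC()
# Suppose an ASR encoder with matching shapes was pretrained separately;
# we load whichever keys line up and leave the rest (here, the decoder)
# randomly initialized. The dict below stands in for a real checkpoint.
pretrained = {("encoder." + k): v for k, v in vc.encoder.state_dict().items()}
missing, unexpected = vc.load_state_dict(pretrained, strict=False)
print(len(missing), len(unexpected))  # decoder keys remain to be trained
```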
- Non-parallel Emotion Conversion using a Deep-Generative Hybrid Network and an Adversarial Pair Discriminator [16.18921154013272]
We introduce a novel method for emotion conversion in speech that does not require parallel training data.
Unlike the conventional cycle-GAN, our discriminator classifies whether a pair of input real and generated samples corresponds to the desired emotion conversion.
We show that our model generalizes to new speakers by modifying speech produced by WaveNet.
arXiv Detail & Related papers (2020-07-25T13:50:00Z)
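A pair discriminator scores a (source, converted) pair jointly instead of judging the generated sample alone, so the adversarial signal is tied to the desired conversion. The concatenation-based classifier below is one assumed way to realize this, not the paper's network.

```python
import torch
import torch.nn as nn

class PairDiscriminator(nn.Module):
    """Classifies whether a (source, converted) pair realizes the desired
    emotion conversion -- a sketch of the idea only."""
    def __init__(self, dim=80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * dim, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1)
        )

    def forward(self, real, generated):
        # Score the pair jointly instead of the generated sample alone.
        return self.net(torch.cat([real, generated], dim=-1))

d = PairDiscriminator()
score = d(torch.randn(4, 80), torch.randn(4, 80))  # one logit per pair
```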
- Variational Transformers for Diverse Response Generation [71.53159402053392]
The Variational Transformer (VT) is a variational self-attentive feed-forward sequence model.
VT combines the parallelizability and global receptive field computation of the Transformer with the variational nature of the CVAE.
We explore two types of VT: 1) modeling discourse-level diversity with a global latent variable; and 2) augmenting the Transformer decoder with a sequence of fine-grained latent variables.
arXiv Detail & Related papers (2020-03-28T07:48:02Z)
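The global-latent variant conditions the decoder on a single utterance-level z drawn with the reparameterization trick, adding a KL term to the ELBO. The mean-pooled posterior and the dimensions below are assumptions, not the paper's exact parameterization.

```python
import torch
import torch.nn as nn

class GlobalLatent(nn.Module):
    """Posterior over one sequence-level latent z via the
    reparameterization trick; mean-pooling is an illustrative choice."""
    def __init__(self, dim=256, z=32):
        super().__init__()
        self.to_mu = nn.Linear(dim, z)
        self.to_logvar = nn.Linear(dim, z)

    def forward(self, enc_states):            # (batch, time, dim)
        pooled = enc_states.mean(dim=1)
        mu, logvar = self.to_mu(pooled), self.to_logvar(pooled)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
        return z, kl  # z conditions the decoder; kl joins the ELBO

z, kl = GlobalLatent()(torch.randn(8, 20, 256))
```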
- Transforming Spectrum and Prosody for Emotional Voice Conversion with Non-Parallel Training Data [91.92456020841438]
Many studies require parallel speech data between different emotional patterns, which is not practical in real life.
We propose a CycleGAN network to find an optimal pseudo pair from non-parallel training data.
We also study the use of continuous wavelet transform (CWT) to decompose F0 into ten temporal scales that describe speech prosody at different time resolutions.
arXiv Detail & Related papers (2020-02-01T12:36:55Z)
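Decomposing an F0 contour into ten temporal scales with the continuous wavelet transform can be sketched with the third-party PyWavelets package. The Mexican-hat wavelet and dyadic scales are common choices in prosody modeling but are assumptions here, as is the synthetic contour.

```python
import numpy as np
import pywt

# Dummy voiced F0 contour (in practice: interpolate unvoiced frames,
# then z-normalize in the log domain before the CWT).
f0 = np.log(150 + 30 * np.sin(np.linspace(0, 6 * np.pi, 400)))
f0 = (f0 - f0.mean()) / f0.std()

# Ten dyadic scales -- one row per temporal resolution of prosody.
scales = 2.0 ** np.arange(1, 11)
coeffs, _ = pywt.cwt(f0, scales, "mexh")
print(coeffs.shape)  # (10, 400): ten scale components of the contour
```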
- EEG based Continuous Speech Recognition using Transformers [13.565270550358397]
We investigate continuous speech recognition from electroencephalography (EEG) features with an end-to-end transformer-based automatic speech recognition (ASR) model.
Our results demonstrate that the transformer-based model trains faster than recurrent neural network (RNN) based sequence-to-sequence EEG models.
arXiv Detail & Related papers (2019-12-31T08:36:59Z)