An Improved StarGAN for Emotional Voice Conversion: Enhancing Voice
Quality and Data Augmentation
- URL: http://arxiv.org/abs/2107.08361v1
- Date: Sun, 18 Jul 2021 04:28:47 GMT
- Title: An Improved StarGAN for Emotional Voice Conversion: Enhancing Voice
Quality and Data Augmentation
- Authors: Xiangheng He, Junjie Chen, Georgios Rizos, Björn W. Schuller
- Abstract summary: We propose a novel StarGAN framework along with a two-stage training process that separates emotional features from those independent of emotion.
The proposed model achieves favourable results in both the objective evaluation and the subjective evaluation in terms of distortion.
In data augmentation experiments for end-to-end speech emotion recognition, the proposed StarGAN model achieves an increase of 2% in Micro-F1 and 5% in Macro-F1.
- Score: 8.017817904347964
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Emotional Voice Conversion (EVC) aims to convert the emotional style of a
source speech signal to a target style while preserving its content and speaker
identity information. Previous emotional conversion studies do not disentangle
emotional information from emotion-independent information that should be
preserved, thus transforming it all in a monolithic manner and generating audio
of low quality, with linguistic distortions. To address this distortion
problem, we propose a novel StarGAN framework along with a two-stage training
process that separates emotional features from those independent of emotion by
using an autoencoder with two encoders as the generator of the Generative
Adversarial Network (GAN). The proposed model achieves favourable results in
both the objective evaluation and the subjective evaluation in terms of
distortion, which reveals that the proposed model can effectively reduce
distortion. Furthermore, in data augmentation experiments for end-to-end speech
emotion recognition, the proposed StarGAN model achieves an increase of 2% in
Micro-F1 and 5% in Macro-F1 compared to the baseline StarGAN model, which
indicates that the proposed model is more valuable for data augmentation.
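The augmentation gains above are reported in both Micro-F1 (which pools true/false positives over all classes) and Macro-F1 (which averages per-class F1, so rare emotion classes count equally). As a minimal illustration of why the two can differ, here is a pure-Python sketch; the `micro_macro_f1` helper and the toy labels are illustrative, not taken from the paper:

```python
from collections import Counter

def micro_macro_f1(y_true, y_pred, labels):
    """Compute micro- and macro-averaged F1 from flat label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    # Micro: pool counts over all classes, then compute a single F1.
    TP, FP, FN = sum(tp.values()), sum(fp.values()), sum(fn.values())
    micro = 2 * TP / (2 * TP + FP + FN) if TP else 0.0
    # Macro: per-class F1, then the unweighted mean over all classes.
    per_class = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        per_class.append(2 * tp[c] / denom if denom else 0.0)
    macro = sum(per_class) / len(labels)
    return micro, macro

# Toy example: a rare class ("sad") that the classifier misses entirely
# hurts macro F1 far more than micro F1.
micro, macro = micro_macro_f1(
    ["ang", "ang", "ang", "ang", "sad"],
    ["ang", "ang", "ang", "ang", "ang"],
    labels=["ang", "sad"],
)
# → micro = 0.800, macro ≈ 0.444
```

A larger Macro-F1 gain (5% vs. 2% Micro-F1) therefore suggests the augmented data helps most on under-represented emotion classes.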
Related papers
- Self-Taught Recognizer: Toward Unsupervised Adaptation for Speech Foundation Models [84.8919069953397]
Self-TAught Recognizer (STAR) is an unsupervised adaptation framework for speech recognition systems.
We show that STAR achieves an average of 13.5% relative reduction in word error rate across 14 target domains.
STAR exhibits high data efficiency, requiring less than one hour of unlabeled data.
arXiv Detail & Related papers (2024-05-23T04:27:11Z)
- Non-autoregressive real-time Accent Conversion model with voice cloning [0.0]
We have developed a non-autoregressive model for real-time accent conversion with voice cloning.
The model generates native-sounding L1 speech with minimal latency based on input L2 speech.
The model has the ability to save, clone and change the timbre, gender and accent of the speaker's voice in real time.
arXiv Detail & Related papers (2024-05-21T19:07:26Z)
- DurFlex-EVC: Duration-Flexible Emotional Voice Conversion with Parallel Generation [37.35829410807451]
Emotional voice conversion (EVC) seeks to modify the emotional tone of a speaker's voice.
Recent advancements in EVC have involved the simultaneous modeling of pitch and duration.
This study shifts focus towards parallel speech generation.
arXiv Detail & Related papers (2024-01-16T03:39:35Z)
- EmoDiarize: Speaker Diarization and Emotion Identification from Speech Signals using Convolutional Neural Networks [0.0]
This research explores the integration of deep learning techniques in speech emotion recognition.
It introduces a framework that combines a pre-existing speaker diarization pipeline and an emotion identification model built on a Convolutional Neural Network (CNN).
The proposed model yields an unweighted accuracy of 63%, demonstrating remarkable efficiency in accurately identifying emotional states within speech signals.
arXiv Detail & Related papers (2023-10-19T16:02:53Z)
- Multimodal Emotion Recognition using Transfer Learning from Speaker Recognition and BERT-based models [53.31917090073727]
We propose a neural network-based emotion recognition framework that uses a late fusion of transfer-learned and fine-tuned models from speech and text modalities.
We evaluate the effectiveness of our proposed multimodal approach on the interactive emotional dyadic motion capture dataset.
arXiv Detail & Related papers (2022-02-16T00:23:42Z)
- Discretization and Re-synthesis: an alternative method to solve the Cocktail Party Problem [65.25725367771075]
This study demonstrates, for the first time, that the synthesis-based approach can also perform well on this problem.
Specifically, we propose a novel speech separation/enhancement model based on the recognition of discrete symbols.
By feeding the predicted discrete symbol sequence into the synthesis model, each target speech can be re-synthesized.
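The recognize-then-resynthesize idea can be pictured with a toy vector-quantization step: continuous feature frames are mapped to discrete codebook indices, and the frame sequence is reconstructed by looking the indices back up. This is only a minimal sketch of the discretization concept; the `quantize`/`resynthesize` helpers and the two-entry codebook are invented for illustration:

```python
def quantize(frames, codebook):
    """Map each feature frame to the index of its nearest codebook entry."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda i: sq_dist(f, codebook[i]))
            for f in frames]

def resynthesize(symbols, codebook):
    """Reconstruct a frame sequence by looking up each discrete symbol."""
    return [codebook[s] for s in symbols]

# Toy 2-D codebook and two noisy frames.
codebook = [(0.0, 0.0), (1.0, 1.0)]
symbols = quantize([(0.1, -0.1), (0.9, 1.2)], codebook)
# → symbols = [0, 1]; resynthesize(symbols, codebook) recovers clean frames.
```

In the paper's setting, the discrete symbols would come from a recognition model and the lookup would be replaced by a learned synthesis model, so that interference not representable by the symbols is discarded.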
arXiv Detail & Related papers (2021-12-17T08:35:40Z)
- CycleTransGAN-EVC: A CycleGAN-based Emotional Voice Conversion Model with Transformer [11.543807097834785]
We propose a CycleGAN-based model with the transformer and investigate its ability in the emotional voice conversion task.
In the training procedure, we adopt curriculum learning to gradually increase the frame length, so that the model is exposed to segments ranging from short clips up to the entire utterance.
The results show that our proposed model is able to convert emotion with higher strength and quality.
arXiv Detail & Related papers (2021-11-30T06:33:57Z)
- Decoupling Speaker-Independent Emotions for Voice Conversion Via Source-Filter Networks [14.55242023708204]
We propose a novel Source-Filter-based Emotional VC model (SFEVC) to achieve proper filtering of speaker-independent emotion features.
Our SFEVC model consists of multi-channel encoders, emotion separate encoders, and one decoder.
arXiv Detail & Related papers (2021-10-04T03:14:48Z)
- Factorized Neural Transducer for Efficient Language Model Adaptation [51.81097243306204]
We propose a novel model, factorized neural Transducer, by factorizing the blank and vocabulary prediction.
It is expected that this factorization can transfer the improvement of the standalone language model to the Transducer for speech recognition.
We demonstrate that the proposed factorized neural Transducer yields 15% to 20% WER improvements when out-of-domain text data is used for language model adaptation.
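The 15% to 20% figures are relative reductions in word error rate (WER), i.e. edit distance between hypothesis and reference word sequences divided by reference length. As a reference sketch (the `wer` helper and sample transcripts are illustrative, not from the paper):

```python
def wer(ref, hyp):
    """Word error rate: word-level edit distance / reference length."""
    r, h = ref.split(), hyp.split()
    # Rolling-row dynamic programming over substitutions, insertions, deletions.
    d = list(range(len(h) + 1))
    for i in range(1, len(r) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(h) + 1):
            cur = d[j]
            d[j] = min(d[j] + 1,                         # deletion
                       d[j - 1] + 1,                     # insertion
                       prev + (r[i - 1] != h[j - 1]))    # substitution / match
            prev = cur
    return d[len(h)] / len(r)

# One substitution in three reference words:
print(wer("the cat sat", "the cat sit"))  # → 0.3333...
```

A relative improvement then compares two absolute WERs: e.g. a drop from 20% to 16% WER is (0.20 - 0.16) / 0.20 = 20% relative (these numbers are an invented example, not the paper's results).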
arXiv Detail & Related papers (2021-09-27T15:04:00Z)
- VAW-GAN for Disentanglement and Recomposition of Emotional Elements in Speech [91.92456020841438]
We study the disentanglement and recomposition of emotional elements in speech through a variational autoencoding Wasserstein generative adversarial network (VAW-GAN).
We propose a speaker-dependent EVC framework that includes two VAW-GAN pipelines, one for spectrum conversion, and another for prosody conversion.
Experiments validate the effectiveness of our proposed method in both objective and subjective evaluations.
arXiv Detail & Related papers (2020-11-03T08:49:33Z)
- Seen and Unseen emotional style transfer for voice conversion with a new emotional speech dataset [84.53659233967225]
Emotional voice conversion aims to transform emotional prosody in speech while preserving the linguistic content and speaker identity.
We propose a novel framework based on a variational auto-encoding Wasserstein generative adversarial network (VAW-GAN).
We show that the proposed framework achieves remarkable performance by consistently outperforming the baseline framework.
arXiv Detail & Related papers (2020-10-28T07:16:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.