An Improved StarGAN for Emotional Voice Conversion: Enhancing Voice Quality and Data Augmentation
- URL: http://arxiv.org/abs/2107.08361v1
- Date: Sun, 18 Jul 2021 04:28:47 GMT
- Title: An Improved StarGAN for Emotional Voice Conversion: Enhancing Voice Quality and Data Augmentation
- Authors: Xiangheng He, Junjie Chen, Georgios Rizos, Björn W. Schuller
- Abstract summary: We propose a novel StarGAN framework along with a two-stage training process that separates emotional features from those independent of emotion.
The proposed model achieves favourable results in both the objective evaluation and the subjective evaluation in terms of distortion.
In data augmentation experiments for end-to-end speech emotion recognition, the proposed StarGAN model achieves an increase of 2% in Micro-F1 and 5% in Macro-F1.
- Score: 8.017817904347964
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Emotional Voice Conversion (EVC) aims to convert the emotional style of a
source speech signal to a target style while preserving its content and speaker
identity information. Previous emotional conversion studies do not disentangle
emotional information from emotion-independent information that should be
preserved, thus transforming it all in a monolithic manner and generating audio
of low quality, with linguistic distortions. To address this distortion
problem, we propose a novel StarGAN framework along with a two-stage training
process that separates emotional features from those independent of emotion by
using an autoencoder with two encoders as the generator of the Generative
Adversarial Network (GAN). The proposed model achieves favourable results in
both objective and subjective evaluations of distortion, indicating that it
effectively reduces distortion. Furthermore, in data augmentation experiments
for end-to-end speech
emotion recognition, the proposed StarGAN model achieves an increase of 2% in
Micro-F1 and 5% in Macro-F1 compared to the baseline StarGAN model, which
indicates that the proposed model is more valuable for data augmentation.
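As a concrete illustration of the two-encoder design described in the abstract above, here is a minimal PyTorch sketch of a generator that splits the input into an emotion-independent stream and an emotion-dependent stream, then decodes them together with a StarGAN-style target-emotion label. All module names, layer shapes, and dimensions are illustrative assumptions; the paper does not specify them here, and the two-stage training is only hinted at in the docstring.

```python
import torch
import torch.nn as nn

class TwoEncoderGenerator(nn.Module):
    """Hypothetical sketch: an autoencoder generator with two encoders,
    one for emotion-independent (content/speaker) features and one for
    emotion features, plus a decoder conditioned on the target emotion.
    Stage 1 would train this as a plain autoencoder (reconstruction);
    stage 2 would add the adversarial and classification losses."""

    def __init__(self, n_mels=80, content_dim=128, emotion_dim=16, n_emotions=4):
        super().__init__()
        # Encoder for features that must be preserved (content, speaker).
        self.content_encoder = nn.Sequential(
            nn.Conv1d(n_mels, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(256, content_dim, kernel_size=5, padding=2),
        )
        # Encoder for the emotion-dependent component.
        self.emotion_encoder = nn.Sequential(
            nn.Conv1d(n_mels, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(64, emotion_dim, kernel_size=5, padding=2),
        )
        # Decoder sees both streams plus a one-hot target-emotion label
        # broadcast over time (StarGAN-style conditioning).
        self.decoder = nn.Sequential(
            nn.Conv1d(content_dim + emotion_dim + n_emotions, 256,
                      kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(256, n_mels, kernel_size=5, padding=2),
        )

    def forward(self, mel, target_emotion_onehot):
        content = self.content_encoder(mel)    # emotion-independent stream
        emotion = self.emotion_encoder(mel)    # emotion-dependent stream
        label = target_emotion_onehot[:, :, None].expand(-1, -1, mel.size(-1))
        return self.decoder(torch.cat([content, emotion, label], dim=1))
```

On the reported gains: Micro-F1 pools true/false positives over all utterances before computing F1, while Macro-F1 averages per-class F1 scores with equal weight, so the larger Macro-F1 improvement suggests the augmentation particularly helps the less frequent emotion classes.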
Related papers
- EmotiveTalk: Expressive Talking Head Generation through Audio Information Decoupling and Emotional Video Diffusion [49.55774551366049]
Diffusion models have revolutionized the field of talking head generation, yet still face challenges in expressiveness, controllability, and stability over long-duration generation.
We propose the EmotiveTalk framework to address these issues.
Experimental results show that EmotiveTalk can generate expressive talking head videos while ensuring controllable emotion and stable long-duration generation.
arXiv Detail & Related papers (2024-11-23T04:38:51Z)
- E2E-AFG: An End-to-End Model with Adaptive Filtering for Retrieval-Augmented Generation [3.544259721580075]
We propose an end-to-end model with adaptive filtering for retrieval-augmented generation (E2E-AFG).
We evaluate E2E-AFG on six representative knowledge-intensive language datasets.
arXiv Detail & Related papers (2024-11-01T08:02:09Z)
- Self-Taught Recognizer: Toward Unsupervised Adaptation for Speech Foundation Models [84.8919069953397]
Self-TAught Recognizer (STAR) is an unsupervised adaptation framework for speech recognition systems.
We show that STAR achieves an average of 13.5% relative reduction in word error rate across 14 target domains.
STAR is highly data-efficient, requiring less than one hour of unlabeled data.
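The summary above describes self-training-style unsupervised adaptation; below is a generic pseudo-labelling loop in that spirit. The confidence filter and sample weighting are placeholders, not STAR's actual quality indicator, and both model methods are hypothetical.

```python
def self_train(model, unlabeled_audio, confidence_threshold=0.8, epochs=1):
    """Generic self-training sketch: the model transcribes unlabeled
    target-domain audio, low-confidence hypotheses are discarded, and the
    model is fine-tuned on its own filtered, confidence-weighted outputs."""
    pseudo_labeled = []
    for audio in unlabeled_audio:
        text, conf = model.transcribe_with_confidence(audio)  # hypothetical API
        if conf >= confidence_threshold:
            pseudo_labeled.append((audio, text, conf))
    for _ in range(epochs):
        for audio, text, conf in pseudo_labeled:
            # Weight each utterance's loss by its confidence score.
            model.train_step(audio, text, sample_weight=conf)  # hypothetical API
    return model
```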
arXiv Detail & Related papers (2024-05-23T04:27:11Z)
- Non-autoregressive real-time Accent Conversion model with voice cloning [0.0]
We have developed a non-autoregressive model for real-time accent conversion with voice cloning.
Given input L2 (non-native) speech, the model generates native-sounding L1 speech with minimal latency.
The model has the ability to save, clone and change the timbre, gender and accent of the speaker's voice in real time.
arXiv Detail & Related papers (2024-05-21T19:07:26Z)
- EmoDiarize: Speaker Diarization and Emotion Identification from Speech Signals using Convolutional Neural Networks [0.0]
This research explores the integration of deep learning techniques in speech emotion recognition.
It introduces a framework that combines a pre-existing speaker diarization pipeline with an emotion identification model built on a Convolutional Neural Network (CNN).
The proposed model yields an unweighted accuracy of 63%, demonstrating its effectiveness in identifying emotional states within speech signals.
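A minimal sketch of the combined pipeline described above, with `diarizer` and `emotion_cnn` standing in for the paper's concrete components (their interfaces here are assumptions):

```python
def emotions_per_speaker(audio, diarizer, emotion_cnn):
    """Run diarization first, then classify the emotion of each
    per-speaker segment with a CNN."""
    results = []
    for speaker, start, end in diarizer(audio):   # hypothetical interface
        segment = audio[start:end]
        emotion = emotion_cnn(segment)            # e.g. argmax over emotion classes
        results.append((speaker, start, end, emotion))
    return results
```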
arXiv Detail & Related papers (2023-10-19T16:02:53Z)
- Multimodal Emotion Recognition using Transfer Learning from Speaker Recognition and BERT-based models [53.31917090073727]
We propose a neural network-based emotion recognition framework that uses a late fusion of transfer-learned and fine-tuned models from speech and text modalities.
We evaluate the effectiveness of our proposed multimodal approach on the interactive emotional dyadic motion capture dataset.
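Late fusion here means each modality is classified independently and the class scores are only combined at the end; a minimal sketch follows, where the fixed mixing weight is an assumption rather than the paper's learned fusion:

```python
import torch

def late_fusion(speech_logits, text_logits, alpha=0.5):
    """Mix class probabilities from a speech model and a text (BERT-based)
    model; alpha balances the two modalities."""
    speech_prob = torch.softmax(speech_logits, dim=-1)
    text_prob = torch.softmax(text_logits, dim=-1)
    return (alpha * speech_prob + (1 - alpha) * text_prob).argmax(dim=-1)
```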
arXiv Detail & Related papers (2022-02-16T00:23:42Z)
- CycleTransGAN-EVC: A CycleGAN-based Emotional Voice Conversion Model with Transformer [11.543807097834785]
We propose a CycleGAN-based model with a Transformer and investigate its ability in the emotional voice conversion task.
In the training procedure, we adopt curriculum learning to gradually increase the frame length, so that the model sees segments ranging from short clips up to the entire utterance (a minimal schedule sketch follows below).
The results show that our proposed model is able to convert emotion with higher strength and quality.
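A minimal sketch of such a frame-length curriculum; the starting length, step size, and cap are illustrative assumptions, not the paper's values:

```python
def frame_length_schedule(epoch, start_frames=64, step=32, max_frames=1024):
    """Grow the training segment length with the epoch index, from short
    clips up to (a cap standing in for) the entire utterance."""
    return min(start_frames + epoch * step, max_frames)

# usage: crop each training mel-spectrogram to the current curriculum length
# cur_len = frame_length_schedule(epoch)
# mel = mel[:, :, :cur_len]
```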
arXiv Detail & Related papers (2021-11-30T06:33:57Z)
- Decoupling Speaker-Independent Emotions for Voice Conversion Via Source-Filter Networks [14.55242023708204]
We propose a novel Source-Filter-based Emotional VC model (SFEVC) to achieve proper filtering of speaker-independent emotion features.
Our SFEVC model consists of multi-channel encoders, separate emotion encoders, and one decoder.
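A minimal sketch of that encoder/decoder layout, with every callable a placeholder for the paper's actual modules:

```python
import torch

def sfevc_forward(source_feats, filter_feats, target_emotion,
                  source_enc, filter_enc, emotion_enc, decoder):
    """Encode source (excitation) and filter (spectral envelope) channels
    separately, add an emotion code, and decode them jointly."""
    z = torch.cat([source_enc(source_feats),
                   filter_enc(filter_feats),
                   emotion_enc(target_emotion)], dim=-1)
    return decoder(z)
```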
arXiv Detail & Related papers (2021-10-04T03:14:48Z)
- Factorized Neural Transducer for Efficient Language Model Adaptation [51.81097243306204]
We propose a novel model, the factorized neural Transducer, which factorizes the blank and vocabulary prediction.
It is expected that this factorization can transfer the improvement of the standalone language model to the Transducer for speech recognition.
We demonstrate that the proposed factorized neural Transducer yields 15% to 20% WER improvements when out-of-domain text data is used for language model adaptation.
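The factorization described above separates the blank score from the vocabulary distribution, so the vocabulary branch can be adapted on text alone like a standalone language model. A rough sketch of a joint network in that spirit; dimensions and the exact combination rule are assumptions, and the encoder/predictor outputs are assumed already broadcast to matching shapes:

```python
import torch
import torch.nn as nn

class FactorizedJoint(nn.Module):
    """Blank and vocabulary predictions come from separate branches; the
    vocabulary branch depends only on the label predictor, like an LM, and
    can be fine-tuned on out-of-domain text without touching the blank branch."""

    def __init__(self, enc_dim=512, pred_dim=512, vocab_size=4000):
        super().__init__()
        self.blank_branch = nn.Linear(enc_dim + pred_dim, 1)  # blank logit
        self.vocab_lm = nn.Linear(pred_dim, vocab_size)       # LM-like branch
        self.enc_proj = nn.Linear(enc_dim, vocab_size)        # acoustic evidence

    def forward(self, enc, pred):
        blank = self.blank_branch(torch.cat([enc, pred], dim=-1))
        vocab = self.vocab_lm(pred) + self.enc_proj(enc)
        return torch.cat([blank, vocab], dim=-1)  # logits over [blank] + vocab
```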
arXiv Detail & Related papers (2021-09-27T15:04:00Z)
- VAW-GAN for Disentanglement and Recomposition of Emotional Elements in Speech [91.92456020841438]
We study the disentanglement and recomposition of emotional elements in speech through a variational autoencoding Wasserstein generative adversarial network (VAW-GAN).
We propose a speaker-dependent EVC framework that includes two VAW-GAN pipelines, one for spectrum conversion, and another for prosody conversion.
Experiments validate the effectiveness of our proposed method in both objective and subjective evaluations.
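A minimal sketch of the two-pipeline layout described above, with both converters as placeholders for trained VAW-GAN models:

```python
def convert_emotion(mel_spectrum, f0, target_emotion,
                    spectrum_vawgan, prosody_vawgan):
    """One pipeline converts the spectrum, the other converts prosody
    (e.g. F0), both conditioned on the target emotion."""
    return (spectrum_vawgan(mel_spectrum, target_emotion),
            prosody_vawgan(f0, target_emotion))
```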
arXiv Detail & Related papers (2020-11-03T08:49:33Z)
- Seen and Unseen emotional style transfer for voice conversion with a new emotional speech dataset [84.53659233967225]
Emotional voice conversion aims to transform emotional prosody in speech while preserving the linguistic content and speaker identity.
We propose a novel framework based on the variational auto-encoding Wasserstein generative adversarial network (VAW-GAN).
We show that the proposed framework achieves remarkable performance by consistently outperforming the baseline framework.
arXiv Detail & Related papers (2020-10-28T07:16:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.