DurFlex-EVC: Duration-Flexible Emotional Voice Conversion Leveraging Discrete Representations without Text Alignment
- URL: http://arxiv.org/abs/2401.08095v4
- Date: Tue, 21 Jan 2025 02:51:53 GMT
- Title: DurFlex-EVC: Duration-Flexible Emotional Voice Conversion Leveraging Discrete Representations without Text Alignment
- Authors: Hyung-Seok Oh, Sang-Hoon Lee, Deok-Hyeon Cho, Seong-Whan Lee
- Abstract summary: DurFlex-EVC is a duration-flexible EVC framework that operates without the need for text or alignment information.
We introduce a unit aligner that models contextual information by aligning speech with discrete units representing content, eliminating the need for text or speech-text alignment.
We also design a style autoencoder that effectively disentangles content and emotional style, allowing precise manipulation of the emotional characteristics of the speech.
- Abstract: Emotional voice conversion (EVC) involves modifying various acoustic characteristics, such as pitch and spectral envelope, to match a desired emotional state while preserving the speaker's identity. Existing EVC methods often rely on text transcriptions or time-alignment information and struggle to handle varying speech durations effectively. In this paper, we propose DurFlex-EVC, a duration-flexible EVC framework that operates without the need for text or alignment information. We introduce a unit aligner that models contextual information by aligning speech with discrete units representing content, eliminating the need for text or speech-text alignment. Additionally, we design a style autoencoder that effectively disentangles content and emotional style, allowing precise manipulation of the emotional characteristics of the speech. We further enhance emotional expressiveness through a hierarchical stylize encoder that applies the target emotional style at multiple hierarchical levels, refining the stylization process to improve the naturalness and expressiveness of the converted speech. Experimental results from subjective and objective evaluations demonstrate that our approach outperforms baseline models, effectively handling duration variability and enhancing emotional expressiveness in the converted speech.
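The duration-flexibility idea in the abstract can be illustrated with a toy sketch. The snippet below is not the authors' implementation: it stands in for the unit aligner with simple nearest-neighbour quantization over a random codebook (the paper learns the alignment with attention), and shows how collapsing repeated discrete units decouples content from frame-level durations. All arrays, dimensions, and the codebook size are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def to_units(features, codebook):
    """Toy stand-in for the unit aligner: assign each frame to its
    nearest discrete content unit via nearest-neighbour quantization."""
    # features: (T, D), codebook: (K, D) -> (T,) unit index per frame
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)

def deduplicate(units):
    """Collapse consecutive repeated units, so the content sequence is
    no longer tied to the source utterance's frame-level durations."""
    keep = np.ones(len(units), dtype=bool)
    keep[1:] = units[1:] != units[:-1]
    return units[keep]

# Toy data: 50 frames of 8-dim features, a codebook of 16 units.
features = rng.normal(size=(50, 8))
codebook = rng.normal(size=(16, 8))

units = to_units(features, codebook)     # per-frame unit indices
content = deduplicate(units)             # duration-flexible content units
print(len(units), len(content))
```

In the actual framework the deduplicated units would then be re-expanded to a target duration and stylized with the target emotion; this sketch only covers the content-extraction step.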
Related papers
- Enriching Multimodal Sentiment Analysis through Textual Emotional Descriptions of Visual-Audio Content [56.62027582702816]
Multimodal Sentiment Analysis seeks to unravel human emotions by amalgamating text, audio, and visual data.
Yet, discerning subtle emotional nuances within audio and video expressions poses a formidable challenge.
We introduce DEVA, a progressive fusion framework founded on textual sentiment descriptions.
arXiv Detail & Related papers (2024-12-12T11:30:41Z)
- Controlling Emotion in Text-to-Speech with Natural Language Prompts [29.013577423045255]
We propose a system conditioned on embeddings derived from an emotionally rich text that serves as a prompt.
A joint representation of speaker and prompt embeddings is integrated at several points within a transformer-based architecture.
Our approach is trained on merged emotional speech and text datasets and varies the prompts in each training iteration to increase the generalization capabilities of the model.
arXiv Detail & Related papers (2024-06-10T15:58:42Z)
- Emotion Rendering for Conversational Speech Synthesis with Heterogeneous Graph-Based Context Modeling [50.99252242917458]
Conversational Speech Synthesis (CSS) aims to accurately express an utterance with the appropriate prosody and emotional inflection within a conversational setting.
To address the issue of data scarcity, we meticulously create emotional labels in terms of category and intensity.
Our model outperforms the baseline models in understanding and rendering emotions.
arXiv Detail & Related papers (2023-12-19T08:47:50Z)
- Textless Speech Emotion Conversion using Decomposed and Discrete Representations [49.55101900501656]
We decompose speech into discrete and disentangled learned representations, consisting of content units, F0, speaker, and emotion.
First, we modify the speech content by translating the content units to a target emotion, and then predict the prosodic features based on these units.
Finally, the speech waveform is generated by feeding the predicted representations into a neural vocoder.
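The three-stage pipeline above (translate content units to the target emotion, predict prosody from the units, then vocode) can be sketched as plain function composition. Every component below is a placeholder: the real system uses learned sequence-to-sequence models and a neural vocoder, whereas here unit translation is a fixed offset, F0 prediction is a made-up formula, and the "vocoder" just returns a silent buffer of the right length.

```python
import numpy as np

# Hypothetical emotion-to-offset table; purely illustrative.
EMOTION_OFFSET = {"neutral": 0, "happy": 1, "sad": 2}

def translate_units(units, emotion):
    """Stage 1 (placeholder): map content units toward a target emotion."""
    off = EMOTION_OFFSET[emotion]
    return [(u + off) % 100 for u in units]

def predict_f0(units):
    """Stage 2 (placeholder): predict an F0 contour from the units."""
    return [100.0 + 5.0 * (u % 10) for u in units]

def vocode(units, f0):
    """Stage 3 (placeholder): a neural vocoder would synthesize audio;
    here we emit a silent buffer of 80 samples per unit."""
    assert len(units) == len(f0)
    return np.zeros(len(units) * 80)

units = [3, 3, 7, 12]                    # discrete content units
target = translate_units(units, "happy")  # units in the target emotion
f0 = predict_f0(target)                   # prosody from translated units
wav = vocode(target, f0)                  # waveform stand-in
print(wav.shape)
```

The point of the sketch is only the dataflow: prosody is predicted *after* translation, so it reflects the target emotion's units rather than the source utterance's.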
arXiv Detail & Related papers (2021-11-14T18:16:42Z)
- Decoupling Speaker-Independent Emotions for Voice Conversion Via Source-Filter Networks [14.55242023708204]
We propose a novel Source-Filter-based Emotional VC model (SFEVC) to achieve proper filtering of speaker-independent emotion features.
Our SFEVC model consists of multi-channel encoders, emotion separate encoders, and one decoder.
arXiv Detail & Related papers (2021-10-04T03:14:48Z)
- An Attribute-Aligned Strategy for Learning Speech Representation [57.891727280493015]
We propose an attribute-aligned learning strategy to derive speech representations that can flexibly address these issues via an attribute-selection mechanism.
Specifically, we propose a layered-representation variational autoencoder (LR-VAE), which factorizes speech representation into attribute-sensitive nodes.
Our proposed method achieves competitive performance on identity-free SER and better performance on emotionless SV.
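The attribute-selection idea can be mimicked with a fixed partition of a latent vector: keep the nodes for the attributes you want and zero out the rest. The slice layout below is purely illustrative and is not the LR-VAE's actual structure, which learns attribute-sensitive nodes rather than using hand-assigned slices.

```python
import numpy as np

# Hypothetical layered latent: dims 0-3 ~ speaker, 4-7 ~ emotion,
# 8-15 ~ residual content. Layout is made up for illustration.
LATENT = {"speaker": slice(0, 4), "emotion": slice(4, 8), "content": slice(8, 16)}

def select_attributes(z, keep):
    """Attribute selection: zero the latent nodes of attributes not kept."""
    out = np.zeros_like(z)
    for name in keep:
        out[..., LATENT[name]] = z[..., LATENT[name]]
    return out

z = np.random.default_rng(1).normal(size=(16,))
# Identity-free representation for emotion recognition:
z_identity_free = select_attributes(z, keep=["emotion", "content"])
print(z_identity_free[:4])  # speaker nodes are suppressed
```

A downstream task (e.g. speaker verification without emotion, or emotion recognition without identity) would then consume only the selected nodes.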
arXiv Detail & Related papers (2021-06-05T06:19:14Z)
- Limited Data Emotional Voice Conversion Leveraging Text-to-Speech: Two-stage Sequence-to-Sequence Training [91.95855310211176]
Emotional voice conversion aims to change the emotional state of an utterance while preserving the linguistic content and speaker identity.
We propose a novel two-stage training strategy for sequence-to-sequence emotional voice conversion with a limited amount of emotional speech data.
The proposed framework can perform both spectrum and prosody conversion and achieves significant improvement over the state-of-the-art baselines in both objective and subjective evaluation.
arXiv Detail & Related papers (2021-03-31T04:56:14Z)
- Seen and Unseen emotional style transfer for voice conversion with a new emotional speech dataset [84.53659233967225]
Emotional voice conversion aims to transform emotional prosody in speech while preserving the linguistic content and speaker identity.
We propose a novel framework based on a variational auto-encoding Wasserstein generative adversarial network (VAW-GAN).
We show that the proposed framework achieves remarkable performance by consistently outperforming the baseline framework.
arXiv Detail & Related papers (2020-10-28T07:16:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.