An Overview & Analysis of Sequence-to-Sequence Emotional Voice
Conversion
- URL: http://arxiv.org/abs/2203.15873v1
- Date: Tue, 29 Mar 2022 19:41:34 GMT
- Title: An Overview & Analysis of Sequence-to-Sequence Emotional Voice
Conversion
- Authors: Zijiang Yang, Xin Jing, Andreas Triantafyllopoulos, Meishu Song, Ilhan
Aslan, Björn W. Schuller
- Abstract summary: Sequence-to-sequence modelling is emerging as a competitive paradigm for models to overcome EVC challenges.
Recent sequence-to-sequence EVC papers were investigated and reviewed from six perspectives.
This information is organised to provide the research community with an easily digestible overview of the current state-of-the-art.
- Score: 8.94336505787464
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Emotional voice conversion (EVC) focuses on converting a speech utterance
from a source to a target emotion; it can thus be a key enabling technology for
human-computer interaction applications and beyond. However, EVC remains an
unsolved research problem with several challenges. In particular, as speech
rate and rhythm are two key factors of emotional conversion, models have to
generate output sequences of differing length. Sequence-to-sequence modelling
has recently emerged as a competitive paradigm for models that can overcome
those challenges. In an attempt to stimulate further research in this promising
new direction, recent sequence-to-sequence EVC papers were systematically
investigated and reviewed from six perspectives: their motivation, training
strategies, model architectures, datasets, model inputs, and evaluation
methods. This information is organised to provide the research community with
an easily digestible overview of the current state-of-the-art. Finally, we
discuss existing challenges of sequence-to-sequence EVC.
Related papers
- State-Space Modeling in Long Sequence Processing: A Survey on Recurrence in the Transformer Era [59.279784235147254]
This survey provides an in-depth summary of the latest approaches that are based on recurrent models for sequential data processing.
The emerging picture suggests that there is room for thinking of novel routes, constituted by learning algorithms which depart from the standard Backpropagation Through Time.
(2024-06-13T12:51:22Z)
- From CNNs to Transformers in Multimodal Human Action Recognition: A Survey [23.674123304219822]
Human action recognition is one of the most widely studied research problems in Computer Vision.
Recent studies have shown that addressing it using multimodal data leads to superior performance.
The recent rise of Transformers in visual modelling is now also causing a paradigm shift for the action recognition task.
(2024-05-22T02:11:18Z)
- SEGAA: A Unified Approach to Predicting Age, Gender, and Emotion in Speech [0.0]
This study ventures into predicting age, gender, and emotion from vocal cues, a field with vast applications.
Exploring deep learning models for these predictions involves comparing single, multi-output, and sequential models highlighted in this paper.
The experiments suggest that Multi-output models perform comparably to individual models, efficiently capturing the intricate relationships between variables and speech inputs, all while achieving improved runtime.
(2024-03-01T11:28:37Z)
- On the Resurgence of Recurrent Models for Long Sequences -- Survey and Research Opportunities in the Transformer Era [59.279784235147254]
This survey is aimed at providing an overview of these trends framed under the unifying umbrella of Recurrence.
It emphasizes novel research opportunities that become prominent when abandoning the idea of processing long sequences.
(2024-02-12T23:55:55Z)
- DurFlex-EVC: Duration-Flexible Emotional Voice Conversion with Parallel Generation [37.35829410807451]
Emotional voice conversion (EVC) seeks to modify the emotional tone of a speaker's voice.
Recent advancements in EVC have involved the simultaneous modeling of pitch and duration.
This study shifts focus towards parallel speech generation.
(2024-01-16T03:39:35Z)
- Predicting Evoked Emotions in Conversations [6.0866477571088895]
We introduce the novel problem of Predicting Emotions in Conversations (PEC) for the next turn (n+1).
We systematically approach the problem by modeling three dimensions inherently connected to evoked emotions in dialogues.
We perform a comprehensive empirical evaluation of the various proposed models for addressing the PEC problem.
(2023-12-31T03:30:42Z)
- Visual Chain of Thought: Bridging Logical Gaps with Multimodal Infillings [61.04460792203266]
We introduce VCoT, a novel method that leverages chain-of-thought prompting with vision-language grounding to bridge the logical gaps within sequential data.
Our method uses visual guidance to generate synthetic multimodal infillings that add consistent and novel information to reduce the logical gaps for downstream tasks.
(2023-05-03T17:58:29Z)
- A Hierarchical Regression Chain Framework for Affective Vocal Burst Recognition [72.36055502078193]
We propose a hierarchical framework, based on chain regression models, for affective recognition from vocal bursts.
To address the challenge of data sparsity, we also use self-supervised learning (SSL) representations with layer-wise and temporal aggregation modules.
The proposed systems participated in the ACII Affective Vocal Burst (A-VB) Challenge 2022 and ranked first in the "TWO" and "CULTURE" tasks.
(2023-03-14T16:08:45Z)
- Continuous-Time Audiovisual Fusion with Recurrence vs. Attention for In-The-Wild Affect Recognition [4.14099371030604]
We present our submission to the 3rd Affective Behavior Analysis in-the-wild (ABAW) challenge.
Recurrence and attention are the two widely used sequence modelling mechanisms in the literature.
We show that LSTM-RNNs can outperform the attention models when coupled with low-complex CNN backbones.
(2022-03-24T18:22:56Z)
- Multi-View Sequence-to-Sequence Models with Conversational Structure for Abstractive Dialogue Summarization [72.54873655114844]
Text summarization is one of the most challenging and interesting problems in NLP.
This work proposes a multi-view sequence-to-sequence model by first extracting conversational structures of unstructured daily chats from different views to represent conversations.
Experiments on a large-scale dialogue summarization corpus demonstrated that our methods significantly outperformed previous state-of-the-art models via both automatic evaluations and human judgment.
(2020-10-04T20:12:44Z)
- Pretraining Techniques for Sequence-to-Sequence Voice Conversion [57.65753150356411]
Sequence-to-sequence (seq2seq) voice conversion (VC) models are attractive owing to their ability to convert prosody.
We propose to transfer knowledge from other speech processing tasks where large-scale corpora are easily available, typically text-to-speech (TTS) and automatic speech recognition (ASR).
We argue that VC models with such pretrained ASR or TTS model parameters can generate effective hidden representations for high-fidelity, highly intelligible converted speech.
(2020-08-07T11:02:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.