Related papers: An Overview & Analysis of Sequence-to-Sequence Emotional Voice Conversion

URL: http://arxiv.org/abs/2203.15873v1
Date: Tue, 29 Mar 2022 19:41:34 GMT
Title: An Overview & Analysis of Sequence-to-Sequence Emotional Voice Conversion
Authors: Zijiang Yang, Xin Jing, Andreas Triantafyllopoulos, Meishu Song, Ilhan Aslan, Björn W. Schuller
Abstract summary: Sequence-to-sequence modelling is emerging as a competitive paradigm for models to overcome EVC challenges. Recent sequence-to-sequence EVC papers were investigated and reviewed from six perspectives. This information is organised to provide the research community with an easily digestible overview of the current state-of-the-art.
Score: 8.94336505787464
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Emotional voice conversion (EVC) focuses on converting a speech utterance from a source to a target emotion; it can thus be a key enabling technology for human-computer interaction applications and beyond. However, EVC remains an unsolved research problem with several challenges. In particular, as speech rate and rhythm are two key factors of emotional conversion, models have to generate output sequences of differing length. Sequence-to-sequence modelling is recently emerging as a competitive paradigm for models that can overcome those challenges. In an attempt to stimulate further research in this promising new direction, recent sequence-to-sequence EVC papers were systematically investigated and reviewed from six perspectives: their motivation, training strategies, model architectures, datasets, model inputs, and evaluation methods. This information is organised to provide the research community with an easily digestible overview of the current state-of-the-art. Finally, we discuss existing challenges of sequence-to-sequence EVC.

Related papers

Neural Network Reprogrammability: A Unified Theme on Model Reprogramming, Prompt Tuning, and Prompt Instruction [55.914891182214475]
We introduce neural network reprogrammability as a unifying framework for model adaptation. We present a taxonomy that categorizes such information manipulation approaches across four key dimensions. We also analyze remaining technical challenges and ethical considerations.
arXiv Detail & Related papers (2025-06-05T05:42:27Z)
Shifting AI Efficiency From Model-Centric to Data-Centric Compression [33.41504505470217]
We argue that the focus of research for efficient AI is shifting from model-centric compression to data-centric compression. We position token compression as the new frontier, which improves AI efficiency via reducing the number of tokens during model training or inference.
arXiv Detail & Related papers (2025-05-25T13:51:17Z)
Where are we in audio deepfake detection? A systematic analysis over generative and detection models [59.09338266364506]
SONAR is a synthetic AI-Audio Detection Framework and Benchmark. It provides a comprehensive evaluation for distinguishing cutting-edge AI-synthesized auditory content. It is the first framework to uniformly benchmark AI-audio detection across both traditional and foundation model-based detection systems.
arXiv Detail & Related papers (2024-10-06T01:03:42Z)
State-Space Modeling in Long Sequence Processing: A Survey on Recurrence in the Transformer Era [59.279784235147254]
This survey provides an in-depth summary of the latest approaches that are based on recurrent models for sequential data processing. The emerging picture suggests that there is room for thinking of novel routes, constituted by learning algorithms which depart from the standard Backpropagation Through Time.
arXiv Detail & Related papers (2024-06-13T12:51:22Z)
From CNNs to Transformers in Multimodal Human Action Recognition: A Survey [23.674123304219822]
Human action recognition is one of the most widely studied research problems in Computer Vision. Recent studies have shown that addressing it using multimodal data leads to superior performance. Recent rise of Transformers in visual modelling is now also causing a paradigm shift for the action recognition task.
arXiv Detail & Related papers (2024-05-22T02:11:18Z)
SEGAA: A Unified Approach to Predicting Age, Gender, and Emotion in Speech [0.0]
This study ventures into predicting age, gender, and emotion from vocal cues, a field with vast applications. Exploring deep learning models for these predictions involves comparing single, multi-output, and sequential models highlighted in this paper. The experiments suggest that Multi-output models perform comparably to individual models, efficiently capturing the intricate relationships between variables and speech inputs, all while achieving improved runtime.
arXiv Detail & Related papers (2024-03-01T11:28:37Z)
On the Resurgence of Recurrent Models for Long Sequences -- Survey and Research Opportunities in the Transformer Era [59.279784235147254]
This survey is aimed at providing an overview of these trends framed under the unifying umbrella of Recurrence. It emphasizes novel research opportunities that become prominent when abandoning the idea of processing long sequences.
arXiv Detail & Related papers (2024-02-12T23:55:55Z)
Predicting Evoked Emotions in Conversations [6.0866477571088895]
We introduce the novel problem of Predicting Emotions in Conversations (PEC) for the next turn (n+1). We systematically approach the problem by modeling three dimensions inherently connected to evoked emotions in dialogues. We perform a comprehensive empirical evaluation of the various proposed models for addressing the PEC problem.
arXiv Detail & Related papers (2023-12-31T03:30:42Z)
Visual Chain of Thought: Bridging Logical Gaps with Multimodal Infillings [61.04460792203266]
We introduce VCoT, a novel method that leverages chain-of-thought prompting with vision-language grounding to bridge the logical gaps within sequential data. Our method uses visual guidance to generate synthetic multimodal infillings that add consistent and novel information to reduce the logical gaps for downstream tasks.
arXiv Detail & Related papers (2023-05-03T17:58:29Z)
A Hierarchical Regression Chain Framework for Affective Vocal Burst Recognition [72.36055502078193]
We propose a hierarchical framework, based on chain regression models, for affective recognition from vocal bursts. To address the challenge of data sparsity, we also use self-supervised learning (SSL) representations with layer-wise and temporal aggregation modules. The proposed systems participated in the ACII Affective Vocal Burst (A-VB) Challenge 2022 and ranked first in the "TWO" and "CULTURE" tasks.
arXiv Detail & Related papers (2023-03-14T16:08:45Z)
Continuous-Time Audiovisual Fusion with Recurrence vs. Attention for In-The-Wild Affect Recognition [4.14099371030604]
We present our submission to the 3rd Affective Behavior Analysis in-the-wild (ABAW) challenge. Recurrence and attention are the two widely used sequence modelling mechanisms in the literature. We show that LSTM-RNNs can outperform the attention models when coupled with low-complex CNN backbones.
arXiv Detail & Related papers (2022-03-24T18:22:56Z)
Multi-View Sequence-to-Sequence Models with Conversational Structure for Abstractive Dialogue Summarization [72.54873655114844]
Text summarization is one of the most challenging and interesting problems in NLP. This work proposes a multi-view sequence-to-sequence model by first extracting conversational structures of unstructured daily chats from different views to represent conversations. Experiments on a large-scale dialogue summarization corpus demonstrated that our methods significantly outperformed previous state-of-the-art models via both automatic evaluations and human judgment.
arXiv Detail & Related papers (2020-10-04T20:12:44Z)
Pretraining Techniques for Sequence-to-Sequence Voice Conversion [57.65753150356411]
Sequence-to-sequence (seq2seq) voice conversion (VC) models are attractive owing to their ability to convert prosody. We propose to transfer knowledge from other speech processing tasks where large-scale corpora are easily available, typically text-to-speech (TTS) and automatic speech recognition (ASR). We argue that VC models with such pretrained ASR or TTS model parameters can generate effective hidden representations for high-fidelity, highly intelligible converted speech.
arXiv Detail & Related papers (2020-08-07T11:02:07Z)

This list is automatically generated from the titles and abstracts of the papers in this site.