An Overview of Affective Speech Synthesis and Conversion in the Deep
Learning Era
- URL: http://arxiv.org/abs/2210.03538v1
- Date: Thu, 6 Oct 2022 13:55:59 GMT
- Title: An Overview of Affective Speech Synthesis and Conversion in the Deep
Learning Era
- Authors: Andreas Triantafyllopoulos, Björn W. Schuller, Gökçe İymen,
Metin Sezgin, Xiangheng He, Zijiang Yang, Panagiotis Tzirakis, Shuo Liu,
Silvan Mertes, Elisabeth André, Ruibo Fu, Jianhua Tao
- Abstract summary: Affect, or expressivity, has the capacity to turn speech into a medium capable of conveying intimate thoughts, feelings, and emotions.
Following recent advances in text-to-speech synthesis, a paradigm shift is well under way in the fields of affective speech synthesis and conversion.
Deep learning, the technology which underlies most of the recent advances in artificial intelligence, is spearheading these efforts.
- Score: 39.91844543424965
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Speech is the fundamental mode of human communication, and its synthesis has
long been a core priority in human-computer interaction research. In recent
years, machines have managed to master the art of generating speech that is
understandable by humans. But the linguistic content of an utterance
encompasses only a part of its meaning. Affect, or expressivity, has the
capacity to turn speech into a medium capable of conveying intimate thoughts,
feelings, and emotions -- aspects that are essential for engaging and
naturalistic interpersonal communication. While the goal of imparting
expressivity to synthesised utterances has so far remained elusive, following
recent advances in text-to-speech synthesis, a paradigm shift is well under way
in the fields of affective speech synthesis and conversion as well. Deep
learning, as the technology which underlies most of the recent advances in
artificial intelligence, is spearheading these efforts. In the present
overview, we outline ongoing trends and summarise state-of-the-art approaches
in an attempt to provide a comprehensive overview of this exciting field.
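As a concrete illustration of the kind of deep-learning approach surveyed in the paper, the sketch below shows one common way of conditioning a sequence-to-sequence acoustic model on a categorical emotion embedding. It is a minimal toy example, not an architecture from the paper; all module names and sizes are invented for illustration.

```python
import torch
import torch.nn as nn

class EmotionConditionedTTS(nn.Module):
    """Toy acoustic model: text tokens + emotion label -> mel frames.
    Illustrative only; real systems add attention or duration modelling
    and use far larger networks."""

    def __init__(self, vocab_size=100, n_emotions=7, d_model=128, n_mels=80):
        super().__init__()
        self.text_emb = nn.Embedding(vocab_size, d_model)
        # A learned embedding per emotion category is the simplest
        # conditioning mechanism; reference encoders and style tokens
        # are common alternatives.
        self.emotion_emb = nn.Embedding(n_emotions, d_model)
        self.encoder = nn.GRU(d_model, d_model, batch_first=True)
        self.to_mel = nn.Linear(d_model, n_mels)

    def forward(self, token_ids, emotion_id):
        x = self.text_emb(token_ids)                   # (B, T, d)
        e = self.emotion_emb(emotion_id).unsqueeze(1)  # (B, 1, d)
        h, _ = self.encoder(x + e)                     # broadcast the emotion
        return self.to_mel(h)                          # (B, T, n_mels)

model = EmotionConditionedTTS()
tokens = torch.randint(0, 100, (1, 12))    # dummy phoneme ids
mels = model(tokens, torch.tensor([3]))    # emotion id 3, e.g. "happy"
print(mels.shape)                          # torch.Size([1, 12, 80])
```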
Related papers
- SIFToM: Robust Spoken Instruction Following through Theory of Mind [51.326266354164716]
We present a cognitively inspired model, Speech Instruction Following through Theory of Mind (SIFToM), to enable robots to pragmatically follow human instructions under diverse speech conditions.
Results show that the SIFToM model outperforms state-of-the-art speech and language models, approaching human-level accuracy on challenging speech instruction following tasks.
arXiv Detail & Related papers (2024-09-17T02:36:10Z)
- Multimodal Fusion with LLMs for Engagement Prediction in Natural Conversation [70.52558242336988]
We focus on predicting engagement in dyadic interactions by scrutinizing verbal and non-verbal cues, aiming to detect signs of disinterest or confusion.
In this work, we collect a dataset featuring 34 participants engaged in casual dyadic conversations, each providing self-reported engagement ratings at the end of each conversation.
We introduce a novel fusion strategy using Large Language Models (LLMs) to integrate multiple behavior modalities into a "multimodal transcript".
arXiv Detail & Related papers (2024-09-13T18:28:12Z)
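The "multimodal transcript" idea from the entry above, serialising non-verbal cues as text so that an LLM can reason over them together with the dialogue, might look roughly like the following sketch. The cue names and formatting are hypothetical, not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    speaker: str
    text: str
    nonverbal: dict  # e.g. gaze, facial expression, prosody tags

def to_multimodal_transcript(turns):
    """Serialise verbal + non-verbal behaviour into plain text that a
    Large Language Model can consume as a single prompt."""
    lines = []
    for t in turns:
        cues = ", ".join(f"{k}={v}" for k, v in t.nonverbal.items())
        lines.append(f"{t.speaker}: {t.text} [cues: {cues}]")
    return "\n".join(lines)

turns = [
    Turn("A", "So how was the trip?", {"gaze": "partner", "smile": "yes"}),
    Turn("B", "Oh... it was fine, I guess.", {"gaze": "away", "smile": "no"}),
]
prompt = to_multimodal_transcript(turns)
# `prompt` would then be sent to an LLM together with an instruction such
# as "rate B's engagement from 1 to 5", via any chat-completion interface.
print(prompt)
```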
- Expressivity and Speech Synthesis [51.75420054449122]
We outline the methodological advances that brought us so far and sketch out the ongoing efforts to reach that coveted next level of artificial expressivity.
We also discuss the societal implications coupled with rapidly advancing expressive speech synthesis (ESS) technology.
arXiv Detail & Related papers (2024-04-30T08:47:24Z)
- Humane Speech Synthesis through Zero-Shot Emotion and Disfluency Generation [0.6964027823688135]
Modern conversational systems lack the emotional depth and the disfluent characteristics of human interactions.
To address this shortcoming, we have designed an innovative speech synthesis pipeline.
Within this framework, a cutting-edge language model introduces both human-like emotion and disfluencies in a zero-shot setting.
arXiv Detail & Related papers (2024-03-31T00:38:02Z)
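A zero-shot step of the kind described in the entry above could be as simple as prompting a general-purpose LLM to rewrite clean text with emotion and disfluencies before it reaches the synthesiser. The prompt wording below is invented for illustration and is not the authors' pipeline.

```python
def build_disfluency_prompt(text: str, emotion: str) -> str:
    """Zero-shot instruction for a general-purpose LLM: rewrite clean text
    so that it sounds emotional and naturally disfluent before it is
    passed to a TTS engine. Prompt wording is illustrative only."""
    return (
        "Rewrite the following sentence so that it expresses the emotion "
        f"'{emotion}' and contains natural human disfluencies such as "
        '"um", "uh", repetitions, and mid-sentence restarts. '
        "Keep the meaning unchanged.\n\n"
        f"Sentence: {text}"
    )

prompt = build_disfluency_prompt("I did not expect to see you here.", "surprise")
# The returned prompt would be sent to any instruction-following LLM; its
# output (e.g. "Oh! I, uh... I really didn't expect to see you here!")
# then goes to the speech synthesiser.
print(prompt)
```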
- Attention-based Interactive Disentangling Network for Instance-level Emotional Voice Conversion [81.1492897350032]
Emotional Voice Conversion aims to manipulate speech according to a given emotion while preserving the non-emotion components.
We propose an Attention-based Interactive diseNtangling Network (AINN) that leverages instance-wise emotional knowledge for voice conversion.
arXiv Detail & Related papers (2023-12-29T08:06:45Z)
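Disentangling approaches of this kind typically split speech into a content representation and an emotion representation and then recombine the content with a target emotion. The toy sketch below shows that general scheme only; it is not the AINN architecture, and all dimensions are illustrative.

```python
import torch
import torch.nn as nn

class TinyDisentangler(nn.Module):
    """Separate speech features into content and emotion codes, then
    recombine content with the emotion of a *reference* utterance. A
    stripped-down stand-in for disentangling EVC models, not AINN."""

    def __init__(self, n_mels=80, d=64):
        super().__init__()
        self.content_enc = nn.GRU(n_mels, d, batch_first=True)
        self.emotion_enc = nn.GRU(n_mels, d, batch_first=True)
        self.decoder = nn.GRU(2 * d, n_mels, batch_first=True)

    def forward(self, src_mels, ref_mels):
        content, _ = self.content_enc(src_mels)   # frame-level content codes
        _, emo = self.emotion_enc(ref_mels)       # utterance-level emotion code
        emo = emo.transpose(0, 1).expand(-1, src_mels.size(1), -1)
        out, _ = self.decoder(torch.cat([content, emo], dim=-1))
        return out                                # converted mel frames

model = TinyDisentangler()
src = torch.randn(1, 100, 80)   # utterance to convert
ref = torch.randn(1, 80, 80)    # reference carrying the target emotion
print(model(src, ref).shape)    # torch.Size([1, 100, 80])
```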
- Neural Speech Embeddings for Speech Synthesis Based on Deep Generative Networks [27.64740032872726]
We introduce current brain-to-speech technology and the possibility of synthesising speech from brain signals.
We also perform a comprehensive analysis of the neural features and neural speech embeddings that underlie neurophysiological activation during speech production.
arXiv Detail & Related papers (2023-12-10T08:12:08Z)
- A Comprehensive Review of Data-Driven Co-Speech Gesture Generation [11.948557523215316]
The automatic generation of such co-speech gestures is a long-standing problem in computer animation.
Gesture generation has seen surging interest recently, owing to the emergence of more and larger datasets of human gesture motion.
This review article summarizes co-speech gesture generation research, with a particular focus on deep generative models.
arXiv Detail & Related papers (2023-01-13T00:20:05Z)
- Review of end-to-end speech synthesis technology based on deep learning [10.748200013505882]
The research focus is deep learning-based end-to-end speech synthesis technology.
Such a system mainly consists of three modules: a text front-end, an acoustic model, and a vocoder.
The paper also summarises the open-source speech corpora in English, Chinese, and other languages that can be used for speech synthesis tasks.
arXiv Detail & Related papers (2021-04-20T14:24:05Z)
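The three-module pipeline named in the entry above can be pictured as plain function composition. Every stage below is a placeholder for a real model (e.g. a grapheme-to-phoneme front-end, a Tacotron- or FastSpeech-style acoustic model, and a neural vocoder).

```python
def text_front_end(text: str) -> list[str]:
    """Normalise text and map it to phoneme-like units (toy version)."""
    return list(text.lower().replace(" ", "_"))

def acoustic_model(phonemes: list[str]) -> list[list[float]]:
    """Map phonemes to acoustic features (here: fake 80-dim mel frames)."""
    return [[0.0] * 80 for _ in phonemes]

def vocoder(mel_frames: list[list[float]]) -> list[float]:
    """Turn acoustic features into a waveform (here: silence)."""
    return [0.0 for _ in mel_frames for _ in range(256)]  # 256 samples/frame

waveform = vocoder(acoustic_model(text_front_end("Hello world")))
print(len(waveform))  # number of synthesised samples
```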
- Reinforcement Learning for Emotional Text-to-Speech Synthesis with Improved Emotion Discriminability [82.39099867188547]
Emotional text-to-speech synthesis (ETTS) has seen much progress in recent years.
We propose a new interactive training paradigm for ETTS, denoted as i-ETTS.
We formulate an iterative training strategy with reinforcement learning to ensure the quality of i-ETTS optimization.
arXiv Detail & Related papers (2021-04-03T13:52:47Z)
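The core idea of reward-driven ETTS, letting a speech emotion recogniser score synthesised speech and using its confidence in the target emotion as a reward, can be sketched with a generic REINFORCE update. All modules below are toy placeholders under assumed interfaces, not the paper's i-ETTS implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyStochasticTTS(nn.Module):
    """Samples a fake mel 'utterance' and returns its log-probability,
    so a policy-gradient update is possible. Illustrative only."""

    def __init__(self, vocab=100, d=64, n_mels=80):
        super().__init__()
        self.emb = nn.Embedding(vocab, d)
        self.mu = nn.Linear(d, n_mels)
        self.log_std = nn.Parameter(torch.zeros(n_mels))

    def sample(self, tokens):
        h = self.emb(tokens)                                  # (B, T, d)
        dist = torch.distributions.Normal(self.mu(h), self.log_std.exp())
        mels = dist.sample()                                  # stochastic output
        return mels, dist.log_prob(mels).sum(dim=(1, 2))      # (B,)

ser = nn.Sequential(nn.Flatten(), nn.Linear(12 * 80, 7))  # toy emotion recogniser
tts = ToyStochasticTTS()
opt = torch.optim.Adam(tts.parameters(), lr=1e-4)

tokens, target = torch.randint(0, 100, (1, 12)), 3
mels, log_prob = tts.sample(tokens)
with torch.no_grad():  # recogniser is scored frozen; only the TTS learns
    reward = F.softmax(ser(mels), dim=-1)[:, target]  # emotion confidence
loss = -(reward * log_prob).mean()                    # REINFORCE objective
opt.zero_grad()
loss.backward()
opt.step()
print(f"reward: {reward.item():.3f}")
```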