Analysis and Utilization of Entrainment on Acoustic and Emotion Features
in User-agent Dialogue
- URL: http://arxiv.org/abs/2212.03398v1
- Date: Wed, 7 Dec 2022 01:45:15 GMT
- Title: Analysis and Utilization of Entrainment on Acoustic and Emotion Features
in User-agent Dialogue
- Authors: Daxin Tan, Nikos Kargas, David McHardy, Constantinos Papayiannis,
Antonio Bonafonte, Marek Strelec, Jonas Rohnke, Agis Oikonomou Filandras,
Trevor Wood
- Abstract summary: We first examine the existence of the entrainment phenomenon in human-to-human dialogues.
The analysis results show strong evidence of entrainment in terms of both acoustic and emotion features.
We implement two entrainment policies and assess whether integrating the entrainment principle into a Text-to-Speech (TTS) system improves synthesis performance and the user experience.
- Score: 8.933468765800518
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Entrainment is the phenomenon by which an interlocutor adapts their
speaking style to align with their partner in conversations. It has been
observed along several dimensions, such as the acoustic, prosodic, lexical, and
syntactic ones. In this work, we explore and utilize the entrainment phenomenon
to improve spoken dialogue systems for voice assistants. We first examine the
existence of the entrainment phenomenon in human-to-human dialogues with
respect to acoustic features and then extend the analysis to emotion features.
The analysis results show strong evidence of entrainment in terms of both
acoustic and emotion features. Based on these findings, we implement two
entrainment policies and assess whether integrating the entrainment principle
into a Text-to-Speech (TTS) system improves synthesis performance and the user
experience. We find that integrating the entrainment principle into a TTS
system improves performance when considering acoustic features, while no
obvious improvement is observed when considering emotion features.
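The paper itself includes no code, but the notion of an entrainment policy can
be made concrete. Below is a minimal sketch, not the authors' implementation:
the function name, the feature set, and the interpolation weight alpha are all
assumptions. The policy nudges the agent's prosody targets part-way toward the
acoustic features measured on the user's most recent turn, producing values a
controllable TTS front end could consume.

```python
def entrainment_policy(agent_baseline, user_features, alpha=0.5):
    """Hypothetical convex-combination entrainment policy: shift the agent's
    prosody targets part-way toward the user's measured acoustic features.

    Both arguments are dicts of acoustic features, e.g.
    {"pitch_hz": ..., "energy_db": ..., "speech_rate_sps": ...}.
    alpha is the entrainment strength in [0, 1]; 0 means no adaptation.
    """
    return {
        key: (1.0 - alpha) * agent_baseline[key] + alpha * user_features[key]
        for key in agent_baseline
    }

# Example: the user spoke faster and higher-pitched than the agent's default.
baseline = {"pitch_hz": 180.0, "energy_db": -23.0, "speech_rate_sps": 3.8}
user_turn = {"pitch_hz": 210.0, "energy_db": -20.0, "speech_rate_sps": 4.6}
print(entrainment_policy(baseline, user_turn))  # targets for a controllable TTS
```

A convex combination keeps the agent's voice recognizably its own while still
moving toward the user, which is one simple way to encode partial entrainment.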
Related papers
- Acknowledgment of Emotional States: Generating Validating Responses for Empathetic Dialogue [21.621844911228315]
This study introduces the first framework designed to engender empathetic dialogue with validating responses.
Our approach incorporates a tripartite module system: 1) validation timing detection, 2) users' emotional state identification, and 3) validating response generation.
arXiv Detail & Related papers (2024-02-20T07:20:03Z)
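As a rough illustration of how the tripartite design above could be wired
together, here is a hypothetical sketch; the function names and the toy
stand-in models are assumptions, not the paper's components.

```python
def validating_response_pipeline(history, detect_timing, identify_emotion, generate):
    """Glue code for the three modules: 1) validation timing detection,
    2) emotional state identification, 3) validating response generation."""
    if not detect_timing(history):
        return None  # validation is not appropriate at this turn
    emotion = identify_emotion(history[-1])
    return generate(history, emotion)

# Toy stand-ins so the sketch runs end to end.
history = ["I failed my driving test again today."]
print(validating_response_pipeline(
    history,
    detect_timing=lambda h: True,
    identify_emotion=lambda utt: "frustration",
    generate=lambda h, e: f"It makes sense to feel {e}; that sounds really hard.",
))
```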
- Layer-Wise Analysis of Self-Supervised Acoustic Word Embeddings: A Study on Speech Emotion Recognition [54.952250732643115]
We study Acoustic Word Embeddings (AWEs), a fixed-length feature derived from continuous representations, to explore their advantages in specific tasks.
AWEs have previously shown utility in capturing acoustic discriminability.
Our findings underscore the acoustic context conveyed by AWEs and showcase highly competitive Speech Emotion Recognition accuracies.
arXiv Detail & Related papers (2024-02-04T21:24:54Z)
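A common way to derive fixed-length AWEs from a layer-wise self-supervised
encoder is to mean-pool each layer's frame features over the word's time span.
The sketch below assumes that pooling choice and uses random stand-in features;
it is not necessarily the paper's recipe.

```python
import numpy as np

def layerwise_awe(frame_features, start, end):
    """frame_features: (layers, frames, dim) array from a self-supervised
    speech encoder; (start, end) are the word's frame boundaries.
    Returns one fixed-length embedding per layer by mean-pooling."""
    return frame_features[:, start:end, :].mean(axis=1)  # (layers, dim)

# Toy example: 13 layers, 50 frames, 768-dim features; word spans frames 10-24.
feats = np.random.randn(13, 50, 768)
print(layerwise_awe(feats, 10, 24).shape)  # (13, 768): one AWE per layer
```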
- Acoustic and Linguistic Representations for Speech Continuous Emotion Recognition in Call Center Conversations [2.0653090022137697]
We explore the use of pre-trained speech representations as a form of transfer learning towards the AlloSat corpus.
Our experiments confirm the large gain in performance obtained with the use of pre-trained features.
Surprisingly, we found that the linguistic content is clearly the major contributor for the prediction of satisfaction.
arXiv Detail & Related papers (2023-10-06T10:22:51Z)
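Schematically, transfer learning from pre-trained representations to continuous
emotion prediction reduces to fitting a regressor on frozen features. The
sketch below uses random stand-in data and a ridge regressor; the paper's
actual features and models differ, so this is only an assumption-laden outline.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Stand-ins: each row would be a frozen pre-trained representation (acoustic,
# linguistic, or both concatenated) for one segment; y is the continuous
# satisfaction annotation. Random data here, purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 768))
y = rng.uniform(-1.0, 1.0, size=200)

model = Ridge(alpha=1.0).fit(X[:150], y[:150])
print(model.score(X[150:], y[150:]))  # R^2 on held-out segments
```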
- Multiscale Contextual Learning for Speech Emotion Recognition in Emergency Call Center Conversations [4.297070083645049]
This paper presents a multi-scale conversational context learning approach for speech emotion recognition.
We investigated this approach on both speech transcriptions and acoustic segments.
According to our tests, the context derived from previous tokens has a more significant influence on accurate prediction than the following tokens.
arXiv Detail & Related papers (2023-08-28T20:31:45Z)
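The reported asymmetry, where preceding context helps more than following
context, suggests past-heavy context windows. A minimal sketch with
illustrative window sizes (not the paper's values):

```python
def context_window(utterances, i, n_past=4, n_future=1):
    """Build a past-heavy context around utterance i, reflecting the finding
    that previous turns contribute more to accurate emotion prediction."""
    past = utterances[max(0, i - n_past):i]
    future = utterances[i + 1:i + 1 + n_future]
    return past, utterances[i], future

turns = ["hello", "what is your emergency", "my father collapsed",
         "is he breathing", "I think so", "stay calm, help is on the way"]
past, current, future = context_window(turns, 4)
print(past, "|", current, "|", future)
```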
- Visual-Aware Text-to-Speech [101.89332968344102]
We present a new visual-aware text-to-speech (VA-TTS) task to synthesize speech conditioned on both textual inputs and visual feedback of the listener in face-to-face communication.
We devise a baseline model to fuse phoneme linguistic information and listener visual signals for speech synthesis.
arXiv Detail & Related papers (2023-06-21T05:11:39Z)
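A minimal sketch of the fusion step such a baseline implies: listener visual
features are broadcast along the phoneme sequence, concatenated with the
phoneme encodings, and projected back to the model dimension. The dimensions
and the concatenate-then-project design are assumptions, not the published
model.

```python
import torch
import torch.nn as nn

class PhonemeVisualFusion(nn.Module):
    """Fuse per-utterance visual features with a phoneme encoding sequence."""
    def __init__(self, phoneme_dim=256, visual_dim=128):
        super().__init__()
        self.proj = nn.Linear(phoneme_dim + visual_dim, phoneme_dim)

    def forward(self, phoneme_seq, visual_feat):
        # phoneme_seq: (batch, T, phoneme_dim); visual_feat: (batch, visual_dim)
        visual = visual_feat.unsqueeze(1).expand(-1, phoneme_seq.size(1), -1)
        return self.proj(torch.cat([phoneme_seq, visual], dim=-1))

fused = PhonemeVisualFusion()(torch.randn(2, 50, 256), torch.randn(2, 128))
print(fused.shape)  # torch.Size([2, 50, 256]); input to the speech decoder
```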
- Deep Learning of Segment-Level Feature Representation for Speech Emotion Recognition in Conversations [9.432208348863336]
We propose a conversational speech emotion recognition method to deal with capturing attentive contextual dependency and speaker-sensitive interactions.
First, we use a pretrained VGGish model to extract segment-based audio representation in individual utterances.
Second, an attentive bi-directional gated recurrent unit (GRU) models context-sensitive information and jointly explores intra- and inter-speaker dependencies.
arXiv Detail & Related papers (2023-02-05T16:15:46Z)
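The described pipeline, precomputed segment-level audio embeddings followed by
a contextual bidirectional GRU, can be sketched as follows; the attention and
speaker-interaction modeling are omitted, and all dimensions are illustrative
(VGGish embeddings are 128-dimensional).

```python
import torch
import torch.nn as nn

class ContextualEmotionRNN(nn.Module):
    """Bidirectional GRU over per-utterance audio embeddings, emitting
    per-utterance emotion logits. A sketch, not the paper's full model."""
    def __init__(self, feat_dim=128, hidden=64, n_emotions=4):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_emotions)

    def forward(self, segments):  # (batch, n_utterances, feat_dim)
        context, _ = self.gru(segments)  # contextualized utterance states
        return self.head(context)        # (batch, n_utterances, n_emotions)

logits = ContextualEmotionRNN()(torch.randn(2, 10, 128))
print(logits.shape)  # torch.Size([2, 10, 4])
```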
- E-ffective: A Visual Analytic System for Exploring the Emotion and Effectiveness of Inspirational Speeches [57.279044079196105]
E-ffective is a visual analytic system allowing speaking experts and novices to analyze both the role of speech factors and their contribution to effective speeches.
Two novel visualizations include E-spiral (which shows the emotional shifts in speeches in a visually compact way) and E-script (which connects speech content with key speech delivery information).
arXiv Detail & Related papers (2021-10-28T06:14:27Z)
- Reinforcement Learning for Emotional Text-to-Speech Synthesis with Improved Emotion Discriminability [82.39099867188547]
Emotional text-to-speech synthesis (ETTS) has seen much progress in recent years.
We propose a new interactive training paradigm for ETTS, denoted as i-ETTS.
We formulate an iterative training strategy with reinforcement learning to ensure the quality of i-ETTS optimization.
arXiv Detail & Related papers (2021-04-03T13:52:47Z)
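The core training idea, using an emotion recognizer's confidence on the
synthesized speech as a reinforcement reward, can be sketched with toy
stand-ins. Every component below is a schematic placeholder rather than the
i-ETTS architecture: a linear "policy" stands in for the TTS model and a
softmax stands in for the pre-trained recognizer.

```python
import torch
import torch.nn.functional as F

policy = torch.nn.Linear(16, 4)             # stand-in "TTS" picking a rendering
recognizer = lambda audio: torch.softmax(audio, dim=-1)  # stand-in "SER" model
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

texts = torch.randn(8, 16)                  # stand-in text encodings
targets = torch.randint(0, 4, (8,))         # target emotion per utterance

# One REINFORCE-style step: sample a rendering, reward it with the
# recognizer's probability of the target emotion, ascend the policy gradient.
dist = torch.distributions.Categorical(logits=policy(texts))
action = dist.sample()
audio_proxy = F.one_hot(action, 4).float()  # stand-in for synthesized audio
reward = recognizer(audio_proxy).gather(1, targets.unsqueeze(1)).squeeze(1)
loss = -(reward * dist.log_prob(action)).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"mean reward: {reward.mean():.3f}")
```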
- An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation [57.68765353264689]
Speech enhancement and speech separation are two related tasks.
Traditionally, these tasks have been tackled using signal processing and machine learning techniques; more recently, deep learning has been exploited to achieve strong performance.
arXiv Detail & Related papers (2020-08-21T17:24:09Z)
- You Impress Me: Dialogue Generation via Mutual Persona Perception [62.89449096369027]
Research in cognitive science suggests that understanding is an essential signal for a high-quality chit-chat conversation.
Motivated by this, we propose P2 Bot, a transmitter-receiver based framework with the aim of explicitly modeling understanding.
arXiv Detail & Related papers (2020-04-11T12:51:07Z)