Exploring Emotion Expression Recognition in Older Adults Interacting
with a Virtual Coach
- URL: http://arxiv.org/abs/2311.05567v1
- Date: Thu, 9 Nov 2023 18:22:32 GMT
- Title: Exploring Emotion Expression Recognition in Older Adults Interacting
with a Virtual Coach
- Authors: Cristina Palmero, Mikel deVelasco, Mohamed Amine Hmani, Aymen Mtibaa,
Leila Ben Letaifa, Pau Buch-Cardona, Raquel Justo, Terry Amorese, Eduardo
González-Fraile, Begoña Fernández-Ruanova, Jofre Tenorio-Laranga, Anna
Torp Johansen, Micaela Rodrigues da Silva, Liva Jenny Martinussen, Maria
Stylianou Korsnes, Gennaro Cordasco, Anna Esposito, Mounim A. El-Yacoubi,
Dijana Petrovska-Delacrétaz, M. Inés Torres and Sergio Escalera
- Abstract summary: EMPATHIC project aimed to design an emotionally expressive virtual coach capable of engaging healthy seniors to improve well-being and promote independent aging.
This paper outlines the development of the emotion expression recognition module of the virtual coach, encompassing data collection, annotation design, and a first methodological approach.
- Score: 22.00225071959289
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The EMPATHIC project aimed to design an emotionally expressive virtual coach
capable of engaging healthy seniors to improve well-being and promote
independent aging. One of the core aspects of the system is its human sensing
capabilities, allowing for the perception of emotional states to provide a
personalized experience. This paper outlines the development of the emotion
expression recognition module of the virtual coach, encompassing data
collection, annotation design, and a first methodological approach, all
tailored to the project requirements. With the latter, we investigate the role
of various modalities, individually and combined, for discrete emotion
expression recognition in this context: speech from audio, and facial
expressions, gaze, and head dynamics from video. The collected corpus includes
users from Spain, France, and Norway, and was annotated separately for the
audio and video channels with distinct emotional labels, allowing for a
performance comparison across cultures and label types. Results confirm the
informative power of the modalities studied for the emotional categories
considered, with multimodal methods generally outperforming others (around 68%
accuracy with audio labels and 72-74% with video labels). The findings are
expected to contribute to the limited literature on emotion recognition applied
to older adults in conversational human-machine interaction.
Related papers
- Dual-path Collaborative Generation Network for Emotional Video Captioning [33.230028098522254]
Emotional Video Captioning is an emerging task that aims to describe factual content with the intrinsic emotions expressed in videos.
Existing emotional video captioning methods first perceive global visual emotional cues and then combine them with the video features to guide emotional caption generation.
We propose a dual-path collaborative generation network that dynamically perceives the evolution of visual emotional cues while generating emotional captions.
arXiv Detail & Related papers (2024-08-06T07:30:53Z)
- Emotion Rendering for Conversational Speech Synthesis with Heterogeneous Graph-Based Context Modeling [50.99252242917458]
Conversational Speech Synthesis (CSS) aims to accurately express an utterance with the appropriate prosody and emotional inflection within a conversational setting.
To address the issue of data scarcity, we meticulously create emotional labels in terms of category and intensity.
Our model outperforms the baseline models in understanding and rendering emotions.
arXiv Detail & Related papers (2023-12-19T08:47:50Z)
- Seeking Subjectivity in Visual Emotion Distribution Learning [93.96205258496697]
Visual Emotion Analysis (VEA) aims to predict people's emotions towards different visual stimuli.
Existing methods often predict visual emotion distribution in a unified network, neglecting the inherent subjectivity in its crowd voting process.
We propose a novel Subjectivity Appraise-and-Match Network (SAMNet) to investigate the subjectivity in visual emotion distribution.
arXiv Detail & Related papers (2022-07-25T02:20:03Z)
- Emotion Intensity and its Control for Emotional Voice Conversion [77.05097999561298]
Emotional voice conversion (EVC) seeks to convert the emotional state of an utterance while preserving the linguistic content and speaker identity.
In this paper, we aim to explicitly characterize and control the intensity of emotion.
We propose to disentangle the speaker style from linguistic content and encode the speaker style into a style embedding in a continuous space that forms the prototype of emotion embedding.
arXiv Detail & Related papers (2022-01-10T02:11:25Z)
- Stimuli-Aware Visual Emotion Analysis [75.68305830514007]
We propose a stimuli-aware visual emotion analysis (VEA) method consisting of three stages, namely stimuli selection, feature extraction and emotion prediction.
To the best of our knowledge, this is the first work to introduce a stimuli selection process into VEA in an end-to-end network.
Experiments demonstrate that the proposed method consistently outperforms the state-of-the-art approaches on four public visual emotion datasets.
arXiv Detail & Related papers (2021-09-04T08:14:52Z)
- Affective Image Content Analysis: Two Decades Review and New Perspectives [132.889649256384]
We comprehensively review the development of affective image content analysis (AICA) over the past two decades.
We will focus on the state-of-the-art methods with respect to three main challenges -- the affective gap, perception subjectivity, and label noise and absence.
We discuss some challenges and promising research directions in the future, such as image content and context understanding, group emotion clustering, and viewer-image interaction.
arXiv Detail & Related papers (2021-06-30T15:20:56Z)
- EmoDNN: Understanding emotions from short texts through a deep neural network ensemble [2.459874436804819]
We propose a framework that infers latent individual aspects from brief textual content.
We also present a novel ensemble classifier equipped with dynamic dropout convnets to extract emotions from textual context.
Our proposed model achieves higher performance in recognizing emotions from noisy content.
arXiv Detail & Related papers (2021-06-03T09:17:34Z)
- Infusing Multi-Source Knowledge with Heterogeneous Graph Neural Network for Emotional Conversation Generation [25.808037796936766]
In a real-world conversation, we instinctively perceive emotions from multi-source information.
We propose a heterogeneous graph-based model for emotional conversation generation.
Experimental results show that our model can effectively perceive emotions from multi-source knowledge.
arXiv Detail & Related papers (2020-12-09T06:09:31Z)
- Seen and Unseen emotional style transfer for voice conversion with a new emotional speech dataset [84.53659233967225]
Emotional voice conversion aims to transform emotional prosody in speech while preserving the linguistic content and speaker identity.
We propose a novel framework based on a variational auto-encoding Wasserstein generative adversarial network (VAW-GAN).
We show that the proposed framework achieves remarkable performance by consistently outperforming the baseline framework.
arXiv Detail & Related papers (2020-10-28T07:16:18Z)
- Speech Driven Talking Face Generation from a Single Image and an Emotion Condition [28.52180268019401]
We propose a novel approach to rendering visual emotion expression in speech-driven talking face generation.
We design an end-to-end talking face generation system that takes a speech utterance, a single face image, and a categorical emotion label as input.
Objective evaluation on image quality, audiovisual synchronization, and visual emotion expression shows that the proposed system outperforms a state-of-the-art baseline system.
arXiv Detail & Related papers (2020-08-08T20:46:31Z)
- Temporal aggregation of audio-visual modalities for emotion recognition [0.5352699766206808]
We propose a multimodal fusion technique for emotion recognition based on combining audio-visual modalities from a temporal window with different temporal offsets for each modality.
Our proposed method outperforms other methods from the literature as well as human accuracy ratings.
arXiv Detail & Related papers (2020-07-08T18:44:15Z)