AV-EmoDialog: Chat with Audio-Visual Users Leveraging Emotional Cues
- URL: http://arxiv.org/abs/2412.17292v1
- Date: Mon, 23 Dec 2024 05:24:26 GMT
- Title: AV-EmoDialog: Chat with Audio-Visual Users Leveraging Emotional Cues
- Authors: Se Jin Park, Yeonju Kim, Hyeongseop Rha, Bella Godiva, Yong Man Ro
- Abstract summary: We present AV-EmoDialog, a dialogue system designed to exploit verbal and non-verbal information from users' audio-visual inputs to generate more responsive and empathetic interactions.
AV-EmoDialog systematically exploits the emotional cues in audio-visual dialogues by extracting speech content and emotional tones from speech, analyzing fine-grained facial expressions from visuals, and integrating these cues to generate emotionally aware responses in an end-to-end manner.
- Score: 37.96886343501444
- License:
- Abstract: In human communication, both verbal and non-verbal cues play a crucial role in conveying emotions, intentions, and meaning beyond words alone. Non-linguistic information, such as facial expressions, eye contact, voice tone, and pitch, is a fundamental element of effective interaction, enriching conversations by adding emotional and contextual depth. Recognizing the importance of non-linguistic content in communication, we present AV-EmoDialog, a dialogue system designed to exploit verbal and non-verbal information from users' audio-visual inputs to generate more responsive and empathetic interactions. AV-EmoDialog systematically exploits the emotional cues in audio-visual dialogues by extracting speech content and emotional tones from speech, analyzing fine-grained facial expressions from visuals, and integrating these cues to generate emotionally aware responses in an end-to-end manner. Through extensive experiments, we validate that the proposed AV-EmoDialog outperforms existing multimodal LLMs in generating responses that are both emotionally and contextually appropriate.
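A minimal sketch of the end-to-end flow described in the abstract, assuming a PyTorch-style implementation: a speech branch encodes content and emotional tone, a visual branch encodes facial-expression features, and the fused token sequence conditions a response generator. All module names, dimensions, and the small linear stand-ins for the real encoders and the LLM decoder are hypothetical; this is not the paper's released architecture.

```python
# Schematic of an end-to-end audio-visual emotional dialogue pipeline.
# Module names, dimensions, and the tiny linear stand-ins are hypothetical;
# this only mirrors the flow described in the abstract.
import torch
import torch.nn as nn


class AudioEncoder(nn.Module):
    """Maps speech features to tokens carrying content and emotional tone."""

    def __init__(self, in_dim=128, dim=768):
        super().__init__()
        self.proj = nn.Linear(in_dim, dim)   # stand-in for a real speech encoder

    def forward(self, mel):                   # mel: (batch, frames, in_dim)
        return self.proj(mel)                 # (batch, frames, dim)


class VisualEncoder(nn.Module):
    """Maps per-frame face features to tokens describing facial expressions."""

    def __init__(self, in_dim=512, dim=768):
        super().__init__()
        self.proj = nn.Linear(in_dim, dim)    # stand-in for a face/expression backbone

    def forward(self, face_feats):            # face_feats: (batch, T, in_dim)
        return self.proj(face_feats)          # (batch, T, dim)


class AVEmoDialogSketch(nn.Module):
    """Fuses audio and visual tokens and feeds them to a response generator."""

    def __init__(self, dim=768, vocab=32000):
        super().__init__()
        self.audio_enc = AudioEncoder(dim=dim)
        self.visual_enc = VisualEncoder(dim=dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(dim, vocab)   # stand-in for an LLM decoder

    def forward(self, mel, face_feats):
        tokens = torch.cat([self.audio_enc(mel), self.visual_enc(face_feats)], dim=1)
        fused = self.fusion(tokens)            # joint audio-visual context
        return self.lm_head(fused)             # next-token logits for the reply


if __name__ == "__main__":
    model = AVEmoDialogSketch()
    mel = torch.randn(1, 200, 128)             # ~2 s of mel-spectrogram frames
    face_feats = torch.randn(1, 8, 512)        # 8 sampled face-crop embeddings
    print(model(mel, face_feats).shape)        # torch.Size([1, 208, 32000])
```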
Related papers
- Leveraging Chain of Thought towards Empathetic Spoken Dialogue without Corresponding Question-Answering Data [33.85748258158527]
Empathetic dialogue is crucial for natural human-computer interaction.
Large language models (LLMs) have revolutionized dialogue generation by harnessing their powerful capabilities.
We propose a novel approach that circumvents the need for question-answering data.
arXiv Detail & Related papers (2025-01-19T04:10:53Z) - Personality-affected Emotion Generation in Dialog Systems [67.40609683389947]
We propose a new task, Personality-affected Emotion Generation, to generate emotion based on the personality given to the dialog system.
We analyze the challenges in this task, i.e., (1) heterogeneously integrating personality and emotional factors and (2) extracting multi-granularity emotional information in the dialog context.
Results suggest that our method improves emotion generation performance by 13% in macro-F1 and 5% in weighted-F1 over the BERT-base model.
arXiv Detail & Related papers (2024-04-03T08:48:50Z) - Attention-based Interactive Disentangling Network for Instance-level Emotional Voice Conversion [81.1492897350032]
Emotional Voice Conversion aims to manipulate speech according to a given emotion while preserving the non-emotion components.
We propose an Attention-based Interactive diseNtangling Network (AINN) that leverages instance-wise emotional knowledge for voice conversion.
arXiv Detail & Related papers (2023-12-29T08:06:45Z) - Emotion Rendering for Conversational Speech Synthesis with Heterogeneous Graph-Based Context Modeling [50.99252242917458]
Conversational Speech Synthesis (CSS) aims to accurately express an utterance with the appropriate prosody and emotional inflection within a conversational setting.
To address the issue of data scarcity, we meticulously create emotional labels in terms of category and intensity.
Our model outperforms the baseline models in understanding and rendering emotions.
arXiv Detail & Related papers (2023-12-19T08:47:50Z) - Deep Learning of Segment-level Feature Representation for Speech Emotion Recognition in Conversations [9.432208348863336]
We propose a conversational speech emotion recognition method that captures attentive contextual dependencies and speaker-sensitive interactions.
First, we use a pretrained VGGish model to extract segment-based audio representations for individual utterances.
Second, an attentive bi-directional gated recurrent unit (GRU) models context-sensitive information and explores intra- and inter-speaker dependencies jointly.
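A minimal sketch of this combination under assumed dimensions: pre-extracted 128-d VGGish-style segment embeddings (stubbed with random tensors here, since the actual feature extraction is omitted) are passed through a bi-directional GRU, attention-pooled, and classified into emotion categories. Layer sizes and the number of emotion classes are assumptions, not the paper's configuration.

```python
# Sketch: segment-level VGGish-style embeddings -> attentive bi-directional GRU
# -> emotion classifier. Dimensions and layer choices are assumptions.
import torch
import torch.nn as nn


class AttentiveBiGRUSER(nn.Module):
    def __init__(self, feat_dim=128, hidden=128, num_emotions=4):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, bidirectional=True, batch_first=True)
        self.attn = nn.Linear(2 * hidden, 1)         # scores each segment
        self.classifier = nn.Linear(2 * hidden, num_emotions)

    def forward(self, segments):                      # (batch, num_segments, feat_dim)
        context, _ = self.gru(segments)               # (batch, num_segments, 2*hidden)
        weights = torch.softmax(self.attn(context), dim=1)
        pooled = (weights * context).sum(dim=1)       # attention-weighted summary
        return self.classifier(pooled)                # emotion logits


if __name__ == "__main__":
    # Stand-in for 10 segment embeddings (VGGish produces 128-d vectors).
    feats = torch.randn(2, 10, 128)
    print(AttentiveBiGRUSER()(feats).shape)           # torch.Size([2, 4])
```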
arXiv Detail & Related papers (2023-02-05T16:15:46Z) - CPED: A Large-Scale Chinese Personalized and Emotional Dialogue Dataset for Conversational AI [48.67259855309959]
Most existing datasets for conversational AI ignore human personalities and emotions.
We propose CPED, a large-scale Chinese personalized and emotional dialogue dataset.
CPED contains more than 12K dialogues of 392 speakers from 40 TV shows.
arXiv Detail & Related papers (2022-05-29T17:45:12Z) - EmoWOZ: A Large-Scale Corpus and Labelling Scheme for Emotion in Task-Oriented Dialogue Systems [3.3010169113961325]
EmoWOZ is a large-scale manually emotion-annotated corpus of task-oriented dialogues.
It contains more than 11K dialogues with more than 83K emotion annotations of user utterances.
We propose a novel emotion labelling scheme, which is tailored to task-oriented dialogues.
arXiv Detail & Related papers (2021-09-10T15:00:01Z) - Infusing Multi-Source Knowledge with Heterogeneous Graph Neural Network for Emotional Conversation Generation [25.808037796936766]
In a real-world conversation, we instinctively perceive emotions from multi-source information.
We propose a heterogeneous graph-based model for emotional conversation generation.
Experimental results show that our model can effectively perceive emotions from multi-source knowledge.
arXiv Detail & Related papers (2020-12-09T06:09:31Z) - Knowledge Bridging for Empathetic Dialogue Generation [52.39868458154947]
The lack of external knowledge makes it difficult for empathetic dialogue systems to perceive implicit emotions and to learn emotional interactions from limited dialogue history.
We propose to leverage external knowledge, including commonsense knowledge and emotional lexical knowledge, to explicitly understand and express emotions in empathetic dialogue generation.
arXiv Detail & Related papers (2020-09-21T09:21:52Z) - Speech Driven Talking Face Generation from a Single Image and an Emotion Condition [28.52180268019401]
We propose a novel approach to rendering visual emotion expression in speech-driven talking face generation.
We design an end-to-end talking face generation system that takes a speech utterance, a single face image, and a categorical emotion label as input.
Objective evaluation on image quality, audiovisual synchronization, and visual emotion expression shows that the proposed system outperforms a state-of-the-art baseline system.
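A hypothetical interface matching the inputs and outputs named in this summary: a speech waveform, a single identity image, and a categorical emotion label go in, and a sequence of video frames comes out. The emotion set and the function body are placeholders, not the paper's generator.

```python
# Hypothetical interface for an emotion-conditioned talking face generator.
# The generator body is a placeholder; only the I/O mirrors the summary.
import numpy as np

EMOTIONS = ["neutral", "happy", "sad", "angry"]  # assumed label set


def generate_talking_face(speech_wav: np.ndarray,
                          face_image: np.ndarray,
                          emotion: str,
                          fps: int = 25,
                          sample_rate: int = 16000) -> np.ndarray:
    """Return video frames (T, H, W, 3) lip-synced to `speech_wav` and
    expressing `emotion`, starting from a single `face_image` (H, W, 3)."""
    assert emotion in EMOTIONS
    num_frames = int(len(speech_wav) / sample_rate * fps)
    # Placeholder: a real model would predict per-frame faces from audio
    # features, the identity image, and an emotion embedding.
    return np.repeat(face_image[None], num_frames, axis=0)


if __name__ == "__main__":
    wav = np.zeros(16000, dtype=np.float32)          # 1 s of silence
    face = np.zeros((224, 224, 3), dtype=np.uint8)   # identity frame
    print(generate_talking_face(wav, face, "happy").shape)  # (25, 224, 224, 3)
```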
arXiv Detail & Related papers (2020-08-08T20:46:31Z)