AIVA: An AI-based Virtual Companion for Emotion-aware Interaction
- URL: http://arxiv.org/abs/2509.03212v1
- Date: Wed, 03 Sep 2025 11:00:46 GMT
- Title: AIVA: An AI-based Virtual Companion for Emotion-aware Interaction
- Authors: Chenxi Li
- Abstract summary: AIVA is an AI-based virtual companion that captures multimodal sentiment cues. AIVA provides a framework for emotion-aware agents with applications in companion robotics, social care, mental health, and human-centered AI.
- Score: 10.811567597962453
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent advances in Large Language Models (LLMs) have significantly improved natural language understanding and generation, enhancing Human-Computer Interaction (HCI). However, LLMs are limited to unimodal text processing and lack the ability to interpret emotional cues from non-verbal signals, hindering more immersive and empathetic interactions. This work explores integrating multimodal sentiment perception into LLMs to create emotion-aware agents. We propose AIVA, an AI-based virtual companion that captures multimodal sentiment cues, enabling emotionally aligned and animated HCI. AIVA introduces a Multimodal Sentiment Perception Network (MSPN) using a cross-modal fusion transformer and supervised contrastive learning to provide emotional cues. Additionally, we develop an emotion-aware prompt engineering strategy for generating empathetic responses and integrate a Text-to-Speech (TTS) system and animated avatar module for expressive interactions. AIVA provides a framework for emotion-aware agents with applications in companion robotics, social care, mental health, and human-centered AI.
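The abstract names two concrete mechanisms: supervised contrastive learning inside MSPN and an emotion-aware prompting strategy. The sketch below is illustrative rather than the authors' implementation: it pairs a standard supervised contrastive loss (Khosla et al., 2020) over fused multimodal embeddings with a hypothetical prompt builder that injects the emotion predicted by MSPN into the LLM's system message. Function names, the message format, and the default temperature are assumptions.

```python
import torch
import torch.nn.functional as F


def supervised_contrastive_loss(embeddings: torch.Tensor,
                                labels: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Standard supervised contrastive loss over L2-normalized embeddings.

    embeddings: (N, D) fused multimodal sentiment embeddings (illustrative).
    labels:     (N,) integer emotion class ids.
    """
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.T / temperature                                  # pairwise cosine similarity
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))              # drop self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)   # log-softmax over non-self pairs
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    pos_count = pos_mask.sum(dim=1).clamp(min=1)
    loss = -(log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1) / pos_count)
    return loss.mean()


def build_emotion_aware_prompt(emotion: str, user_text: str) -> list[dict]:
    """Hypothetical emotion-aware prompt: fold the predicted emotion into the
    system message so the LLM can reply empathetically."""
    system = (f"The user currently appears {emotion}. "
              "Acknowledge their emotional state and respond with empathy.")
    return [{"role": "system", "content": system},
            {"role": "user", "content": user_text}]
```

In a full pipeline, the embeddings would come from the cross-modal fusion transformer and the predicted label would also drive the TTS and avatar modules; those components are omitted here.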
Related papers
- Bridging Speech, Emotion, and Motion: a VLM-based Multimodal Edge-deployable Framework for Humanoid Robots [7.665995147018354]
We present SeM$^2$, a Vision Language Model-based framework that orchestrates emotionally coherent multimodal interactions.
We implement both cloud-based and edge-deployed versions (SeM$^2_e$), with the latter knowledge-distilled to operate efficiently on edge hardware.
Comprehensive evaluations demonstrate that our approach significantly outperforms unimodal baselines in naturalness, emotional clarity, and modal coherence.
arXiv Detail & Related papers (2026-02-07T08:32:54Z)
- Emotion-Coherent Reasoning for Multimodal LLMs via Emotional Rationale Verifier [53.55996102181836]
We propose the Emotional Rationale Verifier (ERV) and an Explanation Reward.
Our method guides the model to produce reasoning that is explicitly consistent with the target emotion.
We show that our approach not only enhances alignment between explanation and prediction but also empowers MLLMs to deliver emotionally coherent, trustworthy interactions.
arXiv Detail & Related papers (2025-10-27T16:40:17Z)
- Teaching AI to Feel: A Collaborative, Full-Body Exploration of Emotive Communication [0.0]
Commonaiverse is an interactive installation exploring human emotions through full-body motion tracking and real-time AI feedback.
We discuss how this collaborative, out-of-the-box approach pushes multimedia research toward a more embodied, co-created paradigm of emotional AI.
arXiv Detail & Related papers (2025-09-26T10:28:56Z)
- EmoCAST: Emotional Talking Portrait via Emotive Text Description [56.42674612728354]
EmoCAST is a diffusion-based framework for precise text-driven emotional synthesis.
In appearance modeling, emotional prompts are integrated through a text-guided decoupled emotive module.
EmoCAST achieves state-of-the-art performance in generating realistic, emotionally expressive, and audio-synchronized talking-head videos.
arXiv Detail & Related papers (2025-08-28T10:02:06Z)
- UDDETTS: Unifying Discrete and Dimensional Emotions for Controllable Emotional Text-to-Speech [61.989360995528905]
We propose UDDETTS, a universal framework unifying discrete and dimensional emotions for controllable emotional TTS.
This model introduces the interpretable Arousal-Dominance-Valence (ADV) space for dimensional emotion description and supports emotion control driven by either discrete emotion labels or nonlinearly quantified ADV values.
Experiments show that UDDETTS achieves linear emotion control along three interpretable dimensions and exhibits superior end-to-end emotional speech synthesis capabilities.
arXiv Detail & Related papers (2025-05-15T12:57:19Z)
- AI with Emotions: Exploring Emotional Expressions in Large Language Models [0.0]
Large Language Models (LLMs) role-play as agents answering questions with specified emotional states.
Russell's Circumplex model characterizes emotions along the sleepy-activated (arousal) and pleasure-displeasure (valence) axes.
Evaluation showed that the emotional states of the generated answers were consistent with the specifications. (A toy valence-arousal mapping in this spirit is sketched after this list.)
arXiv Detail & Related papers (2025-04-20T18:49:25Z)
- Leveraging LLMs with Iterative Loop Structure for Enhanced Social Intelligence in Video Question Answering [13.775516653315103]
Social intelligence is essential for effective communication and adaptive responses.
Current video-based methods for social intelligence rely on general video recognition or emotion recognition techniques.
We propose the Looped Video Debating framework, which integrates Large Language Models with visual information.
arXiv Detail & Related papers (2025-03-27T06:14:21Z)
- Toward a Dialogue System Using a Large Language Model to Recognize User Emotions with a Camera [0.0]
Methods for AI agents to recognize emotions from the user's facial expressions have not been studied.
We examined whether LLM-based AI agents can interact with users according to their emotional states by capturing the user in dialogue with a camera.
Results confirmed that AI agents can hold conversations matched to the user's emotional state for emotions recognized with relatively high scores, such as Happy and Angry.
arXiv Detail & Related papers (2024-08-15T07:03:00Z)
- EmoLLM: Multimodal Emotional Understanding Meets Large Language Models [61.179731667080326]
Multi-modal large language models (MLLMs) have achieved remarkable performance on objective multimodal perception tasks.
But their ability to interpret subjective, emotionally nuanced multimodal content remains largely unexplored.
EmoLLM is a novel model for multimodal emotional understanding that incorporates two core techniques.
arXiv Detail & Related papers (2024-06-24T08:33:02Z)
- Agent AI: Surveying the Horizons of Multimodal Interaction [83.18367129924997]
"Agent AI" is a class of interactive systems that can perceive visual stimuli, language inputs, and other environmentally-grounded data.
We envision a future where people can easily create any virtual reality or simulated scene and interact with agents embodied within the virtual environment.
arXiv Detail & Related papers (2024-01-07T19:11:18Z)
- Attention-based Interactive Disentangling Network for Instance-level Emotional Voice Conversion [81.1492897350032]
Emotional Voice Conversion aims to manipulate speech according to a given emotion while preserving non-emotion components.
We propose an Attention-based Interactive diseNtangling Network (AINN) that leverages instance-wise emotional knowledge for voice conversion.
arXiv Detail & Related papers (2023-12-29T08:06:45Z)
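The "AI with Emotions" entry above specifies target emotions via Russell's Circumplex model (arousal and valence axes). As a purely illustrative aid, with thresholds and label wording that are assumptions rather than anything taken from that paper, the toy function below maps a valence-arousal coordinate to a coarse circumplex quadrant label of the kind such role-play prompts could use.

```python
import math


def circumplex_quadrant(valence: float, arousal: float) -> str:
    """Coarse quadrant label on Russell's Circumplex.

    valence: pleasure-displeasure axis in [-1, 1]
    arousal: activated-sleepy axis in [-1, 1]
    """
    angle = math.degrees(math.atan2(arousal, valence)) % 360
    if angle < 90:
        return "excited / delighted"   # positive valence, high arousal
    if angle < 180:
        return "angry / distressed"    # negative valence, high arousal
    if angle < 270:
        return "sad / depressed"       # negative valence, low arousal
    return "calm / relaxed"            # positive valence, low arousal


print(circumplex_quadrant(0.7, 0.5))    # excited / delighted
print(circumplex_quadrant(-0.6, -0.4))  # sad / depressed
```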
This list is automatically generated from the titles and abstracts of the papers on this site.