AIVA: An AI-based Virtual Companion for Emotion-aware Interaction
- URL: http://arxiv.org/abs/2509.03212v1
- Date: Wed, 03 Sep 2025 11:00:46 GMT
- Title: AIVA: An AI-based Virtual Companion for Emotion-aware Interaction
- Authors: Chenxi Li
- Abstract summary: AIVA is an AI-based virtual companion that captures multimodal sentiment cues. AIVA provides a framework for emotion-aware agents with applications in companion robotics, social care, mental health, and human-centered AI.
- Score: 10.811567597962453
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent advances in Large Language Models (LLMs) have significantly improved natural language understanding and generation, enhancing Human-Computer Interaction (HCI). However, LLMs are limited to unimodal text processing and lack the ability to interpret emotional cues from non-verbal signals, hindering more immersive and empathetic interactions. This work explores integrating multimodal sentiment perception into LLMs to create emotion-aware agents. We propose AIVA, an AI-based virtual companion that captures multimodal sentiment cues, enabling emotionally aligned and animated HCI. AIVA introduces a Multimodal Sentiment Perception Network (MSPN) using a cross-modal fusion transformer and supervised contrastive learning to provide emotional cues. Additionally, we develop an emotion-aware prompt engineering strategy for generating empathetic responses and integrate a Text-to-Speech (TTS) system and animated avatar module for expressive interactions. AIVA provides a framework for emotion-aware agents with applications in companion robotics, social care, mental health, and human-centered AI.
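The abstract names two concrete mechanisms: supervised contrastive learning inside MSPN and an emotion-aware prompting strategy. The sketch below is illustrative rather than the authors' implementation: it pairs a standard supervised contrastive loss (Khosla et al., 2020) over fused multimodal embeddings with a hypothetical prompt builder that injects the emotion predicted by MSPN into the LLM's system message. Function names, the message format, and the default temperature are assumptions.

```python
import torch
import torch.nn.functional as F


def supervised_contrastive_loss(embeddings: torch.Tensor,
                                labels: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Standard supervised contrastive loss over L2-normalized embeddings.

    embeddings: (N, D) fused multimodal sentiment embeddings (illustrative).
    labels:     (N,) integer emotion class ids.
    """
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.T / temperature                                  # pairwise cosine similarity
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))              # drop self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)   # log-softmax over non-self pairs
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    pos_count = pos_mask.sum(dim=1).clamp(min=1)
    loss = -(log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1) / pos_count)
    return loss.mean()


def build_emotion_aware_prompt(emotion: str, user_text: str) -> list[dict]:
    """Hypothetical emotion-aware prompt: fold the predicted emotion into the
    system message so the LLM can reply empathetically."""
    system = (f"The user currently appears {emotion}. "
              "Acknowledge their emotional state and respond with empathy.")
    return [{"role": "system", "content": system},
            {"role": "user", "content": user_text}]
```

In a full pipeline, the embeddings would come from the cross-modal fusion transformer and the predicted label would also drive the TTS and avatar modules; those components are omitted here.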
Related papers
- Bridging Speech, Emotion, and Motion: a VLM-based Multimodal Edge-deployable Framework for Humanoid Robots [7.665995147018354]
We present SeM$^2$, a Vision Language Model-based framework that orchestrates emotionally coherent multimodal interactions.
We implement both cloud-based and edge-deployed versions (SeM$^2_e$), with the latter knowledge-distilled to operate efficiently on edge hardware.
Comprehensive evaluations demonstrate that our approach significantly outperforms unimodal baselines in naturalness, emotional clarity, and modal coherence.
arXiv Detail & Related papers (2026-02-07T08:32:54Z)
- Emotion-Coherent Reasoning for Multimodal LLMs via Emotional Rationale Verifier [53.55996102181836]
We propose the Emotional Rationale Verifier (ERV) and an Explanation Reward.
Our method guides the model to produce reasoning that is explicitly consistent with the target emotion.
We show that our approach not only enhances alignment between explanation and prediction but also empowers MLLMs to deliver emotionally coherent, trustworthy interactions.
arXiv Detail & Related papers (2025-10-27T16:40:17Z)
- Teaching AI to Feel: A Collaborative, Full-Body Exploration of Emotive Communication [0.0]
Commonaiverse is an interactive installation exploring human emotions through full-body motion tracking and real-time AI feedback.
We discuss how this collaborative, out-of-the-box approach pushes multimedia research toward a more embodied, co-created paradigm of emotional AI.
arXiv Detail & Related papers (2025-09-26T10:28:56Z)
- EmoCAST: Emotional Talking Portrait via Emotive Text Description [56.42674612728354]
EmoCAST is a diffusion-based framework for precise text-driven emotional synthesis.
In appearance modeling, emotional prompts are integrated through a text-guided decoupled emotive module.
EmoCAST achieves state-of-the-art performance in generating realistic, emotionally expressive, and audio-synchronized talking-head videos.
arXiv Detail & Related papers (2025-08-28T10:02:06Z)
- UDDETTS: Unifying Discrete and Dimensional Emotions for Controllable Emotional Text-to-Speech [61.989360995528905]
We propose UDDETTS, a universal framework unifying discrete and dimensional emotions for controllable emotional TTS.
This model introduces the interpretable Arousal-Dominance-Valence (ADV) space for dimensional emotion description and supports emotion control driven by either discrete emotion labels or nonlinearly quantified ADV values.
Experiments show that UDDETTS achieves linear emotion control along three interpretable dimensions and exhibits superior end-to-end emotional speech synthesis capabilities.
arXiv Detail & Related papers (2025-05-15T12:57:19Z)
- AI with Emotions: Exploring Emotional Expressions in Large Language Models [0.0]
Large Language Models (LLMs) role-play as agents answering questions with specified emotional states.
Russell's Circumplex model characterizes emotions along the sleepy-activated (arousal) and pleasure-displeasure (valence) axes.
Evaluation showed that the emotional states of the generated answers were consistent with the specifications. (A toy valence-arousal mapping in this spirit is sketched after this list.)
arXiv Detail & Related papers (2025-04-20T18:49:25Z)
- Leveraging LLMs with Iterative Loop Structure for Enhanced Social Intelligence in Video Question Answering [13.775516653315103]
Social intelligence is essential for effective communication and adaptive responses.
Current video-based methods for social intelligence rely on general video recognition or emotion recognition techniques.
We propose the Looped Video Debating framework, which integrates Large Language Models with visual information.
arXiv Detail & Related papers (2025-03-27T06:14:21Z)
- Toward a Dialogue System Using a Large Language Model to Recognize User Emotions with a Camera [0.0]
Methods for AI agents to recognize emotions from the user's facial expressions have not been studied.
We examined whether LLM-based AI agents can interact with users according to their emotional states by capturing the user in dialogue with a camera.
Results confirmed that AI agents can hold conversations matched to the user's emotional state for emotions recognized with relatively high scores, such as Happy and Angry.
arXiv Detail & Related papers (2024-08-15T07:03:00Z)
- EmoLLM: Multimodal Emotional Understanding Meets Large Language Models [61.179731667080326]
Multi-modal large language models (MLLMs) have achieved remarkable performance on objective multimodal perception tasks.
But their ability to interpret subjective, emotionally nuanced multimodal content remains largely unexplored.
EmoLLM is a novel model for multimodal emotional understanding that incorporates two core techniques.
arXiv Detail & Related papers (2024-06-24T08:33:02Z)
- Agent AI: Surveying the Horizons of Multimodal Interaction [83.18367129924997]
"Agent AI" is a class of interactive systems that can perceive visual stimuli, language inputs, and other environmentally-grounded data.
We envision a future where people can easily create any virtual reality or simulated scene and interact with agents embodied within the virtual environment.
arXiv Detail & Related papers (2024-01-07T19:11:18Z)
- Attention-based Interactive Disentangling Network for Instance-level Emotional Voice Conversion [81.1492897350032]
Emotional Voice Conversion aims to manipulate speech according to a given emotion while preserving non-emotion components.
We propose an Attention-based Interactive diseNtangling Network (AINN) that leverages instance-wise emotional knowledge for voice conversion.
arXiv Detail & Related papers (2023-12-29T08:06:45Z)
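The "AI with Emotions" entry above specifies target emotions via Russell's Circumplex model (arousal and valence axes). As a purely illustrative aid, with thresholds and label wording that are assumptions rather than anything taken from that paper, the toy function below maps a valence-arousal coordinate to a coarse circumplex quadrant label of the kind such role-play prompts could use.

```python
import math


def circumplex_quadrant(valence: float, arousal: float) -> str:
    """Coarse quadrant label on Russell's Circumplex.

    valence: pleasure-displeasure axis in [-1, 1]
    arousal: activated-sleepy axis in [-1, 1]
    """
    angle = math.degrees(math.atan2(arousal, valence)) % 360
    if angle < 90:
        return "excited / delighted"   # positive valence, high arousal
    if angle < 180:
        return "angry / distressed"    # negative valence, high arousal
    if angle < 270:
        return "sad / depressed"       # negative valence, low arousal
    return "calm / relaxed"            # positive valence, low arousal


print(circumplex_quadrant(0.7, 0.5))    # excited / delighted
print(circumplex_quadrant(-0.6, -0.4))  # sad / depressed
```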
This list is automatically generated from the titles and abstracts of the papers on this site.