Maia: A Real-time Non-Verbal Chat for Human-AI Interaction
- URL: http://arxiv.org/abs/2402.06385v2
- Date: Tue, 10 Dec 2024 11:27:39 GMT
- Title: Maia: A Real-time Non-Verbal Chat for Human-AI Interaction
- Authors: Dragos Costea, Alina Marcu, Cristina Lazar, Marius Leordeanu,
- Abstract summary: We propose an alternative to text-based Human-AI interaction.<n>By leveraging nonverbal visual communication, through facial expressions, head and body movements, we aim to enhance engagement.<n>Our approach is not art-specific and can be adapted to various paintings, animations, and avatars.
- Score: 10.580858171606167
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modeling face-to-face communication in computer vision, which focuses on recognizing and analyzing nonverbal cues and behaviors during interactions, serves as the foundation for our proposed alternative to text-based Human-AI interaction. By leveraging nonverbal visual communication, through facial expressions, head and body movements, we aim to enhance engagement and capture the user's attention through a novel improvisational element, that goes beyond mirroring gestures. Our goal is to track and analyze facial expressions, and other nonverbal cues in real-time, and use this information to build models that can predict and understand human behavior. Operating in real-time and requiring minimal computational resources, our approach signifies a major leap forward in making AI interactions more natural and accessible. We offer three different complementary approaches, based on retrieval, statistical, and deep learning techniques. A key novelty of our work is the integration of an artistic component atop an efficient human-computer interaction system, using art as a medium to transmit emotions. Our approach is not art-specific and can be adapted to various paintings, animations, and avatars. In our experiments, we compare state-of-the-art diffusion models as mediums for emotion translation in 2D, and our 3D avatar, Maia, that we introduce in this work, with not just facial movements but also body motions for a more natural and engaging experience. We demonstrate the effectiveness of our approach in translating AI-generated emotions into human-relatable expressions, through both human and automatic evaluation procedures, highlighting its potential to significantly enhance the naturalness and engagement of Human-AI interactions across various applications.
Related papers
- A Human Digital Twin Architecture for Knowledge-based Interactions and Context-Aware Conversations [0.9580312063277943]
Recent developments in Artificial Intelligence (AI) and Machine Learning (ML) are creating new opportunities for Human-Autonomy Teaming (HAT)
We present a real-time Human Digital Twin (HDT) architecture that integrates Large Language Models (LLMs) for knowledge reporting, answering, and recommendation, embodied in a visual interface.
The HDT acts as a visually and behaviorally realistic team member, integrated throughout the mission lifecycle, from training to deployment to after-action review.
arXiv Detail & Related papers (2025-04-04T03:56:26Z) - "Only ChatGPT gets me": An Empirical Analysis of GPT versus other Large Language Models for Emotion Detection in Text [2.6012482282204004]
This work investigates the capabilities of large language models (LLMs) in detecting and understanding human emotions through text.
By employing a methodology that involves comparisons with a state-of-the-art model on the GoEmotions dataset, we aim to gauge LLMs' effectiveness as a system for emotional analysis.
arXiv Detail & Related papers (2025-03-05T09:47:49Z) - EMOTION: Expressive Motion Sequence Generation for Humanoid Robots with In-Context Learning [10.266351600604612]
This paper introduces a framework, called EMOTION, for generating expressive motion sequences in humanoid robots.
We conduct online user studies comparing the naturalness and understandability of the motions generated by EMOTION and its human-feedback version, EMOTION++.
arXiv Detail & Related papers (2024-10-30T17:22:45Z) - Visual-Geometric Collaborative Guidance for Affordance Learning [63.038406948791454]
We propose a visual-geometric collaborative guided affordance learning network that incorporates visual and geometric cues.
Our method outperforms the representative models regarding objective metrics and visual quality.
arXiv Detail & Related papers (2024-10-15T07:35:51Z) - Multimodal Fusion with LLMs for Engagement Prediction in Natural Conversation [70.52558242336988]
We focus on predicting engagement in dyadic interactions by scrutinizing verbal and non-verbal cues, aiming to detect signs of disinterest or confusion.
In this work, we collect a dataset featuring 34 participants engaged in casual dyadic conversations, each providing self-reported engagement ratings at the end of each conversation.
We introduce a novel fusion strategy using Large Language Models (LLMs) to integrate multiple behavior modalities into a multimodal transcript''
arXiv Detail & Related papers (2024-09-13T18:28:12Z) - Disentangled Interaction Representation for One-Stage Human-Object
Interaction Detection [70.96299509159981]
Human-Object Interaction (HOI) detection is a core task for human-centric image understanding.
Recent one-stage methods adopt a transformer decoder to collect image-wide cues that are useful for interaction prediction.
Traditional two-stage methods benefit significantly from their ability to compose interaction features in a disentangled and explainable manner.
arXiv Detail & Related papers (2023-12-04T08:02:59Z) - Real-time Addressee Estimation: Deployment of a Deep-Learning Model on
the iCub Robot [52.277579221741746]
Addressee Estimation is a skill essential for social robots to interact smoothly with humans.
Inspired by human perceptual skills, a deep-learning model for Addressee Estimation is designed, trained, and deployed on an iCub robot.
The study presents the procedure of such implementation and the performance of the model deployed in real-time human-robot interaction.
arXiv Detail & Related papers (2023-11-09T13:01:21Z) - Human-oriented Representation Learning for Robotic Manipulation [64.59499047836637]
Humans inherently possess generalizable visual representations that empower them to efficiently explore and interact with the environments in manipulation tasks.
We formalize this idea through the lens of human-oriented multi-task fine-tuning on top of pre-trained visual encoders.
Our Task Fusion Decoder consistently improves the representation of three state-of-the-art visual encoders for downstream manipulation policy-learning.
arXiv Detail & Related papers (2023-10-04T17:59:38Z) - Facial Expression Recognition using Squeeze and Excitation-powered Swin
Transformers [0.0]
We propose a framework that employs Swin Vision Transformers (SwinT) and squeeze and excitation block (SE) to address vision tasks.
Our focus was to create an efficient FER model based on SwinT architecture that can recognize facial emotions using minimal data.
We trained our model on a hybrid dataset and evaluated its performance on the AffectNet dataset, achieving an F1-score of 0.5420.
arXiv Detail & Related papers (2023-01-26T02:29:17Z) - Co-Located Human-Human Interaction Analysis using Nonverbal Cues: A
Survey [71.43956423427397]
We aim to identify the nonverbal cues and computational methodologies resulting in effective performance.
This survey differs from its counterparts by involving the widest spectrum of social phenomena and interaction settings.
Some major observations are: the most often used nonverbal cue, computational method, interaction environment, and sensing approach are speaking activity, support vector machines, and meetings composed of 3-4 persons equipped with microphones and cameras, respectively.
arXiv Detail & Related papers (2022-07-20T13:37:57Z) - BOSS: A Benchmark for Human Belief Prediction in Object-context
Scenarios [14.23697277904244]
This paper uses the combined knowledge of Theory of Mind (ToM) and Object-Context Relations to investigate methods for enhancing collaboration between humans and autonomous systems.
We propose a novel and challenging multimodal video dataset for assessing the capability of artificial intelligence (AI) systems in predicting human belief states in an object-context scenario.
arXiv Detail & Related papers (2022-06-21T18:29:17Z) - TEMOS: Generating diverse human motions from textual descriptions [53.85978336198444]
We address the problem of generating diverse 3D human motions from textual descriptions.
We propose TEMOS, a text-conditioned generative model leveraging variational autoencoder (VAE) training with human motion data.
We show that TEMOS framework can produce both skeleton-based animations as in prior work, as well more expressive SMPL body motions.
arXiv Detail & Related papers (2022-04-25T14:53:06Z) - Multi-Cue Adaptive Emotion Recognition Network [4.570705738465714]
We propose a new deep learning approach for emotion recognition based on adaptive multi-cues.
We compare the proposed approach with the state-of-art approaches in the CAER-S dataset.
arXiv Detail & Related papers (2021-11-03T15:08:55Z) - SOLVER: Scene-Object Interrelated Visual Emotion Reasoning Network [83.27291945217424]
We propose a novel Scene-Object interreLated Visual Emotion Reasoning network (SOLVER) to predict emotions from images.
To mine the emotional relationships between distinct objects, we first build up an Emotion Graph based on semantic concepts and visual features.
We also design a Scene-Object Fusion Module to integrate scenes and objects, which exploits scene features to guide the fusion process of object features with the proposed scene-based attention mechanism.
arXiv Detail & Related papers (2021-10-24T02:41:41Z) - Let's be friends! A rapport-building 3D embodied conversational agent
for the Human Support Robot [0.0]
Partial subtle mirroring of nonverbal behaviors during conversations (also known as mimicking or parallel empathy) is essential for rapport building.
Our research question is whether integrating an ECA able to mirror its interlocutor's facial expressions and head movements with a human-service robot will improve the user's experience.
Our contribution is the complex integration of an expressive ECA, able to track its interlocutor's face, and to mirror his/her facial expressions and head movements in real time, integrated with a human support robot.
arXiv Detail & Related papers (2021-03-08T01:02:41Z) - You Impress Me: Dialogue Generation via Mutual Persona Perception [62.89449096369027]
The research in cognitive science suggests that understanding is an essential signal for a high-quality chit-chat conversation.
Motivated by this, we propose P2 Bot, a transmitter-receiver based framework with the aim of explicitly modeling understanding.
arXiv Detail & Related papers (2020-04-11T12:51:07Z) - Learning Human-Object Interaction Detection using Interaction Points [140.0200950601552]
We propose a novel fully-convolutional approach that directly detects the interactions between human-object pairs.
Our network predicts interaction points, which directly localize and classify the inter-action.
Experiments are performed on two popular benchmarks: V-COCO and HICO-DET.
arXiv Detail & Related papers (2020-03-31T08:42:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.