Towards Objective Evaluation of Socially-Situated Conversational Robots:
Assessing Human-Likeness through Multimodal User Behaviors
- URL: http://arxiv.org/abs/2308.11020v2
- Date: Mon, 25 Sep 2023 12:10:30 GMT
- Title: Towards Objective Evaluation of Socially-Situated Conversational Robots: Assessing Human-Likeness through Multimodal User Behaviors
- Authors: Koji Inoue, Divesh Lala, Keiko Ochi, Tatsuya Kawahara, Gabriel Skantze
- Abstract summary: This paper focuses on assessing the human-likeness of the robot as the primary evaluation metric.
Our approach evaluates the robot's human-likeness indirectly, through observable user behaviors, thus enhancing objectivity and reproducibility.
- Score: 26.003947740875482
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper tackles the challenging task of evaluating socially situated
conversational robots and presents a novel objective evaluation approach that
relies on multimodal user behaviors. In this study, our main focus is on
assessing the human-likeness of the robot as the primary evaluation metric.
While previous research often relied on subjective evaluations from users, our
approach aims to evaluate the robot's human-likeness indirectly, through
observable user behaviors, thus enhancing objectivity and reproducibility. To begin,
we created an annotated dataset of human-likeness scores, utilizing user
behaviors found in an attentive listening dialogue corpus. We then conducted an
analysis to determine the correlation between multimodal user behaviors and
human-likeness scores, demonstrating the feasibility of our proposed
behavior-based evaluation method.
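The analysis step the abstract describes lends itself to a small sketch: given per-dialogue behavior features and annotated human-likeness scores, compute a rank correlation per feature. The feature names and file layout below are hypothetical placeholders, not the paper's actual corpus schema.

```python
# Sketch of the correlation analysis described in the abstract: per-dialogue
# multimodal user-behavior features vs. annotated human-likeness scores.
import pandas as pd
from scipy.stats import spearmanr

# Hypothetical CSV: one row per dialogue segment, columns are assumed names.
df = pd.read_csv("annotated_dialogues.csv")
behavior_features = ["backchannel_rate", "gaze_at_robot_ratio", "mean_pitch", "turn_gap_ms"]

for feat in behavior_features:
    rho, p = spearmanr(df[feat], df["human_likeness_score"])
    print(f"{feat}: rho={rho:.3f} (p={p:.3g})")
```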
Related papers
- Multi-turn Evaluation of Anthropomorphic Behaviours in Large Language Models [26.333097337393685]
The tendency of users to anthropomorphise large language models (LLMs) is of growing interest to AI developers, researchers, and policy-makers.
Here, we present a novel method for empirically evaluating anthropomorphic LLM behaviours in realistic and varied settings.
First, we develop a multi-turn evaluation of 14 anthropomorphic behaviours.
Second, we present a scalable, automated approach that employs simulations of user interactions.
Third, we conduct an interactive, large-scale human subject study (N=1101) to validate that the model behaviours we measure predict real users' anthropomorphic perceptions.
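A minimal sketch of such a simulation-driven, multi-turn evaluation loop appears below. The behaviour detectors, prompts, and model stub are illustrative stand-ins, not the paper's 14-behaviour taxonomy or prompts.

```python
# Toy multi-turn loop: a simulated user converses with a target model, and each
# reply is checked for anthropomorphic behaviours (here, crude regex detectors).
import re

ANTHRO_PATTERNS = {  # illustrative detectors, not the paper's taxonomy
    "claims_of_feeling": re.compile(r"\bI (feel|am feeling)\b", re.I),
    "claims_of_experience": re.compile(r"\bwhen I was\b", re.I),
}

def simulated_user_turn(turn_idx: int) -> str:
    prompts = ["How was your day?", "Do you ever get lonely?", "Tell me about your childhood."]
    return prompts[turn_idx % len(prompts)]

def target_model_reply(user_msg: str) -> str:
    return "I feel great today, thanks for asking!"  # stub; replace with a real model call

counts = {name: 0 for name in ANTHRO_PATTERNS}
for t in range(3):  # multi-turn dialogue
    reply = target_model_reply(simulated_user_turn(t))
    for name, pat in ANTHRO_PATTERNS.items():
        counts[name] += bool(pat.search(reply))
print(counts)
```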
arXiv Detail & Related papers (2025-02-10T22:09:57Z)
- Objective Metrics for Human-Subjects Evaluation in Explainable Reinforcement Learning [0.47355466227925036]
Explanation is a fundamentally human process. Understanding the goal and audience of the explanation is vital.
Existing work on explainable reinforcement learning (XRL) routinely does not consult humans in its evaluations.
This paper calls on researchers to use objective human metrics for explanation evaluations based on observable and actionable behaviour.
arXiv Detail & Related papers (2025-01-31T16:12:23Z)
- ConSiDERS-The-Human Evaluation Framework: Rethinking Human Evaluation for Generative Large Language Models [53.00812898384698]
We argue that human evaluation of generative large language models (LLMs) should be a multidisciplinary undertaking.
We highlight how cognitive biases can lead raters to conflate fluency with truthfulness, and how cognitive uncertainty affects the reliability of rating scores such as Likert scales.
We propose the ConSiDERS-The-Human evaluation framework consisting of 6 pillars -- Consistency, Scoring Criteria, Differentiating, User Experience, Responsible, and Scalability.
arXiv Detail & Related papers (2024-05-28T22:45:28Z)
- Beyond Static Evaluation: A Dynamic Approach to Assessing AI Assistants' API Invocation Capabilities [48.922660354417204]
We propose Automated Dynamic Evaluation (AutoDE) to assess an assistant's API call capability without human involvement.
In our framework, we endeavor to closely mirror genuine human conversation patterns in human-machine interactions.
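A hedged sketch of this kind of dynamic check: a scripted user simulator reveals information across turns, and the test passes only if the assistant eventually emits the expected API call. The assistant and simulator below are stubs, not AutoDE's implementation.

```python
# Dynamic API-invocation check: success means the assistant produces the
# expected call once the simulated user has revealed enough slots.
EXPECTED_CALL = {"name": "book_flight", "args": {"from": "OSL", "to": "NRT"}}

def user_simulator(turn: int) -> str:
    script = ["I need a flight.", "From Oslo.", "To Tokyo Narita, please."]
    return script[turn]

def assistant(history: list[str]) -> dict | None:
    # Stub: a real assistant would decide when it has enough slots filled.
    if any("Narita" in h for h in history):
        return {"name": "book_flight", "args": {"from": "OSL", "to": "NRT"}}
    return None

history: list[str] = []
success = False
for turn in range(3):
    history.append(user_simulator(turn))
    if assistant(history) == EXPECTED_CALL:
        success = True
        break
print("API call correct:", success)
```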
arXiv Detail & Related papers (2024-03-17T07:34:12Z)
- Real-time Addressee Estimation: Deployment of a Deep-Learning Model on the iCub Robot [52.277579221741746]
Addressee Estimation is a skill essential for social robots to interact smoothly with humans.
Inspired by human perceptual skills, a deep-learning model for Addressee Estimation is designed, trained, and deployed on an iCub robot.
The study presents the procedure of such implementation and the performance of the model deployed in real-time human-robot interaction.
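As a deliberately simplified stand-in for the idea (the deployed system is a deep model running on the iCub), one can frame addressee estimation as classification over a few gaze and pose features; the features and data below are synthetic.

```python
# Toy addressee classifier: is the robot being addressed in this frame?
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Features per frame: [head_yaw_toward_robot, gaze_on_robot_ratio, speaking]
X = rng.random((200, 3))
y = (0.6 * X[:, 0] + 0.4 * X[:, 1] > 0.5).astype(int)  # toy labelling rule

clf = LogisticRegression().fit(X, y)
frame = np.array([[0.9, 0.8, 1.0]])  # speaker faces the robot while talking
print("P(robot is addressee) =", clf.predict_proba(frame)[0, 1])
```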
arXiv Detail & Related papers (2023-11-09T13:01:21Z)
- Gaze-based intention estimation: principles, methodologies, and applications in HRI [0.0]
This review aims to connect insights from the psychological literature on visuomotor control with relevant applications of gaze-based intention recognition.
The use of eye tracking and gaze-based models for intent recognition in Human-Robot Interaction is considered.
arXiv Detail & Related papers (2023-02-09T09:44:13Z)
- Co-Located Human-Human Interaction Analysis using Nonverbal Cues: A Survey [71.43956423427397]
We aim to identify the nonverbal cues and computational methodologies resulting in effective performance.
This survey differs from its counterparts by involving the widest spectrum of social phenomena and interaction settings.
Some major observations are: the most often used nonverbal cue, computational method, interaction environment, and sensing approach are, respectively, speaking activity, support vector machines, meetings composed of 3-4 persons, and microphones and cameras.
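Given that pairing of speaking activity and support vector machines, a toy illustration of the survey's most common recipe might look as follows; the features and "dominant speaker" labels are synthetic placeholders.

```python
# SVM on per-participant speaking statistics, the survey's most common combination.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# Features per participant: [speaking_time_ratio, num_turns, num_interruptions]
X = rng.random((120, 3))
y = (X[:, 0] + 0.3 * X[:, 2] > 0.7).astype(int)  # toy "dominant speaker" label

svm = SVC(kernel="rbf").fit(X[:100], y[:100])
print("held-out accuracy:", svm.score(X[100:], y[100:]))
```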
arXiv Detail & Related papers (2022-07-20T13:37:57Z)
- Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
ENIGMA requires only a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during evaluation.
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
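ENIGMA's estimator itself is more sophisticated; for background only, the sketch below shows the simplest off-policy evaluation baseline it improves upon: trajectory-wise importance sampling that reweights logged human scores by target-versus-behavior policy likelihood ratios. All numbers are illustrative.

```python
# Self-normalized trajectory-wise importance sampling over logged dialogues.
import numpy as np

# Logged data: per-dialogue human score, plus per-turn action probabilities
# under the behavior policy (which collected the data) and the target policy.
scores = np.array([4.0, 2.5, 3.5])
behavior_probs = [np.array([0.5, 0.4]), np.array([0.6, 0.5]), np.array([0.3, 0.7])]
target_probs = [np.array([0.6, 0.5]), np.array([0.4, 0.5]), np.array([0.5, 0.6])]

weights = np.array([np.prod(t / b) for t, b in zip(target_probs, behavior_probs)])
estimate = np.sum(weights * scores) / np.sum(weights)  # self-normalized estimate
print("estimated human score under target policy:", round(estimate, 3))
```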
arXiv Detail & Related papers (2021-02-20T03:29:20Z)
- You Impress Me: Dialogue Generation via Mutual Persona Perception [62.89449096369027]
Research in cognitive science suggests that understanding is an essential signal for a high-quality chit-chat conversation.
Motivated by this, we propose P2 Bot, a transmitter-receiver based framework with the aim of explicitly modeling understanding.
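As a toy illustration of the "mutual persona perception" signal only (P2 Bot learns it with neural scorers and reinforcement learning), one could score how much a response reflects the partner's persona:

```python
# Bag-of-words overlap between a response and persona lines: a crude stand-in
# for the learned persona-perception reward, for intuition only.
def persona_perception_score(response: str, persona: list[str]) -> float:
    resp_words = set(response.lower().split())
    persona_words = set(" ".join(persona).lower().split())
    return len(resp_words & persona_words) / max(len(persona_words), 1)

persona = ["i love hiking in the mountains", "i have two dogs"]
print(persona_perception_score("Do your dogs join you hiking?", persona))
```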
arXiv Detail & Related papers (2020-04-11T12:51:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.