Spot The Bot: A Robust and Efficient Framework for the Evaluation of
Conversational Dialogue Systems
- URL: http://arxiv.org/abs/2010.02140v1
- Date: Mon, 5 Oct 2020 16:37:52 GMT
- Title: Spot The Bot: A Robust and Efficient Framework for the Evaluation of
Conversational Dialogue Systems
- Authors: Jan Deriu and Don Tuggener and Pius von Däniken and Jon Ander Campos
and Alvaro Rodrigo and Thiziri Belkacem and Aitor Soroa and Eneko Agirre and
Mark Cieliebak
- Abstract summary: \emph{Spot The Bot} replaces human-bot conversations with conversations between bots.
Human judges only annotate for each entity in a conversation whether they think it is human or not.
\emph{Survival Analysis} measures which bot can uphold human-like behavior the longest.
- Score: 21.36935947626793
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The lack of time-efficient and reliable evaluation methods hampers the
development of conversational dialogue systems (chatbots). Evaluations
requiring humans to converse with chatbots are time and cost-intensive, put
high cognitive demands on the human judges, and yield low-quality results. In
this work, we introduce \emph{Spot The Bot}, a cost-efficient and robust
evaluation framework that replaces human-bot conversations with conversations
between bots. Human judges then only annotate for each entity in a conversation
whether they think it is human or not (assuming there are human participants
in these conversations). These annotations then allow us to rank chatbots
regarding their ability to mimic the conversational behavior of humans. Since
we expect that all bots are eventually recognized as such, we incorporate a
metric that measures which chatbot can uphold human-like behavior the longest,
i.e., \emph{Survival Analysis}. This metric has the ability to correlate a
bot's performance to certain of its characteristics (e.g., fluency or
sensibleness), yielding interpretable results. The comparably low cost of our
framework allows for frequent evaluations of chatbots during their development
cycle. We empirically validate our claims by applying \emph{Spot The Bot} to
three domains, evaluating several state-of-the-art chatbots, and drawing
comparisons to related work. The framework is released as a ready-to-use tool.
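The survival metric described above can be sketched as a Kaplan-Meier estimate of the probability that a bot is still judged "human" after a given number of exchanges. This is a minimal illustrative sketch, not the paper's released tool; the data, function name, and censoring convention below are assumptions for demonstration only.

```python
def kaplan_meier(durations, spotted):
    """Return (time, survival probability) pairs for a single bot.

    durations: exchange count at which annotation ended for each conversation
    spotted:   True if the bot was recognized at that exchange (an "event"),
               False if the conversation ended unspotted (a "censored" case)
    """
    event_times = sorted(set(d for d, s in zip(durations, spotted) if s))
    surv, curve = 1.0, []
    for t in event_times:
        at_risk = sum(1 for d in durations if d >= t)            # still unspotted at t
        events = sum(1 for d, s in zip(durations, spotted) if d == t and s)
        surv *= 1.0 - events / at_risk                            # KM product step
        curve.append((t, surv))
    return curve

# Ten mock conversations: exchange at which the bot was spotted, or the
# final exchange (censored) if it was never spotted.
durations = [2, 3, 3, 4, 5, 5, 5, 5, 5, 5]
spotted = [True, True, True, True, True, False, False, False, False, False]
for t, s in kaplan_meier(durations, spotted):
    print(f"P(still judged human after exchange {t}) = {s:.2f}")
```

A bot whose curve decays more slowly upholds human-like behavior longer, which is the basis for the ranking.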
Related papers
- LLM Roleplay: Simulating Human-Chatbot Interaction [52.03241266241294]
We propose a goal-oriented, persona-based method to automatically generate diverse multi-turn dialogues simulating human-chatbot interaction.
Our method can simulate human-chatbot dialogues with a high indistinguishability rate.
arXiv Detail & Related papers (2024-07-04T14:49:46Z)
- PLACES: Prompting Language Models for Social Conversation Synthesis [103.94325597273316]
We use a small set of expert-written conversations as in-context examples to synthesize a social conversation dataset using prompting.
We perform several thorough evaluations of our synthetic conversations compared to human-collected conversations.
arXiv Detail & Related papers (2023-02-07T05:48:16Z)
- Neural Generation Meets Real People: Building a Social, Informative Open-Domain Dialogue Agent [65.68144111226626]
Chirpy Cardinal aims to be both informative and conversational.
We let both the user and bot take turns driving the conversation.
Chirpy Cardinal placed second out of nine bots in the Alexa Prize Socialbot Grand Challenge.
arXiv Detail & Related papers (2022-07-25T09:57:23Z)
- A Deep Learning Approach to Integrate Human-Level Understanding in a Chatbot [0.4632366780742501]
Unlike humans, chatbots can serve multiple customers at a time, are available 24/7, and reply within a fraction of a second.
We performed sentiment analysis, emotion detection, intent classification and named-entity recognition using deep learning to develop chatbots with humanistic understanding and intelligence.
arXiv Detail & Related papers (2021-12-31T22:26:41Z)
- EmpBot: A T5-based Empathetic Chatbot focusing on Sentiments [75.11753644302385]
Empathetic conversational agents should not only understand what is being discussed, but also acknowledge the implied feelings of the conversation partner.
We propose a method based on a transformer pretrained language model (T5).
We evaluate our model on the EmpatheticDialogues dataset using both automated metrics and human evaluation.
arXiv Detail & Related papers (2021-10-30T19:04:48Z)
- CheerBots: Chatbots toward Empathy and Emotion using Reinforcement Learning [60.348822346249854]
This study presents a framework whereby several empathetic chatbots are based on understanding users' implied feelings and replying empathetically for multiple dialogue turns.
We call these chatbots CheerBots. CheerBots can be retrieval-based or generative-based and were finetuned by deep reinforcement learning.
To respond empathetically, we develop a simulating agent, the Conceptual Human Model, which aids CheerBots during training by anticipating changes in the user's future emotional states to arouse sympathy.
arXiv Detail & Related papers (2021-10-08T07:44:47Z)
- Addressing Inquiries about History: An Efficient and Practical Framework for Evaluating Open-domain Chatbot Consistency [28.255324166852535]
We propose the Addressing Inquiries about History (AIH) framework for the consistency evaluation.
At the conversation stage, AIH poses appropriate inquiries about the dialogue history to induce the chatbots to redeclare historical facts or opinions.
At the contradiction recognition stage, we can either employ human judges or a natural language inference (NLI) model to recognize whether the answers to the inquiries contradict the history.
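The NLI-based contradiction check can be sketched as follows. The `nli_label` function here is a deliberately crude stand-in (a real system would call a trained NLI model returning entailment/neutral/contradiction); the function names and the negation heuristic are illustrative assumptions, not the AIH paper's implementation.

```python
def nli_label(premise, hypothesis):
    """Stub NLI scorer: flags a contradiction when exactly one of the two
    sentences contains a negation word. A real pipeline would replace this
    with a trained NLI model."""
    negations = {"not", "no", "never", "don't", "doesn't", "didn't"}
    p = set(premise.lower().split())
    h = set(hypothesis.lower().split())
    # symmetric difference: words present in one sentence but not the other
    return "contradiction" if (p ^ h) & negations else "neutral"

def is_inconsistent(history_fact, inquiry_answer):
    """Flag the bot as inconsistent if its answer to an inquiry
    contradicts a fact it stated earlier in the dialogue."""
    return nli_label(history_fact, inquiry_answer) == "contradiction"

print(is_inconsistent("I have a dog.", "I don't have a dog."))   # True
print(is_inconsistent("I have a dog.", "My dog is named Rex."))  # False
```

Swapping the stub for an actual NLI model keeps the rest of the evaluation loop unchanged, which is what makes the framework practical to automate.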
arXiv Detail & Related papers (2021-06-04T03:04:13Z)
- Put Chatbot into Its Interlocutor's Shoes: New Framework to Learn Chatbot Responding with Intention [55.77218465471519]
This paper proposes an innovative framework to train chatbots to possess human-like intentions.
Our framework includes a guiding robot and an interlocutor model that plays the role of a human.
We examine our framework using three experimental setups and evaluate the guiding robot with four different metrics to demonstrate its flexibility and performance advantages.
arXiv Detail & Related papers (2021-03-30T15:24:37Z)
- FinChat: Corpus and evaluation setup for Finnish chat conversations on everyday topics [15.94497202872835]
We describe our collection efforts to create the Finnish chat conversation corpus FinChat, made available publicly.
FinChat includes unscripted conversations on seven topics from people of different ages.
In a human evaluation, chatbot-generated responses to questions from the evaluation set are predominantly marked as incoherent.
arXiv Detail & Related papers (2020-08-19T07:58:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.