Improving Open-Domain Dialogue Evaluation with a Causal Inference Model
- URL: http://arxiv.org/abs/2301.13372v1
- Date: Tue, 31 Jan 2023 02:31:42 GMT
- Title: Improving Open-Domain Dialogue Evaluation with a Causal Inference Model
- Authors: Cat P. Le, Luke Dai, Michael Johnston, Yang Liu, Marilyn Walker, Reza
Ghanadan
- Abstract summary: Explicit satisfaction ratings can be elicited from users, but users often do not provide ratings when asked.
Post-hoc ratings by experts are an alternative, but these can be both expensive and complex to collect.
Here, we explore the creation of automated methods for predicting both expert and user ratings of open-domain dialogues.
- Score: 8.625569782672663
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Effective evaluation methods remain a significant challenge for research on
open-domain conversational dialogue systems. Explicit satisfaction ratings can
be elicited from users, but users often do not provide ratings when asked, and
those they give can be highly subjective. Post-hoc ratings by experts are an
alternative, but these can be both expensive and complex to collect. Here, we
explore the creation of automated methods for predicting both expert and user
ratings of open-domain dialogues. We compare four different approaches. First,
we train a baseline model using an end-to-end transformer to predict ratings
directly from the raw dialogue text. The other three methods are variants of a
two-stage approach in which we first extract interpretable features at the turn
level that capture, among other aspects, user dialogue behaviors indicating
contradiction, repetition, disinterest, compliments, or criticism. We project
these features to the dialogue level and train a dialogue-level MLP regression
model, a dialogue-level LSTM, and a novel causal inference model called
counterfactual-LSTM (CF-LSTM) to predict ratings. The proposed CF-LSTM is a
sequential model over turn-level features which predicts ratings using multiple
regressors depending on hypotheses derived from the turn-level features. As a
causal inference model, CF-LSTM aims to learn the underlying causes of a
specific event, such as a low rating. We also bin the user ratings and perform
classification experiments with all four models. In evaluation experiments on
conversational data from the Alexa Prize SocialBot, we show that the CF-LSTM
achieves the best performance on both dialogue rating prediction and rating classification.
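To make the two-stage approach concrete, below is a minimal PyTorch sketch of the CF-LSTM idea as described in the abstract: an LSTM runs over turn-level feature vectors, and a hypothesis head selects among multiple rating regressors. The class name, layer sizes, number of hypotheses, and the soft-selection mechanism are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of a CF-LSTM-style rating predictor (illustrative assumptions,
# not the authors' implementation): turn-level interpretable features are
# encoded with an LSTM, a hypothesis head produces a soft assignment over
# several regressors, and the predicted rating is the weighted mix of their outputs.
import torch
import torch.nn as nn


class CFLSTMSketch(nn.Module):
    def __init__(self, feat_dim: int, hidden_dim: int = 64, n_hypotheses: int = 3):
        super().__init__()
        # Sequential model over turn-level feature vectors.
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        # One regressor per hypothesis (e.g., "user disinterest drove a low rating").
        self.regressors = nn.ModuleList(
            [nn.Linear(hidden_dim, 1) for _ in range(n_hypotheses)]
        )
        # Hypothesis head: soft selection over the regressors.
        self.hypothesis_head = nn.Linear(hidden_dim, n_hypotheses)

    def forward(self, turn_feats: torch.Tensor) -> torch.Tensor:
        # turn_feats: (batch, n_turns, feat_dim) turn-level features per dialogue
        _, (h_n, _) = self.lstm(turn_feats)
        dialogue_repr = h_n[-1]                               # (batch, hidden_dim)
        weights = torch.softmax(self.hypothesis_head(dialogue_repr), dim=-1)
        preds = torch.cat([reg(dialogue_repr) for reg in self.regressors], dim=-1)
        return (weights * preds).sum(dim=-1)                  # one rating per dialogue


# Usage: 8 dialogues, 20 turns each, 12 turn-level features (all sizes arbitrary).
model = CFLSTMSketch(feat_dim=12)
print(model(torch.randn(8, 20, 12)).shape)  # torch.Size([8])
```

The same turn-level features could be mean-pooled to the dialogue level and fed to the MLP regression baseline mentioned in the abstract; only the CF-LSTM variant is sketched here.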
Related papers
- Rethinking the Evaluation for Conversational Recommendation in the Era
of Large Language Models [115.7508325840751]
The recent success of large language models (LLMs) has shown great potential for developing more powerful conversational recommender systems (CRSs).
In this paper, we investigate the use of ChatGPT for conversational recommendation and reveal the inadequacy of the existing evaluation protocol.
We propose iEvaLM, an interactive evaluation approach based on LLMs that harnesses LLM-based user simulators.
arXiv Detail & Related papers (2023-05-22T15:12:43Z)
- SimOAP: Improve Coherence and Consistency in Persona-based Dialogue Generation via Over-sampling and Post-evaluation [54.66399120084227]
Language models trained on large-scale corpora can generate remarkably fluent results in open-domain dialogue.
For the persona-based dialogue generation task, consistency and coherence remain major challenges for language models.
A two-stage strategy, SimOAP, is proposed: over-sampling followed by post-evaluation.
arXiv Detail & Related papers (2023-05-18T17:23:00Z)
- Approximating Online Human Evaluation of Social Chatbots with Prompting [11.657633779338724]
Existing evaluation metrics aim to automate offline user evaluation and approximate human judgment of pre-curated dialogs.
We propose an approach to approximate online human evaluation leveraging large language models (LLMs) from the GPT family.
We introduce a new Dialog system Evaluation framework based on Prompting (DEP), which enables a fully automatic evaluation pipeline.
arXiv Detail & Related papers (2023-04-11T14:45:01Z)
- GODEL: Large-Scale Pre-Training for Goal-Directed Dialog [119.1397031992088]
We introduce GODEL, a large pre-trained language model for dialog.
We show that GODEL outperforms state-of-the-art pre-trained dialog models in few-shot fine-tuning setups.
A novel feature of our evaluation methodology is the introduction of a notion of utility that assesses the usefulness of responses.
arXiv Detail & Related papers (2022-06-22T18:19:32Z)
- User Response and Sentiment Prediction for Automatic Dialogue Evaluation [69.11124655437902]
We propose to use the sentiment of the next user utterance for turn- or dialog-level evaluation.
Experiments show that our model outperforms existing automatic evaluation metrics on both written and spoken open-domain dialogue datasets.
arXiv Detail & Related papers (2021-11-16T22:19:17Z)
- Dialogue Response Ranking Training with Large-Scale Human Feedback Data [52.12342165926226]
We leverage social media feedback data to build a large-scale training dataset for feedback prediction.
We trained DialogRPT, a set of GPT-2-based models, on 133M pairs of human feedback data.
Our ranker outperforms the conventional dialog perplexity baseline by a large margin when predicting Reddit feedback.
arXiv Detail & Related papers (2020-09-15T10:50:05Z)
- Speaker Sensitive Response Evaluation Model [17.381658875470638]
We propose an automatic evaluation model based on the similarity between the generated response and the conversational context.
We learn the model parameters from an unlabeled conversation corpus.
We show that our model can be applied to movie dialogues without any additional training.
arXiv Detail & Related papers (2020-06-12T08:59:10Z)
- Beyond User Self-Reported Likert Scale Ratings: A Comparison Model for Automatic Dialog Evaluation [69.03658685761538]
Open-domain dialog system evaluation is one of the most important challenges in dialog research.
We propose CMADE, an automatic evaluation model that cleans self-reported user ratings as it trains on them.
Our experiments show that CMADE achieves 89.2% accuracy in the dialog comparison task.
arXiv Detail & Related papers (2020-05-21T15:14:49Z)