C-PMI: Conditional Pointwise Mutual Information for Turn-level Dialogue
Evaluation
- URL: http://arxiv.org/abs/2306.15245v3
- Date: Fri, 1 Sep 2023 16:11:40 GMT
- Title: C-PMI: Conditional Pointwise Mutual Information for Turn-level Dialogue
Evaluation
- Authors: Liliang Ren, Mankeerat Sidhu, Qi Zeng, Revanth Gangi Reddy, Heng Ji,
ChengXiang Zhai
- Abstract summary: We propose a novel model-agnostic approach to measure the turn-level interaction between the system and the user.
Our approach significantly improves the correlation with human judgment compared with existing evaluation systems.
- Score: 68.59356746305255
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing reference-free turn-level evaluation metrics for chatbots
inadequately capture the interaction between the user and the system.
Consequently, they often correlate poorly with human evaluations. To address
this issue, we propose a novel model-agnostic approach that leverages
Conditional Pointwise Mutual Information (C-PMI) to measure the turn-level
interaction between the system and the user based on a given evaluation
dimension. Experimental results on the widely used FED dialogue evaluation
dataset demonstrate that our approach significantly improves the correlation
with human judgment compared with existing evaluation systems. By replacing the
negative log-likelihood-based scorer with our proposed C-PMI scorer, we achieve
a relative 62.6% higher Spearman correlation on average for the FED evaluation
metric. Our code is publicly available at https://github.com/renll/C-PMI.
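The abstract does not spell out the scorer itself, but a conditional pointwise mutual information between the system response r and the following user turn u, given the dialogue context c, can be written as C-PMI(u; r | c) = log p(u | c, r) - log p(u | c). The sketch below is a minimal illustration of that quantity estimated with an off-the-shelf GPT-2 model from the Hugging Face transformers library; the choice of model, the prompt concatenation, and the helper names (log_prob, c_pmi) are assumptions for illustration only, not the authors' released implementation (see the linked repository for that).

# Hedged sketch (not the authors' implementation): estimating a C-PMI-style
# score with a pretrained LM, assuming
#   C-PMI(u; r | c) = log p(u | c, r) - log p(u | c)
# for dialogue context c, system response r, and next user turn u.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def log_prob(continuation: str, prefix: str) -> float:
    """Sum of token log-probabilities of `continuation` given `prefix`."""
    prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids
    cont_ids = tokenizer(" " + continuation, return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, cont_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # position i predicts token i+1
    targets = input_ids[0, 1:]
    start = prefix_ids.size(1) - 1  # first position whose target is a continuation token
    return log_probs[start:].gather(1, targets[start:, None]).sum().item()

def c_pmi(context: str, response: str, user_turn: str) -> float:
    """Positive when the system response makes the observed user turn more
    likely than the dialogue context alone does."""
    return log_prob(user_turn, context + " " + response) - log_prob(user_turn, context)

# Illustrative usage with a made-up exchange.
print(c_pmi("User: Any plans for the weekend?\nSystem:",
            "I'm going hiking, want to join?",
            "User: Sure, which trail?"))

Under these assumptions, a larger score means the response raises the likelihood of the user's next turn relative to the context alone, which is the kind of turn-level interaction signal the paper targets; the repository above should be treated as the reference implementation.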
Related papers
- MIRROR: A Novel Approach for the Automated Evaluation of Open-Ended Question Generation [0.4857223913212445]
We propose a novel system, MIRROR, to automate the evaluation process for questions generated by automated question generation systems.
We observed that the scores of human evaluation metrics, namely relevance, appropriateness, novelty, complexity, and grammaticality, improved when using the feedback-based approach called MIRROR.
arXiv Detail & Related papers (2024-10-16T12:24:42Z)
- CausalScore: An Automatic Reference-Free Metric for Assessing Response Relevance in Open-Domain Dialogue Systems [43.5428962271088]
We propose a novel metric, called CausalScore, which assesses the relevance of responses by measuring the causal strength between dialogue histories and responses.
Our experimental results demonstrate that CausalScore significantly surpasses existing state-of-the-art metrics by aligning better with human judgements.
arXiv Detail & Related papers (2024-06-25T06:08:16Z)
- Rethinking the Evaluation for Conversational Recommendation in the Era of Large Language Models [115.7508325840751]
The recent success of large language models (LLMs) has shown great potential for developing more powerful conversational recommender systems (CRSs).
In this paper, we embark on an investigation into the utilization of ChatGPT for conversational recommendation, revealing the inadequacy of the existing evaluation protocol.
We propose an interactive evaluation approach based on LLMs, named iEvaLM, which harnesses LLM-based user simulators.
arXiv Detail & Related papers (2023-05-22T15:12:43Z)
- Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation [136.16507050034755]
Existing human evaluation studies for summarization either exhibit a low inter-annotator agreement or have insufficient scale.
We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which is based on fine-grained semantic units.
We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of 22,000 summary-level annotations over 28 top-performing systems.
arXiv Detail & Related papers (2022-12-15T17:26:05Z)
- INFACT: An Online Human Evaluation Framework for Conversational Recommendation [5.837881923712394]
Conversational recommender systems (CRS) are interactive agents that support their users in recommendation-related goals through multi-turn conversations.
Current research on machine learning-based CRS models acknowledges the importance of humans in the evaluation process.
arXiv Detail & Related papers (2022-09-07T15:16:59Z)
- Re-Examining System-Level Correlations of Automatic Summarization Evaluation Metrics [64.81682222169113]
System-level correlations quantify how reliably an automatic summarization evaluation metric replicates human judgments of summary quality.
We identify two ways in which the definition of the system-level correlation is inconsistent with how metrics are used to evaluate systems in practice.
arXiv Detail & Related papers (2022-04-21T15:52:14Z)
- Assessing Dialogue Systems with Distribution Distances [48.61159795472962]
We propose to measure the performance of a dialogue system by computing the distribution-wise distance between its generated conversations and real-world conversations.
Experiments on several dialogue corpora show that our proposed metrics correlate better with human judgments than existing metrics.
arXiv Detail & Related papers (2021-05-06T10:30:13Z)
- Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation.
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
arXiv Detail & Related papers (2021-02-20T03:29:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.