Exploring the Impact of Human Evaluator Group on Chat-Oriented Dialogue Evaluation
- URL: http://arxiv.org/abs/2309.07998v1
- Date: Thu, 14 Sep 2023 19:19:50 GMT
- Title: Exploring the Impact of Human Evaluator Group on Chat-Oriented Dialogue Evaluation
- Authors: Sarah E. Finch, James D. Finch, Jinho D. Choi
- Abstract summary: This paper analyzes the evaluator group impact on dialogue system evaluation by testing 4 state-of-the-art dialogue systems using 4 distinct evaluator groups.
Our analysis reveals a robustness to the choice of evaluator group for Likert evaluations that is not seen for Pairwise evaluations, with only minor differences observed when changing evaluator groups.
- Score: 13.651502777079237
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human evaluation has been widely accepted as the standard for evaluating
chat-oriented dialogue systems. However, there is a significant variation in
previous work regarding who gets recruited as evaluators. Evaluator groups such
as domain experts, university students, and professional annotators have been
used to assess and compare dialogue systems, although it is unclear to what
extent the choice of an evaluator group can affect results. This paper analyzes
the evaluator group impact on dialogue system evaluation by testing 4
state-of-the-art dialogue systems using 4 distinct evaluator groups. Our
analysis reveals a robustness to the choice of evaluator group for Likert
evaluations that is not seen for Pairwise evaluations, with only minor
differences observed when changing evaluator groups. Furthermore, two notable limitations to this
robustness are observed, which reveal discrepancies between evaluators with
different levels of chatbot expertise and indicate that evaluator objectivity
is beneficial for certain dialogue metrics.
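
As a rough illustration of the kind of group comparison described above, the sketch below applies a Kruskal-Wallis test to Likert ratings of a single dialogue system collected from several evaluator groups. The group names and scores are hypothetical placeholders, and this is not the paper's data or its exact statistical procedure.

```python
# Illustrative sketch (not the paper's analysis): testing whether Likert
# ratings for one dialogue system differ across evaluator groups.
# The group names and scores below are hypothetical placeholder data.
from scipy.stats import kruskal

likert_by_group = {
    "students":   [4, 5, 3, 4, 4, 5, 3],
    "experts":    [3, 4, 4, 3, 5, 4, 4],
    "crowd":      [5, 4, 4, 5, 3, 4, 5],
    "annotators": [4, 4, 3, 4, 4, 5, 4],
}

# Kruskal-Wallis H-test: a non-parametric check for whether at least one
# group's rating distribution differs from the others.
h_stat, p_value = kruskal(*likert_by_group.values())
print(f"H = {h_stat:.2f}, p = {p_value:.3f}")
if p_value >= 0.05:
    print("No significant evaluator-group effect detected for this system.")
```

A non-significant result under such a test would be consistent with the reported robustness of Likert evaluations to the choice of evaluator group.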
Related papers
- Rethinking the Evaluation of Dialogue Systems: Effects of User Feedback on Crowdworkers and LLMs [57.16442740983528]
In ad-hoc retrieval, evaluation relies heavily on user actions, including implicit feedback.
The role of user feedback in annotators' assessment of turns in a conversational setting has been little studied.
We focus on how the evaluation of task-oriented dialogue systems (TDSs) is affected by considering user feedback, explicit or implicit, as provided through the follow-up utterance of a turn being evaluated.
arXiv Detail & Related papers (2024-04-19T16:45:50Z) - An Analysis of User Behaviors for Objectively Evaluating Spoken Dialogue
Systems [26.003947740875482]
We investigate the relationship between user behaviors and subjective evaluation scores in social dialogue tasks.
The results reveal that in dialogue tasks where user utterances are primary, such as attentive listening and job interviews, indicators like the number of utterances and words play a significant role in evaluation.
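
A minimal sketch of the kind of behavior-score analysis this summary describes, assuming hypothetical per-dialogue counts and subjective scores rather than the paper's data or exact methodology:

```python
# Illustrative sketch: correlating simple user-behavior indicators
# (utterance and word counts) with subjective dialogue scores.
# All numbers below are hypothetical placeholder data.
from scipy.stats import spearmanr

user_utterance_counts = [12, 30, 8, 25, 18, 22]
user_word_counts      = [150, 410, 90, 330, 240, 300]
subjective_scores     = [3.0, 4.5, 2.5, 4.0, 3.5, 4.0]

for name, indicator in [("utterances", user_utterance_counts),
                        ("words", user_word_counts)]:
    rho, p = spearmanr(indicator, subjective_scores)
    print(f"{name}: Spearman rho = {rho:.2f} (p = {p:.3f})")
```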
arXiv Detail & Related papers (2024-01-10T01:02:26Z) - Rethinking Response Evaluation from Interlocutor's Eye for Open-Domain
Dialogue Systems [14.98159964397052]
We analyzed and examined what features are needed in an automatic response evaluator from the interlocutor's perspective.
The first experiment on the Hazumi dataset revealed that interlocutor awareness plays a critical role in making automatic response evaluation correlate with the interlocutor's judgments.
The second experiment using massive conversations on X (formerly Twitter) confirmed that dialogue continuity prediction can train an interlocutor-aware response evaluator without human feedback.
arXiv Detail & Related papers (2024-01-04T13:15:41Z) - Open-Domain Dialogue Quality Evaluation: Deriving Nugget-level Scores
from Turn-level Scores [17.791039417061565]
We propose an evaluation approach where a turn is decomposed into nuggets (i.e., expressions associated with a dialogue act).
We demonstrate the potential effectiveness of our evaluation method through a case study.
arXiv Detail & Related papers (2023-09-30T15:14:50Z) - Don't Forget Your ABC's: Evaluating the State-of-the-Art in
Chat-Oriented Dialogue Systems [12.914512702731528]
This paper presents a novel human evaluation method to estimate the rates of many dialogue system behaviors.
Our method is used to evaluate four state-of-the-art open-domain dialogue systems and compared with existing approaches.
arXiv Detail & Related papers (2022-12-18T22:07:55Z) - MDD-Eval: Self-Training on Augmented Data for Multi-Domain Dialogue
Evaluation [66.60285024216573]
A dialogue evaluator is expected to conduct assessment across domains as well.
Most of the state-of-the-art automatic dialogue evaluation metrics (ADMs) are not designed for multi-domain evaluation.
We are motivated to design a general and robust framework, MDD-Eval, to address the problem.
arXiv Detail & Related papers (2021-12-14T07:01:20Z) - DynaEval: Unifying Turn and Dialogue Level Evaluation [60.66883575106898]
We propose DynaEval, a unified automatic evaluation framework.
It is capable of not only performing turn-level evaluation but also holistically considering the quality of the entire dialogue.
Experiments show that DynaEval significantly outperforms the state-of-the-art dialogue coherence model.
arXiv Detail & Related papers (2021-06-02T12:23:18Z) - Assessing Dialogue Systems with Distribution Distances [48.61159795472962]
We propose to measure the performance of a dialogue system by computing the distribution-wise distance between its generated conversations and real-world conversations.
Experiments on several dialogue corpora show that our proposed metrics correlate better with human judgments than existing metrics.
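
As a hedged illustration, the sketch below computes a Fréchet-style distance between Gaussian fits of two embedding sets standing in for real and generated conversations. This is one common distribution-wise distance, not necessarily the metric proposed in the paper, and the embeddings are random placeholders rather than sentence-encoder outputs.

```python
# Illustrative sketch: a Fréchet-style distance between two sets of
# conversation embeddings (real vs. generated).
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(real_emb: np.ndarray, gen_emb: np.ndarray) -> float:
    """Fréchet distance between Gaussians fit to two embedding sets."""
    mu_r, mu_g = real_emb.mean(axis=0), gen_emb.mean(axis=0)
    cov_r = np.cov(real_emb, rowvar=False)
    cov_g = np.cov(gen_emb, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):  # numerical noise can add tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))

# Placeholder embeddings standing in for sentence-encoder outputs.
rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(200, 8))
gen = rng.normal(0.2, 1.1, size=(200, 8))
print(f"Frechet distance: {frechet_distance(real, gen):.3f}")
```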
arXiv Detail & Related papers (2021-05-06T10:30:13Z) - Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy
Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
ENIGMA requires only a small amount of pre-collected experience data and therefore does not involve human interaction with the target policy during evaluation.
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
arXiv Detail & Related papers (2021-02-20T03:29:20Z) - Is Your Goal-Oriented Dialog Model Performing Really Well? Empirical
Analysis of System-wise Evaluation [114.48767388174218]
This paper presents an empirical analysis on different types of dialog systems composed of different modules in different settings.
Our results show that a pipeline dialog system trained using fine-grained supervision signals at different component levels often obtains better performance than the systems that use joint or end-to-end models trained on coarse-grained labels.
arXiv Detail & Related papers (2020-05-15T05:20:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.