Deconstruct to Reconstruct a Configurable Evaluation Metric for
Open-Domain Dialogue Systems
- URL: http://arxiv.org/abs/2011.00483v1
- Date: Sun, 1 Nov 2020 11:34:50 GMT
- Title: Deconstruct to Reconstruct a Configurable Evaluation Metric for
Open-Domain Dialogue Systems
- Authors: Vitou Phy, Yang Zhao and Akiko Aizawa
- Abstract summary: In open-domain dialogue, the overall quality comprises various aspects, such as relevancy, specificity, and empathy.
Existing metrics are not designed to cope with such flexibility.
We propose a simple method to combine the metrics for each aspect into a single metric called USL-H.
- Score: 36.73648357051916
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Many automatic evaluation metrics have been proposed to score the overall
quality of a response in open-domain dialogue. Generally, the overall quality
comprises various aspects, such as relevancy, specificity, and empathy,
and the importance of each aspect differs according to the task. For instance,
specificity is mandatory in a food-ordering dialogue task, whereas fluency is
preferred in a language-teaching dialogue system. However, existing metrics are
not designed to cope with such flexibility. For example, the BLEU score
fundamentally relies only on word overlap, whereas BERTScore relies on
semantic similarity between the reference and the candidate response. Thus, they are
not guaranteed to capture a required aspect such as specificity. To design a
metric that is flexible to a given task, we first propose making these qualities
manageable by grouping them into three categories: understandability, sensibleness,
and likability, where likability is a combination of qualities that are
essential for a task. We also propose a simple method to combine the metrics for
each aspect into a single metric called USL-H, which stands for
Understandability, Sensibleness, and Likability in Hierarchy. We demonstrated
that the USL-H score achieves good correlations with human judgment and maintains
its configurability across different aspects and metrics.
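As a reading aid, the hierarchical composition described in the abstract can be sketched in a few lines of code. The snippet below is a minimal, hypothetical illustration: the sub-metric functions, the threshold gating, and the weighting scheme are assumptions made for this sketch, not the trained classifiers or the exact composition rule from the paper. It only mirrors the stated idea that sensibleness should count only for understandable responses, likability only for sensible ones, and that the weights (and the likability bundle itself) are configurable per task.

```python
from dataclasses import dataclass
from typing import Callable

# Each sub-metric maps (context, response) to a score in [0, 1].
SubMetric = Callable[[str, str], float]


@dataclass
class USLHConfig:
    """Task-specific weights, e.g. up-weight likability for a food-ordering bot."""
    w_u: float = 1.0   # understandability
    w_s: float = 1.0   # sensibleness
    w_l: float = 1.0   # likability (task-dependent bundle of qualities)
    gate: float = 0.5  # a lower level must clear this threshold to unlock the next


def uslh_score(context: str, response: str,
               u_fn: SubMetric, s_fn: SubMetric, l_fn: SubMetric,
               cfg: USLHConfig = USLHConfig()) -> float:
    """Hierarchical composition: sensibleness counts only if the response is
    understandable, and likability only if it is also sensible."""
    u = u_fn(context, response)
    s = s_fn(context, response) if u >= cfg.gate else 0.0
    l = l_fn(context, response) if s >= cfg.gate else 0.0
    return (cfg.w_u * u + cfg.w_s * s + cfg.w_l * l) / (cfg.w_u + cfg.w_s + cfg.w_l)


if __name__ == "__main__":
    # Toy stand-ins for the learned sub-metrics (purely illustrative).
    u_fn = lambda c, r: 1.0 if len(r.split()) > 2 else 0.2           # looks like a valid utterance?
    s_fn = lambda c, r: 0.9 if "pizza" in r.lower() else 0.4         # fits the ordering context?
    l_fn = lambda c, r: min(1.0, len(set(r.lower().split())) / 10)   # crude specificity proxy

    ctx = "What would you like to order?"
    resp = "A large pepperoni pizza with extra cheese, please."
    print(uslh_score(ctx, resp, u_fn, s_fn, l_fn))                       # default weights
    print(uslh_score(ctx, resp, u_fn, s_fn, l_fn, USLHConfig(w_l=2.0)))  # specificity-heavy task
```

Under this reading, reconfiguring the metric for a new task amounts to swapping in a different likability sub-metric (for example, an empathy or engagement scorer) and adjusting the weights, while the understandability and sensibleness levels stay fixed.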
Related papers
- Context Does Matter: Implications for Crowdsourced Evaluation Labels in Task-Oriented Dialogue Systems [57.16442740983528]
Crowdsourced labels play a crucial role in evaluating task-oriented dialogue systems.
Previous studies suggest using only a portion of the dialogue context in the annotation process.
This study investigates the influence of dialogue context on annotation quality.
arXiv Detail & Related papers (2024-04-15T17:56:39Z)
- HAUSER: Towards Holistic and Automatic Evaluation of Simile Generation [18.049566239050762]
Proper evaluation metrics are like a beacon guiding the research of simile generation (SG).
To address these issues, we establish HAUSER, a holistic and automatic evaluation system for the SG task, which consists of five criteria from three perspectives and automatic metrics for each criterion.
Our metrics are significantly more correlated with human ratings from each perspective compared with prior automatic metrics.
arXiv Detail & Related papers (2023-06-13T06:06:01Z)
- NLG Evaluation Metrics Beyond Correlation Analysis: An Empirical Metric Preference Checklist [20.448405494617397]
Task-agnostic metrics, such as Perplexity, BLEU, and BERTScore, are cost-effective and highly adaptable to diverse NLG tasks.
Human-aligned metrics (CTC, CtrlEval, UniEval) improve the correlation level by incorporating desirable human-like qualities as a training objective.
We show that automatic metrics provide better guidance than humans in discriminating system-level performance in Text Summarization and Controlled Generation tasks.
arXiv Detail & Related papers (2023-05-15T11:51:55Z)
- PropSegmEnt: A Large-Scale Corpus for Proposition-Level Segmentation and Entailment Recognition [63.51569687229681]
We argue for the need to recognize the textual entailment relation of each proposition in a sentence individually.
We propose PropSegmEnt, a corpus of over 45K propositions annotated by expert human raters.
Our dataset structure resembles the tasks of (1) segmenting sentences within a document into the set of propositions, and (2) classifying the entailment relation of each proposition with respect to a different yet topically-aligned document.
arXiv Detail & Related papers (2022-12-21T04:03:33Z)
- FineD-Eval: Fine-grained Automatic Dialogue-Level Evaluation [58.46761798403072]
We propose a dialogue-level metric that consists of three sub-metrics with each targeting a specific dimension.
The sub-metrics are trained with novel self-supervised objectives and exhibit strong correlations with human judgment for their respective dimensions.
Compared to the existing state-of-the-art metric, the combined metrics achieve around 16% relative improvement on average.
arXiv Detail & Related papers (2022-10-25T08:26:03Z)
- Compression, Transduction, and Creation: A Unified Framework for Evaluating Natural Language Generation [85.32991360774447]
Natural language generation (NLG) spans a broad range of tasks, each of which serves specific objectives.
We propose a unifying perspective based on the nature of information change in NLG tasks.
We develop a family of interpretable metrics that are suitable for evaluating key aspects of different NLG tasks.
arXiv Detail & Related papers (2021-09-14T01:00:42Z)
- Meta-evaluation of Conversational Search Evaluation Metrics [15.942419892035124]
We systematically meta-evaluate a variety of conversational search metrics.
We find that METEOR is the best existing single-turn metric considering all three perspectives.
We also demonstrate that adapted session-based evaluation metrics can be used to measure multi-turn conversational search.
arXiv Detail & Related papers (2021-04-27T20:01:03Z)
- GO FIGURE: A Meta Evaluation of Factuality in Summarization [131.1087461486504]
We introduce GO FIGURE, a meta-evaluation framework for evaluating factuality evaluation metrics.
Our benchmark analysis on ten factuality metrics reveals that our framework provides a robust and efficient evaluation.
It also reveals that while QA metrics generally improve over standard metrics that measure factuality across domains, performance is highly dependent on the way in which questions are generated.
arXiv Detail & Related papers (2020-10-24T08:30:20Z)
- GRUEN for Evaluating Linguistic Quality of Generated Text [17.234442722611803]
We propose GRUEN for evaluating Grammaticality, non-Redundancy, focUs, structure and coherENce of generated text.
GRUEN utilizes a BERT-based model and a class of syntactic, semantic, and contextual features to examine the system output.
arXiv Detail & Related papers (2020-10-06T05:59:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.