DEnsity: Open-domain Dialogue Evaluation Metric using Density Estimation
- URL: http://arxiv.org/abs/2305.04720v2
- Date: Thu, 25 May 2023 11:40:59 GMT
- Title: DEnsity: Open-domain Dialogue Evaluation Metric using Density Estimation
- Authors: ChaeHun Park, Seungil Chad Lee, Daniel Rim, and Jaegul Choo
- Abstract summary: We propose DEnsity, which evaluates a response by utilizing density estimation on the feature space derived from a neural classifier.
Our metric measures how likely a response would appear in the distribution of human conversations.
- Score: 24.224114300690758
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Despite the recent advances in open-domain dialogue systems, building a
reliable evaluation metric is still a challenging problem. Recent studies
proposed learnable metrics based on classification models trained to
distinguish the correct response. However, neural classifiers are known to make
overly confident predictions for examples from unseen distributions. We propose
DEnsity, which evaluates a response by utilizing density estimation on the
feature space derived from a neural classifier. Our metric measures how likely
a response would appear in the distribution of human conversations. Moreover,
to improve the performance of DEnsity, we utilize contrastive learning to
further compress the feature space. Experiments on multiple response evaluation
datasets show that DEnsity correlates better with human evaluations than the
existing metrics. Our code is available at https://github.com/ddehun/DEnsity.
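As a concrete illustration of the scoring idea, here is a minimal sketch, assuming a Gaussian density model over features and using random vectors as stand-ins for the classifier's penultimate-layer features; the paper's actual feature extractor, contrastive compression, and density estimator may differ.

```python
# Minimal sketch of density-estimation scoring on classifier features.
import numpy as np

def fit_gaussian(features):
    """Fit the mean and (regularized) inverse covariance of human-response features."""
    mu = features.mean(axis=0)
    cov = np.cov(features, rowvar=False) + 1e-6 * np.eye(features.shape[1])
    return mu, np.linalg.inv(cov)

def density_score(feature, mu, cov_inv):
    """Negative Mahalanobis distance: higher means more 'human-like'."""
    diff = feature - mu
    return -np.sqrt(diff @ cov_inv @ diff)

# Toy usage: random vectors stand in for penultimate-layer classifier features.
rng = np.random.default_rng(0)
human_feats = rng.normal(size=(500, 16))   # features of human responses
mu, cov_inv = fit_gaussian(human_feats)
candidate = rng.normal(size=16)            # feature of a generated response
print(density_score(candidate, mu, cov_inv))
```

In this formulation, a response whose features fall far from the human-conversation distribution receives a low score, which is exactly the case where a classifier's raw confidence can be misleadingly high.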
Related papers
- Cobra Effect in Reference-Free Image Captioning Metrics [58.438648377314436]
A proliferation of reference-free methods, leveraging vision-language pre-trained models (VLMs), has emerged.
In this paper, we study if there are any deficiencies in reference-free metrics.
We employ GPT-4V as an evaluative tool to assess generated sentences, and the results reveal that our approach achieves state-of-the-art (SOTA) performance.
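For context, reference-free captioning metrics of this family typically score a caption by a rescaled image-text embedding similarity. The sketch below follows the CLIPScore formulation (w * max(cos, 0)) with random stand-in embeddings; a real metric would obtain them from a pretrained VLM such as CLIP.

```python
import numpy as np

def clipscore_like(image_emb, text_emb, w=2.5):
    """Rescaled cosine similarity, clipped at zero (the CLIPScore form)."""
    cos = image_emb @ text_emb / (np.linalg.norm(image_emb) * np.linalg.norm(text_emb))
    return w * max(cos, 0.0)

# Random stand-ins for VLM embeddings of an image and a candidate caption.
rng = np.random.default_rng(1)
img_emb, cap_emb = rng.normal(size=512), rng.normal(size=512)
print(clipscore_like(img_emb, cap_emb))
```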
arXiv Detail & Related papers (2024-02-18T12:36:23Z)
- Learning and Evaluating Human Preferences for Conversational Head Generation [101.89332968344102]
We propose a novel learning-based evaluation metric named Preference Score (PS), which fits human preference based on quantitative evaluations across different dimensions.
PS can serve as a quantitative evaluation without the need for human annotation.
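A hedged sketch of one way such a metric can be fit, assuming (contrary to the paper's actual PS architecture, which is not specified here) a simple linear model: regress human preference judgments onto per-dimension quality scores, then reuse the fitted model in place of new annotation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy per-dimension quality scores (e.g. lip sync, naturalness, ...) and
# synthetic "human preference" targets; both are stand-ins.
rng = np.random.default_rng(2)
dim_scores = rng.uniform(size=(200, 4))
human_pref = dim_scores @ np.array([0.4, 0.3, 0.2, 0.1]) + rng.normal(0, 0.05, 200)

# Fit once on annotated data, then score new systems without annotation.
ps_model = LinearRegression().fit(dim_scores, human_pref)
print(ps_model.predict(dim_scores[:3]))
```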
arXiv Detail & Related papers (2023-07-20T07:04:16Z)
- Language Model Classifier Aligns Better with Physician Word Sensitivity than XGBoost on Readmission Prediction [86.15787587540132]
We introduce the sensitivity score, a metric that scrutinizes models' behaviors at the vocabulary level.
Our experiments compare the decision-making logic of clinicians and classifiers based on rank correlations of sensitivity scores.
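A hypothetical toy version of this comparison: define a token's sensitivity as the change in predicted probability when it is removed, then correlate the model's token ranking with a clinician's. The toy model and the clinician ratings below are invented for illustration.

```python
import numpy as np
from scipy.stats import spearmanr

def sensitivity_scores(predict_proba, tokens):
    """Absolute change in predicted probability when each token is dropped."""
    base = predict_proba(tokens)
    return np.array([abs(base - predict_proba(tokens[:i] + tokens[i + 1:]))
                     for i in range(len(tokens))])

# Toy readmission model: "sepsis" and note length drive the prediction.
def toy_model(tokens):
    return 0.1 + 0.6 * ("sepsis" in tokens) + 0.01 * len(tokens)

note = ["patient", "with", "sepsis", "discharged"]
model_sens = sensitivity_scores(toy_model, note)
clinician_sens = [0.1, 0.0, 0.9, 0.2]   # invented clinician importance ratings
print(spearmanr(model_sens, clinician_sens).correlation)
```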
arXiv Detail & Related papers (2022-11-13T23:59:11Z)
- A Study on the Evaluation of Generative Models [19.18642459565609]
Implicit generative models, which do not return likelihood values, have become prevalent in recent years.
In this work, we study the evaluation metrics of generative models by generating a high-quality synthetic dataset.
Our study shows that while FID and IS do correlate to several f-divergences, their ranking of close models can vary considerably.
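For reference, FID itself is the Fréchet distance between Gaussians fitted to two feature sets; the sketch below computes it with random features standing in for Inception activations.

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two feature sets."""
    mu_a, mu_b = feats_a.mean(0), feats_b.mean(0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = sqrtm(cov_a @ cov_b).real   # drop tiny imaginary parts
    return np.sum((mu_a - mu_b) ** 2) + np.trace(cov_a + cov_b - 2 * covmean)

# Random features stand in for Inception activations of two sample sets.
rng = np.random.default_rng(3)
print(fid(rng.normal(size=(1000, 64)), rng.normal(0.1, 1.0, size=(1000, 64))))
```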
arXiv Detail & Related papers (2022-06-22T09:27:31Z)
- REAM$\sharp$: An Enhancement Approach to Reference-based Evaluation Metrics for Open-domain Dialog Generation [63.46331073232526]
We present an enhancement approach to Reference-based EvAluation Metrics for open-domain dialogue systems.
A prediction model is designed to estimate the reliability of the given reference set.
We show how its predicted results can be helpful to augment the reference set, and thus improve the reliability of the metric.
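A hedged sketch of how a reliability estimate could plug into a reference-based metric: weight each reference by its predicted reliability when aggregating similarity scores. The `similarity` and `reliability` functions below are toy stand-ins for the paper's base metric and learned prediction model.

```python
import numpy as np

def enhanced_score(response, references, similarity, reliability):
    """Reference-based score with references weighted by predicted reliability."""
    sims = np.array([similarity(response, ref) for ref in references])
    weights = np.array([reliability(ref) for ref in references])
    return np.average(sims, weights=weights)

# Toy stand-ins: word-overlap similarity and a length-based reliability proxy.
sim = lambda a, b: len(set(a.split()) & set(b.split())) / max(len(set(b.split())), 1)
rel = lambda ref: min(len(ref.split()) / 10, 1.0)

refs = ["i am doing well thanks", "fine and you", "good"]
print(enhanced_score("i am fine thanks", refs, sim, rel))
```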
arXiv Detail & Related papers (2021-05-30T10:04:13Z)
- Improving Dialog Evaluation with a Multi-reference Adversarial Dataset and Large Scale Pretraining [18.174086416883412]
We introduce the DailyDialog++ dataset, consisting of (i) five relevant responses for each context and (ii) five adversarially crafted irrelevant responses for each context.
We show that even in the presence of multiple correct references, n-gram based metrics and embedding based metrics do not perform well at separating relevant responses from even random negatives.
We propose a new BERT-based evaluation metric called DEB, which is pretrained on 727M Reddit conversations and then finetuned on our dataset.
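Structurally, this is a BERT pair classifier over (context, response). The sketch below shows that scoring interface with an off-the-shelf checkpoint; note that `bert-base-uncased` is untrained for the task, whereas DEB depends on the Reddit pretraining and DailyDialog++ finetuning described above.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)   # label 1 = "relevant response"

def deb_like_score(context, response):
    """Probability that `response` is a valid continuation of `context`."""
    inputs = tok(context, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()

print(deb_like_score("How was your day?", "Pretty good, I went hiking."))
```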
arXiv Detail & Related papers (2020-09-23T18:06:52Z)
- Neural Methods for Point-wise Dependency Estimation [129.93860669802046]
We focus on estimating point-wise dependency (PD), which quantitatively measures how likely two outcomes co-occur.
We demonstrate the effectiveness of our approaches in 1) MI estimation, 2) self-supervised representation learning, and 3) cross-modal retrieval.
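Point-wise dependency is the ratio PD(x, y) = p(x, y) / (p(x) p(y)). A standard way to estimate it neurally (a density-ratio sketch, not necessarily the paper's exact estimator) is to train a classifier to separate true pairs from shuffled pairs; with balanced classes, the classifier's odds estimate PD.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
x = rng.normal(size=2000)
y = x + 0.5 * rng.normal(size=2000)      # y depends on x

def feats(a, b):
    # Quadratic features so a linear classifier can fit the Gaussian log-ratio.
    return np.column_stack([a, b, a * b, a ** 2, b ** 2])

# Label 1 = true pairs (joint), label 0 = shuffled pairs (product of marginals).
X = np.vstack([feats(x, y), feats(x, rng.permutation(y))])
labels = np.r_[np.ones(2000), np.zeros(2000)]
clf = LogisticRegression(max_iter=1000).fit(X, labels)

p = clf.predict_proba(feats(np.array([1.0]), np.array([1.0])))[0, 1]
print(p / (1 - p))                       # classifier odds estimate PD(x=1, y=1)
```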
arXiv Detail & Related papers (2020-06-09T23:26:15Z)
- Calibrated neighborhood aware confidence measure for deep metric learning [0.0]
Deep metric learning has been successfully applied to problems in few-shot learning, image retrieval, and open-set classifications.
However, measuring the confidence of a deep metric learning model and identifying unreliable predictions remains an open challenge.
This paper focuses on defining a calibrated and interpretable confidence metric that closely reflects its classification accuracy.
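A minimal sketch of a neighborhood-aware confidence, assuming a learned embedding space: report the fraction of the k nearest training embeddings that agree with the predicted label. The 2-D toy embeddings and the absence of an explicit calibration step are simplifications of the paper's method.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy 2-D embeddings for two classes stand in for a learned metric space.
rng = np.random.default_rng(5)
train_emb = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])
train_lab = np.r_[np.zeros(100), np.ones(100)]

nn = NearestNeighbors(n_neighbors=10).fit(train_emb)

def knn_confidence(query):
    """Predicted label plus the fraction of neighbors that agree with it."""
    _, idx = nn.kneighbors([query])
    votes = train_lab[idx[0]]
    pred = 1.0 if votes.mean() >= 0.5 else 0.0
    return pred, float(np.mean(votes == pred))

print(knn_confidence(np.array([3.5, 3.8])))   # near class 1: high agreement
print(knn_confidence(np.array([2.0, 2.0])))   # between classes: lower agreement
```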
arXiv Detail & Related papers (2020-06-08T21:05:38Z)
- Towards GAN Benchmarks Which Require Generalization [48.075521136623564]
We argue that, for a benchmark, estimating the evaluation function must require a large sample from the model.
We turn to neural network divergences (NNDs), which are defined in terms of a neural network trained to distinguish between distributions.
The resulting benchmarks cannot be "won" by training set memorization, while still being perceptually correlated and computable only from samples.
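A toy NND in that spirit: train a small critic to separate samples from two distributions, then score it only on fresh samples, so memorizing the training set cannot inflate the result. The architecture and data below are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
real = torch.randn(1000, 2)              # "data" distribution
fake = torch.randn(1000, 2) + 1.0        # shifted "model" distribution

critic = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(critic.parameters(), lr=1e-2)
loss_fn = nn.BCEWithLogitsLoss()
labels = torch.cat([torch.ones(1000, 1), torch.zeros(1000, 1)])

for _ in range(200):                     # train critic to separate the samples
    opt.zero_grad()
    loss_fn(critic(torch.cat([real, fake])), labels).backward()
    opt.step()

with torch.no_grad():                    # evaluate on *fresh* samples only
    acc_real = (critic(torch.randn(500, 2)) > 0).float().mean()
    acc_fake = (critic(torch.randn(500, 2) + 1.0) < 0).float().mean()
print(((acc_real + acc_fake) / 2).item())   # ~0.5 means indistinguishable
```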
arXiv Detail & Related papers (2020-01-10T20:18:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.