Improving Dialog Evaluation with a Multi-reference Adversarial Dataset
and Large Scale Pretraining
- URL: http://arxiv.org/abs/2009.11321v1
- Date: Wed, 23 Sep 2020 18:06:52 GMT
- Title: Improving Dialog Evaluation with a Multi-reference Adversarial Dataset
and Large Scale Pretraining
- Authors: Ananya B. Sai, Akash Kumar Mohankumar, Siddhartha Arora, Mitesh M.
Khapra
- Abstract summary: We introduce the DailyDialog++ dataset, consisting of (i) five relevant responses for each context and (ii) five adversarially crafted irrelevant responses for each context.
We show that even in the presence of multiple correct references, n-gram-based and embedding-based metrics do not perform well at separating relevant responses from even random negatives.
We propose a new BERT-based evaluation metric called DEB, which is pretrained on 727M Reddit conversations and then finetuned on our dataset.
- Score: 18.174086416883412
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There is an increasing focus on model-based dialog evaluation metrics such as
ADEM, RUBER, and the more recent BERT-based metrics. These models aim to assign
a high score to all relevant responses and a low score to all irrelevant
responses. Ideally, such models should be trained using multiple relevant and
irrelevant responses for any given context. However, no such data is publicly
available, and hence existing models are usually trained using a single
relevant response and multiple randomly selected responses from other contexts
(random negatives). To allow for better training and robust evaluation of
model-based metrics, we introduce the DailyDialog++ dataset, consisting of (i)
five relevant responses for each context and (ii) five adversarially crafted
irrelevant responses for each context. Using this dataset, we first show that
even in the presence of multiple correct references, n-gram-based and
embedding-based metrics do not perform well at separating relevant responses
from even random negatives. While model-based metrics perform better than
n-gram-based and embedding-based metrics on random negatives, their performance drops
substantially when evaluated on adversarial examples. To check if large-scale
pretraining could help, we propose a new BERT-based evaluation metric called
DEB, which is pretrained on 727M Reddit conversations and then finetuned on our
dataset. DEB significantly outperforms existing models, showing better
correlation with human judgements and better performance on random negatives
(88.27% accuracy). However, its performance again drops substantially when
evaluated on adversarial responses, thereby highlighting that even large-scale
pretrained evaluation models are not robust to the adversarial examples in our
dataset. The dataset and code are publicly available.
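To make concrete what such a model-based metric computes, the sketch below scores context-response pairs with a BERT next-sentence-prediction head and checks how often relevant responses outscore irrelevant ones. This is a minimal illustration, not the authors' released DEB code: the `bert-base-uncased` checkpoint stands in for the Reddit-pretrained model, and the example context and responses are made up rather than taken from DailyDialog++.
```python
# Minimal sketch of an NSP-style dialog evaluation metric (not the released DEB code).
# Assumptions: bert-base-uncased as a stand-in checkpoint; illustrative example data.
import torch
from transformers import BertForNextSentencePrediction, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")
model.eval()

def relevance_score(context: str, response: str) -> float:
    """Probability that `response` is a continuation of `context` under the NSP head."""
    inputs = tokenizer(context, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits  # shape [1, 2]; index 0 = "is next sentence"
    return torch.softmax(logits, dim=-1)[0, 0].item()

# Illustrative DailyDialog++-style item: one context paired with relevant and
# adversarially similar irrelevant responses (texts here are invented).
context = "I just got back from my trip to the mountains."
relevant = ["How was the weather up there?", "Did you take any photos?"]
irrelevant = ["The mountains of paperwork on my desk are quite a trip."]

scores_pos = [relevance_score(context, r) for r in relevant]
scores_neg = [relevance_score(context, r) for r in irrelevant]

# Separation check: a relevant response should outscore an irrelevant one.
correct = sum(p > n for p in scores_pos for n in scores_neg)
print(f"pairwise accuracy: {correct / (len(scores_pos) * len(scores_neg)):.2f}")
```
With the authors' released checkpoint and actual DailyDialog++ items in place of these stand-ins, the same pairwise check corresponds to the random-negative and adversarial-negative evaluation settings described above.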
Related papers
- Do Smaller Language Models Answer Contextualised Questions Through
Memorisation Or Generalisation? [8.51696622847778]
A distinction is often drawn between a model's ability to predict a label for an evaluation sample by directly memorising highly similar training samples and its ability to do so through generalisation.
We propose a method of identifying evaluation samples for which it is very unlikely our model would have memorised the answers.
arXiv Detail & Related papers (2023-11-21T04:06:08Z)
- Making Retrieval-Augmented Language Models Robust to Irrelevant Context [55.564789967211844]
An important desideratum of RALMs is that retrieved information helps model performance when it is relevant.
Recent work has shown that retrieval augmentation can sometimes have a negative effect on performance.
arXiv Detail & Related papers (2023-10-02T18:52:35Z)
- Unsupervised Dense Retrieval with Relevance-Aware Contrastive Pre-Training [81.3781338418574]
We propose relevance-aware contrastive learning.
We consistently improve the SOTA unsupervised Contriever model on the BEIR and open-domain QA retrieval benchmarks.
Our method not only beats BM25 after further pre-training on the target corpus but also serves as a good few-shot learner.
arXiv Detail & Related papers (2023-06-05T18:20:27Z)
- AB/BA analysis: A framework for estimating keyword spotting recall improvement while maintaining audio privacy [0.0]
KWS is designed to collect data only when the keyword is present, limiting the availability of hard samples that may contain false negatives.
We propose an evaluation technique which we call AB/BA analysis.
We show that AB/BA analysis is successful at measuring recall improvement in conjunction with the trade-off in relative false positive rate.
arXiv Detail & Related papers (2022-04-18T13:52:22Z)
- Impact of Pretraining Term Frequencies on Few-Shot Reasoning [51.990349528930125]
We investigate how well pretrained language models reason with terms that are less frequent in the pretraining data.
We measure the strength of this correlation for a number of GPT-based language models on various numerical deduction tasks.
Although LMs exhibit strong performance at few-shot numerical reasoning tasks, our results raise the question of how much models actually generalize beyond pretraining data.
arXiv Detail & Related papers (2022-02-15T05:43:54Z)
- Logical Reasoning for Task Oriented Dialogue Systems [57.440956636333325]
We propose a novel method to fine-tune transformer models such as RoBERTa and T5 to reason over a set of facts in a given dialogue context.
Our method includes a synthetic data generation mechanism which helps the model learn logical relations.
We show that the transformer based model can perform logical reasoning to answer questions when the dialogue context contains all the required information.
arXiv Detail & Related papers (2022-02-08T21:46:27Z)
- Identifying Untrustworthy Samples: Data Filtering for Open-domain Dialogues with Bayesian Optimization [28.22184410167622]
We present a data filtering method for open-domain dialogues.
We score training samples with a quality measure, sort them in descending order, and filter out those at the bottom.
Experimental results on two datasets show that our method can effectively identify untrustworthy samples.
arXiv Detail & Related papers (2021-09-14T06:42:54Z)
- Synthesizing Adversarial Negative Responses for Robust Response Ranking and Evaluation [34.52276336319678]
Open-domain neural dialogue models have achieved high performance in response ranking and evaluation tasks.
Over-reliance on content similarity makes the models less sensitive to the presence of inconsistencies.
We propose approaches for automatically creating adversarial negative training data.
arXiv Detail & Related papers (2021-06-10T16:20:55Z)
- Dialogue Response Ranking Training with Large-Scale Human Feedback Data [52.12342165926226]
We leverage social media feedback data to build a large-scale training dataset for feedback prediction.
We trained DialogRPT, a set of GPT-2 based models, on 133M pairs of human feedback data.
Our ranker outperforms the conventional dialog perplexity baseline by a large margin on predicting Reddit feedback.
arXiv Detail & Related papers (2020-09-15T10:50:05Z)
- Meta-Learned Confidence for Few-shot Learning [60.6086305523402]
A popular transductive inference technique for few-shot metric-based approaches is to update the prototype of each class with the mean of the most confident query examples.
We propose to meta-learn the confidence for each query sample, to assign optimal weights to unlabeled queries.
We validate our few-shot learning model with meta-learned confidence on four benchmark datasets.
arXiv Detail & Related papers (2020-02-27T10:22:17Z)