Evaluating Groundedness in Dialogue Systems: The BEGIN Benchmark
- URL: http://arxiv.org/abs/2105.00071v1
- Date: Fri, 30 Apr 2021 20:17:52 GMT
- Title: Evaluating Groundedness in Dialogue Systems: The BEGIN Benchmark
- Authors: Nouha Dziri, Hannah Rashkin, Tal Linzen, David Reitter
- Abstract summary: Knowledge-grounded dialogue agents are systems designed to conduct a conversation based on externally provided background information, such as a Wikipedia page.
We introduce the Benchmark for Evaluation of Grounded INteraction (BEGIN)
BEGIN consists of 8113 dialogue turns generated by language-model-based dialogue systems, accompanied by human annotations specifying the relationship between the system's response and the background information.
- Score: 29.722504033424382
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge-grounded dialogue agents are systems designed to conduct a
conversation based on externally provided background information, such as a
Wikipedia page. Such dialogue agents, especially those based on neural network
language models, often produce responses that sound fluent but are not
justified by the background information. Progress towards addressing this
problem requires developing automatic evaluation metrics that can quantify the
extent to which responses are grounded in background information. To facilitate
evaluation of such metrics, we introduce the Benchmark for Evaluation of
Grounded INteraction (BEGIN). BEGIN consists of 8113 dialogue turns generated
by language-model-based dialogue systems, accompanied by human annotations
specifying the relationship between the system's response and the background
information. These annotations are based on an extension of the natural
language inference paradigm. We use the benchmark to demonstrate the
effectiveness of adversarially generated data for improving an evaluation
metric based on existing natural language inference datasets.
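As a rough illustration of the kind of NLI-based groundedness scorer the abstract alludes to, the sketch below feeds the background evidence and a system response to an off-the-shelf MNLI classifier and reads off the label distribution. The checkpoint name, the three-way label mapping, and the use of the entailment probability as a groundedness signal are assumptions for this sketch, not the paper's metric; BEGIN's annotation scheme extends the standard NLI taxonomy beyond these three labels.

```python
# Minimal sketch of an NLI-style groundedness check, assuming the publicly
# available roberta-large-mnli checkpoint; this is not the paper's metric.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

def groundedness_probs(evidence: str, response: str) -> dict:
    """Treat the evidence as the NLI premise and the response as the
    hypothesis; return the three-way label distribution."""
    inputs = tokenizer(evidence, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1).squeeze(0)
    # Label order for roberta-large-mnli: 0=contradiction, 1=neutral, 2=entailment.
    return {"contradiction": probs[0].item(),
            "neutral": probs[1].item(),
            "entailment": probs[2].item()}

evidence = "The Eiffel Tower is located in Paris and was completed in 1889."
response = "It was finished in 1889, right in the heart of Paris."
print(groundedness_probs(evidence, response))  # high entailment suggests grounding
```

A higher entailment probability can be read as stronger grounding; the paper's contribution is to show that adversarially generated data improves exactly this kind of NLI-derived metric.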
Related papers
- Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language Modelling [70.23876429382969]
We propose a benchmark that can evaluate intra-sentence discourse properties across a diverse set of NLP tasks.
Disco-Bench consists of 9 document-level testsets in the literature domain, which contain rich discourse phenomena.
For linguistic analysis, we also design a diagnostic test suite that can examine whether the target models learn discourse knowledge.
arXiv Detail & Related papers (2023-07-16T15:18:25Z)
- Controllable Mixed-Initiative Dialogue Generation through Prompting [50.03458333265885]
Mixed-initiative dialogue tasks involve repeated exchanges of information and conversational control.
Agents gain control by generating responses that follow particular dialogue intents or strategies, prescribed by a policy planner.
The standard approach has been to fine-tune pre-trained language models to perform generation conditioned on these intents.
We instead prompt large language models as a drop-in replacement for fine-tuning on conditional generation (a prompt-construction sketch follows this entry).
arXiv Detail & Related papers (2023-05-06T23:11:25Z)
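The prompting entry above replaces intent-conditioned fine-tuning with prompting a large language model. The sketch below shows one way a policy planner's intent could be folded into such a prompt; the intent labels, instruction strings, and example dialogue are invented for illustration and are not taken from the paper.

```python
# Hedged sketch: turning a policy-planner intent into a conditional-generation
# prompt. The intent set and wording are hypothetical, not the paper's.

# Hypothetical intents a policy planner might emit.
INTENT_INSTRUCTIONS = {
    "ask_clarifying_question": "Ask the user one concise clarifying question.",
    "provide_information": "Answer using only the facts given in the knowledge snippet.",
    "recommend": "Recommend one option and briefly justify it.",
}

def build_prompt(history, knowledge, intent):
    """Assemble a prompt that conditions generation on the prescribed intent."""
    turns = "\n".join(f"{speaker}: {utterance}" for speaker, utterance in history)
    return (
        "You are a dialogue agent in a mixed-initiative conversation.\n"
        f"Knowledge snippet: {knowledge}\n"
        f"Conversation so far:\n{turns}\n"
        f"Instruction: {INTENT_INSTRUCTIONS[intent]}\n"
        "Agent:"
    )

prompt = build_prompt(
    history=[("User", "I'm looking for a quiet place to work this weekend.")],
    knowledge="The city library opens 9am-8pm on Saturdays and has free study rooms.",
    intent="provide_information",
)
# `prompt` can now be sent to any instruction-following language model in place
# of an intent-conditioned fine-tuned generator.
print(prompt)
```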
- PK-Chat: Pointer Network Guided Knowledge Driven Generative Dialogue Model [79.64376762489164]
PK-Chat is a pointer-network-guided generative dialogue model that combines a unified pretrained language model with a pointer network over knowledge graphs.
The words PK-Chat generates are drawn both from predictions over word lists and directly from the external knowledge graph.
Based on PK-Chat, a dialogue system is built for academic scenarios in the geosciences (a generic pointer-mixture sketch follows this entry).
arXiv Detail & Related papers (2023-04-02T18:23:13Z)
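PK-Chat's output words come partly from predicted word lists and partly from an external knowledge graph. The snippet below sketches a generic pointer-generator mixture in that spirit; it is not PK-Chat's actual architecture, and the toy vocabulary, candidate ids, and mixing weight are illustrative only.

```python
# Generic pointer-mixture step (an assumption-laden sketch, not PK-Chat itself):
# blend a softmax over the decoder vocabulary with a softmax over tokens that
# can be copied from an external knowledge source.
import torch

def mix_vocab_and_copy(vocab_logits, copy_logits, p_gen, copy_token_ids):
    vocab_probs = p_gen * torch.softmax(vocab_logits, dim=-1)        # mass kept for generation
    copy_probs = (1.0 - p_gen) * torch.softmax(copy_logits, dim=-1)  # mass given to copying
    final = vocab_probs.clone()
    final.index_add_(0, copy_token_ids, copy_probs)  # scatter copy mass onto vocabulary ids
    return final

# Toy example: a 10-word vocabulary and 3 copyable knowledge-graph entity tokens.
vocab_logits = torch.randn(10)
copy_logits = torch.randn(3)
copy_token_ids = torch.tensor([2, 5, 7])  # vocabulary ids of the copyable tokens
probs = mix_vocab_and_copy(vocab_logits, copy_logits, p_gen=0.7,
                           copy_token_ids=copy_token_ids)
assert torch.isclose(probs.sum(), torch.tensor(1.0))  # still a valid distribution
```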
- FCC: Fusing Conversation History and Candidate Provenance for Contextual Response Ranking in Dialogue Systems [53.89014188309486]
We present a flexible neural framework that can integrate contextual information from multiple channels.
We evaluate our model on the MSDialog dataset, which is widely used for evaluating conversational response ranking tasks.
arXiv Detail & Related papers (2023-03-31T23:58:28Z)
- Increasing Faithfulness in Knowledge-Grounded Dialogue with Controllable Features [16.676172815172166]
We discuss the challenges of training a generative neural dialogue model for such systems that is controlled to stay faithful to the evidence.
Existing datasets contain a mix of conversational responses that are faithful to selected evidence as well as more subjective or chit-chat style responses.
We propose different evaluation measures to disentangle these styles of responses by quantifying their informativeness and objectivity.
arXiv Detail & Related papers (2021-07-14T19:52:12Z)
- Assessing Dialogue Systems with Distribution Distances [48.61159795472962]
We propose to measure the performance of a dialogue system by computing the distribution-wise distance between its generated conversations and real-world conversations.
Experiments on several dialogue corpora show that our proposed metrics correlate better with human judgments than existing metrics (a distribution-distance sketch follows this entry).
arXiv Detail & Related papers (2021-05-06T10:30:13Z)
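The distribution-wise distance in the entry above can be instantiated in several ways; the sketch below assumes one common choice, a Fréchet distance between Gaussians fitted to utterance embeddings of real and generated conversations, which may differ from the paper's exact formulation.

```python
# Fréchet-style distance between two sets of utterance embeddings
# (an assumed instantiation of a distribution-wise dialogue metric).
import numpy as np
from scipy import linalg

def frechet_distance(real_embs, gen_embs):
    """real_embs, gen_embs: arrays of shape (n_utterances, dim)."""
    mu_r, mu_g = real_embs.mean(axis=0), gen_embs.mean(axis=0)
    cov_r = np.cov(real_embs, rowvar=False)
    cov_g = np.cov(gen_embs, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)  # matrix square root
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))

# Toy usage with random "embeddings"; in practice these would come from a
# sentence encoder run over real and system-generated conversations.
rng = np.random.default_rng(0)
real = rng.normal(size=(200, 8))
gen = rng.normal(loc=0.3, size=(200, 8))
print(frechet_distance(real, gen))
```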
- On the Use of Linguistic Features for the Evaluation of Generative Dialogue Systems [17.749995931459136]
We propose that a metric based on linguistic features may be able to maintain good correlation with human judgment and be interpretable.
To support this proposition, we measure and analyze various linguistic features on dialogues produced by multiple dialogue models.
We find that the features' behaviour is consistent with the known properties of the models tested, and is similar across domains (see the feature-extraction sketch after this entry).
arXiv Detail & Related papers (2021-04-13T16:28:00Z)
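As a small, concrete illustration of interpretable linguistic features computed over model outputs, the sketch below measures a few surface statistics; the feature set is assumed for illustration and is far narrower than the analysis in the paper above.

```python
# A few interpretable surface features over a list of system responses
# (an illustrative feature set, not the paper's).
from collections import Counter

def linguistic_features(responses):
    tokens_per_response = [r.lower().split() for r in responses]
    all_tokens = [tok for toks in tokens_per_response for tok in toks]
    counts = Counter(all_tokens)
    return {
        "avg_response_length": sum(len(toks) for toks in tokens_per_response) / len(responses),
        "type_token_ratio": len(counts) / max(len(all_tokens), 1),
        "repetition_rate": sum(c - 1 for c in counts.values()) / max(len(all_tokens), 1),
    }

print(linguistic_features([
    "i am not sure about that",
    "i am not sure",
    "the eiffel tower is in paris",
]))
```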
- Natural Language Understanding for Argumentative Dialogue Systems in the Opinion Building Domain [6.951113351928047]
This paper introduces a framework for argumentative dialogue systems in the information-seeking domain.
Our approach distinguishes multiple user intents and identifies system arguments the user refers to in his or her natural language utterances.
arXiv Detail & Related papers (2021-03-03T21:17:24Z)
- Learning an Unreferenced Metric for Online Dialogue Evaluation [53.38078951628143]
We propose an unreferenced automated evaluation metric that uses large pre-trained language models to extract latent representations of utterances.
We show that our model achieves higher correlation with human annotations in an online setting, while not requiring true responses for comparison during inference (see the sketch after this entry).
arXiv Detail & Related papers (2020-05-01T20:01:39Z)
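Very roughly, an unreferenced scorer of the kind described above can be approximated by embedding the dialogue context and the candidate response with a pretrained encoder and measuring their similarity. The encoder name and the plain cosine score below are assumptions for this sketch; the paper builds a learned metric on top of such pretrained representations rather than using raw similarity.

```python
# Simplified unreferenced scoring sketch: no gold response is needed, only the
# context and the candidate. The encoder choice is an assumption.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder would do

def unreferenced_score(context: str, response: str) -> float:
    ctx_vec, resp_vec = encoder.encode([context, response])
    return float(np.dot(ctx_vec, resp_vec) /
                 (np.linalg.norm(ctx_vec) * np.linalg.norm(resp_vec)))

print(unreferenced_score(
    "Have you seen any good films lately?",
    "Yes, I watched a documentary about deep-sea exploration last night.",
))
```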