On the Use of Linguistic Features for the Evaluation of Generative
Dialogue Systems
- URL: http://arxiv.org/abs/2104.06335v1
- Date: Tue, 13 Apr 2021 16:28:00 GMT
- Title: On the Use of Linguistic Features for the Evaluation of Generative
Dialogue Systems
- Authors: Ian Berlot-Attwell and Frank Rudzicz
- Abstract summary: We propose that a metric based on linguistic features may be able to maintain good correlation with human judgment and be interpretable.
To support this proposition, we measure and analyze various linguistic features on dialogues produced by multiple dialogue models.
We find that the features' behaviour is consistent with the known properties of the models tested, and is similar across domains.
- Score: 17.749995931459136
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automatically evaluating text-based, non-task-oriented dialogue systems
(i.e., 'chatbots') remains an open problem. Previous approaches have suffered
challenges ranging from poor correlation with human judgment to poor
generalization and have often required a gold standard reference for comparison
or human-annotated data. Extending existing evaluation methods, we propose that
a metric based on linguistic features may be able to maintain good correlation
with human judgment and be interpretable, without requiring a gold-standard
reference or human-annotated data. To support this proposition, we measure and
analyze various linguistic features on dialogues produced by multiple dialogue
models. We find that the features' behaviour is consistent with the known
properties of the models tested, and is similar across domains. We also
demonstrate that this approach exhibits promising properties such as zero-shot
generalization to new domains on the related task of evaluating response
relevance.
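As an illustration of the kind of reference-free, interpretable signal the abstract describes, the sketch below computes a few surface-level linguistic features over a system's dialogue turns. The particular feature set (mean turn length, type-token ratio, question rate) and the function name are assumptions chosen for illustration, not the authors' actual feature set or implementation.

```python
# Minimal sketch: interpretable surface features on generated dialogue turns,
# computed without a gold reference or human annotations.
# The feature set below is illustrative only.
from statistics import mean

def linguistic_features(utterances):
    """Return a few surface-level features for a list of system turns."""
    tokens_per_turn = [u.lower().split() for u in utterances]
    all_tokens = [tok for turn in tokens_per_turn for tok in turn]
    return {
        # Average number of tokens per system turn.
        "avg_turn_length": mean(len(turn) for turn in tokens_per_turn),
        # Lexical diversity: distinct tokens / total tokens (type-token ratio).
        "type_token_ratio": len(set(all_tokens)) / max(len(all_tokens), 1),
        # Fraction of turns ending in a question mark (a crude engagement proxy).
        "question_rate": mean(u.strip().endswith("?") for u in utterances),
    }

if __name__ == "__main__":
    turns = ["I like hiking. Do you?", "I do not know.", "What is your favourite trail?"]
    print(linguistic_features(turns))
```

Features of this kind can be compared across models or domains directly, which is what makes them interpretable: a shift in any one feature points to a concrete property of the generated text.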
Related papers
- CausalScore: An Automatic Reference-Free Metric for Assessing Response Relevance in Open-Domain Dialogue Systems [43.5428962271088]
We propose a novel metric, called CausalScore, which assesses the relevance of responses by measuring the causal strength between dialogue histories and responses.
Our experimental results demonstrate that CausalScore significantly surpasses existing state-of-the-art metrics by aligning better with human judgements.
arXiv Detail & Related papers (2024-06-25T06:08:16Z)
- Bridging the Gap: A Survey on Integrating (Human) Feedback for Natural Language Generation [68.9440575276396]
This survey aims to provide an overview of the recent research that has leveraged human feedback to improve natural language generation.
First, we introduce an encompassing formalization of feedback, and identify and organize existing research into a taxonomy following this formalization.
Second, we discuss how feedback can be described by its format and objective, and cover the two approaches proposed to use feedback (either for training or decoding): directly using the feedback or training feedback models.
Third, we provide an overview of the nascent field of AI feedback, which exploits large language models to make judgments based on a set of principles and minimize the need for human intervention.
arXiv Detail & Related papers (2023-05-01T17:36:06Z)
- GODEL: Large-Scale Pre-Training for Goal-Directed Dialog [119.1397031992088]
We introduce GODEL, a large pre-trained language model for dialog.
We show that GODEL outperforms state-of-the-art pre-trained dialog models in few-shot fine-tuning setups.
A novel feature of our evaluation methodology is the introduction of a notion of utility that assesses the usefulness of responses.
arXiv Detail & Related papers (2022-06-22T18:19:32Z)
- Evaluating Groundedness in Dialogue Systems: The BEGIN Benchmark [29.722504033424382]
Knowledge-grounded dialogue agents are systems designed to conduct a conversation based on externally provided background information, such as a Wikipedia page.
We introduce the Benchmark for Evaluation of Grounded INteraction (BEGIN).
BEGIN consists of 8113 dialogue turns generated by language-model-based dialogue systems, accompanied by human annotations specifying the relationship between the system's response and the background information.
arXiv Detail & Related papers (2021-04-30T20:17:52Z)
- Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation.
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
arXiv Detail & Related papers (2021-02-20T03:29:20Z)
- I like fish, especially dolphins: Addressing Contradictions in Dialogue Modeling [104.09033240889106]
We introduce the DialoguE COntradiction DEtection task (DECODE) and a new conversational dataset containing both human-human and human-bot contradictory dialogues.
We then compare a structured utterance-based approach of using pre-trained Transformer models for contradiction detection with the typical unstructured approach.
arXiv Detail & Related papers (2020-12-24T18:47:49Z)
- How To Evaluate Your Dialogue System: Probe Tasks as an Alternative for Token-level Evaluation Metrics [47.20761880464552]
Generative dialogue modeling is widely seen as a language modeling task.
The task requires an agent to have a complex natural language understanding of its input text in order to carry out a meaningful interaction with a user.
The automatic metrics in use evaluate the quality of the generated text as a proxy for the agent's holistic interaction.
arXiv Detail & Related papers (2020-08-24T13:28:35Z)
- Learning an Unreferenced Metric for Online Dialogue Evaluation [53.38078951628143]
We propose an unreferenced automated evaluation metric that uses large pre-trained language models to extract latent representations of utterances.
We show that our model achieves higher correlation with human annotations in an online setting, while not requiring true responses for comparison during inference.
arXiv Detail & Related papers (2020-05-01T20:01:39Z)
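For intuition, the sketch below shows one generic way an unreferenced relevance score of the sort described in the last entry could be computed from frozen pre-trained representations: embed the dialogue context and the candidate response with a pre-trained encoder and score them by cosine similarity. The encoder choice (bert-base-uncased), mean pooling, and cosine scoring are assumptions for illustration; the paper's actual metric is learned rather than this fixed heuristic.

```python
# Illustrative only: scoring a response against its dialogue context using
# frozen pre-trained embeddings, with no reference response required.
# The model name, pooling, and similarity function are assumptions,
# not the cited paper's setup.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def embed(text: str) -> torch.Tensor:
    """Mean-pool the encoder's last hidden layer into one utterance vector."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # shape: (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)

def unreferenced_relevance(context: str, response: str) -> float:
    """Score a response against its context; no gold reference is involved."""
    return torch.nn.functional.cosine_similarity(
        embed(context), embed(response), dim=0
    ).item()

print(unreferenced_relevance("Do you enjoy hiking?", "Yes, I hike every weekend."))
```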