UNION: An Unreferenced Metric for Evaluating Open-ended Story Generation
- URL: http://arxiv.org/abs/2009.07602v1
- Date: Wed, 16 Sep 2020 11:01:46 GMT
- Title: UNION: An Unreferenced Metric for Evaluating Open-ended Story Generation
- Authors: Jian Guan, Minlie Huang
- Abstract summary: UNION is a learnable unreferenced metric for evaluating open-ended story generation.
It is trained to distinguish human-written stories from negative samples and recover the perturbation in negative stories.
Experiments on two story datasets demonstrate that UNION is a reliable measure for evaluating the quality of generated stories.
- Score: 92.42032403795879
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the success of existing referenced metrics (e.g., BLEU and
MoverScore), they correlate poorly with human judgments for open-ended text
generation including story or dialog generation because of the notorious
one-to-many issue: there are many plausible outputs for the same input, which
may differ substantially in wording or semantics from the limited number of
given references. To alleviate this issue, we propose UNION, a learnable
unreferenced metric for evaluating open-ended story generation, which measures
the quality of a generated story without any reference. Built on top of BERT,
UNION is trained to distinguish human-written stories from negative samples and
recover the perturbation in negative stories. We propose an approach of
constructing negative samples by mimicking the errors commonly observed in
existing NLG models, including repeated plots, conflicting logic, and
long-range incoherence. Experiments on two story datasets demonstrate that
UNION is a reliable measure for evaluating the quality of generated stories,
which correlates better with human judgments and is more generalizable than
existing state-of-the-art metrics.
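The released UNION code defines its own perturbation pipeline and training objective; the sketch below is only a rough illustration of the idea described above, building negative samples by repeating and shuffling sentences and scoring a story with a BERT classifier. The checkpoint path, the specific perturbations, and the class layout are assumptions made for illustration, not the paper's exact procedure.

```python
# Minimal sketch of a UNION-style unreferenced story metric (illustrative only).
# Assumes a BERT classifier has already been fine-tuned to separate human-written
# stories from automatically perturbed negatives; the checkpoint is a placeholder.
import random

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

def make_negative(story_sentences, rng=random):
    """Build a negative sample by mimicking common NLG errors:
    repeat a sentence (repeated plots) and shuffle the order (incoherence)."""
    sents = list(story_sentences)
    sents.insert(rng.randrange(len(sents)), rng.choice(sents))  # repetition
    rng.shuffle(sents)                                          # broken coherence
    return " ".join(sents)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "path/to/union-style-checkpoint"  # hypothetical fine-tuned checkpoint
)

def union_style_score(story: str) -> float:
    """Probability that the story looks human-written, used as a quality score."""
    inputs = tokenizer(story, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()  # assumes class 1 = human-written
```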
Related papers
- Using Similarity to Evaluate Factual Consistency in Summaries [2.7595794227140056]
Abstractive summarisers generate fluent summaries, but the factuality of the generated text is not guaranteed.
We propose a new zero-shot factuality evaluation metric, Sentence-BERTScore (SBERTScore), which compares sentences between the summary and the source document.
Our experiments indicate that each technique has different strengths, with SBERTScore particularly effective in identifying correct summaries.
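The paper defines SBERTScore precisely; the snippet below is only a minimal sketch of the underlying operation, comparing each summary sentence to the source sentences with sentence embeddings. The embedding model and the max-over-source aggregation are assumptions, not the authors' specification.

```python
# Rough sentence-level similarity check between a summary and its source
# (an illustration of the idea behind SBERTScore, not the paper's exact metric).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def sentence_support_scores(summary_sents, source_sents):
    """For each summary sentence, return the best cosine similarity against any
    source sentence; low values flag possibly unsupported content."""
    summ_emb = model.encode(summary_sents, convert_to_tensor=True)
    src_emb = model.encode(source_sents, convert_to_tensor=True)
    sims = util.cos_sim(summ_emb, src_emb)  # shape: [num_summary, num_source]
    return sims.max(dim=1).values.tolist()
```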
arXiv Detail & Related papers (2024-09-23T15:02:38Z)
- Towards Multiple References Era -- Addressing Data Leakage and Limited Reference Diversity in NLG Evaluation [55.92852268168816]
N-gram matching-based evaluation metrics, such as BLEU and chrF, are widely utilized across a range of natural language generation (NLG) tasks.
Recent studies have revealed a weak correlation between these matching-based metrics and human evaluations.
We propose to utilize multiple references to enhance the consistency between these metrics and human evaluations.
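Independently of the paper's benchmark, the mechanics of multi-reference scoring with matching-based metrics can be seen in a few lines of sacreBLEU, where each additional reference stream supplies an alternative reference per hypothesis (the example sentences here are made up):

```python
# Multi-reference scoring with sacreBLEU: each reference stream is a list
# parallel to the hypotheses, and extra streams add alternative references.
import sacrebleu

hyps = ["the cat sat on the mat"]
refs = [
    ["a cat was sitting on the mat"],  # reference set 1
    ["the cat is on the mat"],         # reference set 2
]

print(sacrebleu.corpus_bleu(hyps, refs).score)
print(sacrebleu.corpus_chrf(hyps, refs).score)
```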
arXiv Detail & Related papers (2023-08-06T14:49:26Z)
- DeltaScore: Fine-Grained Story Evaluation with Perturbations [69.33536214124878]
We introduce DELTASCORE, a novel methodology that employs perturbation techniques for the evaluation of nuanced story aspects.
Our central premise is that the extent to which a story excels in a specific aspect (e.g., fluency) correlates with the magnitude of its susceptibility to particular perturbations.
We measure the quality of an aspect by calculating the likelihood difference between pre- and post-perturbation states using pre-trained language models.
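The paper specifies its own perturbations and scoring models; the following sketch only illustrates the core quantity, a log-likelihood difference between an original and a perturbed story under a pre-trained language model, with GPT-2 used here purely as an assumed stand-in.

```python
# Log-likelihood difference between an original and a perturbed story under GPT-2
# (illustrates the quantity DELTASCORE builds on; model and perturbations are assumed).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def avg_log_likelihood(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean negative log-likelihood per token
    return -loss.item()

def delta(original: str, perturbed: str) -> float:
    """Larger drops suggest the aspect targeted by the perturbation mattered more."""
    return avg_log_likelihood(original) - avg_log_likelihood(perturbed)
```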
arXiv Detail & Related papers (2023-03-15T23:45:54Z)
- On the Blind Spots of Model-Based Evaluation Metrics for Text Generation [79.01422521024834]
We explore a useful but often neglected methodology for robustness analysis of text generation evaluation metrics.
We design and synthesize a wide range of potential errors and check whether they result in a commensurate drop in the metric scores.
Our experiments reveal interesting insensitivities, biases, or even loopholes in existing metrics.
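The paper's error taxonomy and analysis are far richer, but the basic protocol, injecting a synthetic error and checking whether a metric's score drops, can be phrased as a small harness like the one below; the duplication perturbation and the pass criterion are illustrative assumptions.

```python
# Tiny stress-test harness: apply a synthetic error to a text and check whether
# a metric's score drops. The perturbation (sentence duplication) and the pass
# criterion are illustrative, not the paper's taxonomy.
from typing import Callable

def duplicate_last_sentence(text: str) -> str:
    sents = text.split(". ")
    return ". ".join(sents + [sents[-1]])

def check_sensitivity(metric: Callable[[str], float], text: str,
                      perturb: Callable[[str], str] = duplicate_last_sentence) -> bool:
    """Return True if the metric penalizes the injected error at all."""
    return metric(perturb(text)) < metric(text)
```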
arXiv Detail & Related papers (2022-12-20T06:24:25Z)
- CTRLEval: An Unsupervised Reference-Free Metric for Evaluating Controlled Text Generation [85.03709740727867]
We propose an unsupervised reference-free metric called CTRLEval to evaluate controlled text generation models.
CTRLEval assembles the generation probabilities from a pre-trained language model without any model training.
Experimental results show that our metric has higher correlations with human judgments than other baselines.
arXiv Detail & Related papers (2022-04-02T13:42:49Z)
- Plot-guided Adversarial Example Construction for Evaluating Open-domain Story Generation [23.646133241521614]
Learnable evaluation metrics have promised more accurate assessments by having higher correlations with human judgments.
Previous works relied on heuristically manipulated plausible examples to mimic possible system drawbacks.
We propose to tackle these issues by generating a more comprehensive set of implausible stories using plots, which are structured representations of controllable factors used to generate stories.
arXiv Detail & Related papers (2021-04-12T20:19:24Z)
- STORIUM: A Dataset and Evaluation Platform for Machine-in-the-Loop Story Generation [48.56586847883825]
We introduce a dataset and evaluation platform built from STORIUM, an online collaborative storytelling community.
Our dataset contains 6K lengthy stories with fine-grained natural language annotations interspersed throughout each narrative.
We evaluate language models fine-tuned on our dataset by integrating them onto STORIUM, where real authors can query a model for suggested story continuations and then edit them.
arXiv Detail & Related papers (2020-10-04T23:26:09Z)