Plot-guided Adversarial Example Construction for Evaluating Open-domain
Story Generation
- URL: http://arxiv.org/abs/2104.05801v1
- Date: Mon, 12 Apr 2021 20:19:24 GMT
- Title: Plot-guided Adversarial Example Construction for Evaluating Open-domain
Story Generation
- Authors: Sarik Ghazarian, Zixi Liu, Akash SM, Ralph Weischedel, Aram Galstyan,
Nanyun Peng
- Abstract summary: Learnable evaluation metrics have promised more accurate assessments by having higher correlations with human judgments.
Previous works relied on heuristically manipulated plausible examples to mimic possible system drawbacks.
We propose to tackle these issues by generating a more comprehensive set of implausible stories using plots, which are structured representations of controllable factors used to generate stories.
- Score: 23.646133241521614
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the recent advances in open-domain story generation, the lack
of reliable automatic evaluation metrics has become an increasingly pressing
issue that hinders the fast development of story generation. Research in this
area suggests that learnable evaluation metrics promise more accurate
assessments by achieving higher correlations with human judgments. A critical
bottleneck in obtaining a reliable learnable evaluation metric is the lack of
high-quality training data for classifiers to efficiently distinguish between
plausible and implausible machine-generated stories. Previous works relied on
heuristically manipulated plausible examples to mimic possible system
drawbacks such as repetition, contradiction, or irrelevant content at the text
level, which can be unnatural and can oversimplify the characteristics of
implausible machine-generated stories. We propose to tackle these issues by
generating a more comprehensive set of implausible stories using plots, which
are structured representations of the controllable factors used to generate
stories. Since these plots are compact and structured, they are easier to
manipulate to generate text with targeted undesirable properties while
maintaining the grammatical correctness and naturalness of the generated
sentences. To improve the quality of the generated implausible stories, we
further apply the adversarial filtering procedure of Zellers et al. (2018) to
select a more nuanced set of implausible texts. Experiments show that
evaluation metrics trained on our generated data produce more reliable
automatic assessments that correlate remarkably better with human judgments
than the baselines.
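The two steps described above, manipulating structured plots to induce targeted flaws and then adversarially filtering the results, can be pictured with a minimal sketch. The plot schema, the manipulation modes, and the plausibility scorer below are illustrative assumptions, not the authors' exact implementation.

```python
import random
from typing import Callable, List

# A "plot" is assumed here to be a list of short event phrases that a
# plot-to-text generator would be conditioned on.
Plot = List[str]

def perturb_plot(plot: Plot, mode: str, distractors: List[str],
                 rng: random.Random) -> Plot:
    """Apply one targeted manipulation to a plot (illustrative modes only)."""
    plot = list(plot)
    if mode == "repetition":        # duplicate an event -> repetitive story
        i = rng.randrange(len(plot))
        plot.insert(i + 1, plot[i])
    elif mode == "reordering":      # shuffle events -> broken causal structure
        rng.shuffle(plot)
    elif mode == "substitution":    # inject an off-topic event -> irrelevance
        plot[rng.randrange(len(plot))] = rng.choice(distractors)
    return plot

def adversarial_filter(candidates: List[str],
                       plausibility: Callable[[str], float],
                       keep_ratio: float = 0.5) -> List[str]:
    """One round of adversarial filtering: keep the implausible stories that a
    current classifier still scores as most plausible (the hardest negatives)."""
    ranked = sorted(candidates, key=plausibility, reverse=True)
    return ranked[: max(1, int(len(ranked) * keep_ratio))]
```

In the full pipeline, each perturbed plot would be passed through a plot-to-text generator to obtain an implausible story, and the filtering round would be repeated with a retrained classifier, in the spirit of the adversarial filtering of Zellers et al. (2018).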
Related papers
- What Makes a Good Story and How Can We Measure It? A Comprehensive Survey of Story Evaluation [57.550045763103334] (arXiv 2024-08-26)
Evaluating a story can be more challenging than other generation evaluation tasks.
We first summarize existing storytelling tasks, including text-to-text, visual-to-text, and text-to-visual.
We propose a taxonomy to organize evaluation metrics that have been developed or can be adopted for story evaluation.
- Factually Consistent Summarization via Reinforcement Learning with Textual Entailment Feedback [57.816210168909286] (arXiv 2023-05-31)
We leverage recent progress on textual entailment models to address factual inconsistency in abstractive summarization systems.
We use reinforcement learning with reference-free textual entailment rewards to optimize for factual consistency.
Our results, according to both automatic metrics and human evaluation, show that our method considerably improves the faithfulness, salience, and conciseness of the generated summaries.
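As a rough illustration of the reward signal only (the entail_prob callable and the length penalty are assumptions, not the paper's exact reward design):

```python
from typing import Callable

def entailment_reward(source: str, summary: str,
                      entail_prob: Callable[[str, str], float],
                      length_penalty: float = 0.01) -> float:
    """Reference-free reward: how strongly the source entails the summary,
    minus a small penalty on length to discourage padding. `entail_prob`
    stands in for any off-the-shelf NLI model returning P(entailment)."""
    return entail_prob(source, summary) - length_penalty * len(summary.split())

# Toy usage with a dummy scorer; the reward would feed a policy-gradient update.
print(entailment_reward("The cat sat on the mat.", "A cat sat on a mat.",
                        lambda premise, hypothesis: 0.9))
```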
- Look-back Decoding for Open-Ended Text Generation [62.53302138266465] (arXiv 2023-05-22)
We propose Look-back, an improved decoding algorithm that tracks the distribution distance between current and historical decoding steps.
Look-back can automatically predict potential repetitive phrases and topic drift, and remove tokens that may cause these failure modes.
We perform decoding experiments on document continuation and story generation, and demonstrate that Look-back is able to generate more fluent and coherent text.
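A rough sketch of the underlying signal, the distance between the current next-token distribution and earlier ones; the KL distance, threshold, and flagging logic are illustrative assumptions rather than the paper's exact algorithm:

```python
import numpy as np

def kl_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-9) -> float:
    p, q = p + eps, q + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def lookback_flag(current_dist: np.ndarray, history: list,
                  threshold: float = 0.5) -> bool:
    """Flag a decoding step as a likely repetition/topic-drift failure when the
    current next-token distribution is too close to some earlier distribution."""
    if not history:
        return False
    min_kl = min(kl_divergence(current_dist, past) for past in history)
    return min_kl < threshold  # very similar to the past -> repetition risk

# A decoder using this flag would restrict or resample the next token
# instead of continuing greedily.
```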
- Evaluating Factual Consistency of Texts with Semantic Role Labeling [3.1776833268555134] (arXiv 2023-05-22)
We introduce SRLScore, a reference-free evaluation metric designed with text summarization in mind.
A final factuality score is computed by an adjustable scoring mechanism.
Correlation with human judgments on English summarization datasets shows that SRLScore is competitive with state-of-the-art methods.
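A minimal sketch of the general recipe, with predicate-argument tuples assumed to come from an SRL system upstream and arbitrary example weights standing in for the adjustable scoring mechanism:

```python
from typing import List, Tuple

# (agent, predicate, patient) tuples, e.g. produced by an SRL system upstream.
Fact = Tuple[str, str, str]

def fact_similarity(a: Fact, b: Fact, weights=(0.4, 0.2, 0.4)) -> float:
    """Weighted exact-match agreement between two tuples."""
    return sum(w for w, x, y in zip(weights, a, b) if x == y)

def srl_style_score(source_facts: List[Fact],
                    summary_facts: List[Fact]) -> float:
    """Average, over summary facts, of each fact's best match in the source."""
    if not source_facts or not summary_facts:
        return 0.0
    return sum(max(fact_similarity(f, s) for s in source_facts)
               for f in summary_facts) / len(summary_facts)

print(srl_style_score([("company", "announce", "merger")],
                      [("company", "announce", "layoffs")]))  # ~0.6: agent and predicate match
```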
- Large Language Models are Diverse Role-Players for Summarization Evaluation [82.31575622685902] (arXiv 2023-03-27)
A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal.
Most automatic evaluation methods, such as BLEU/ROUGE, may not be able to adequately capture the above dimensions.
We propose a new LLM-based framework that provides a comprehensive evaluation by comparing generated text and reference text from both objective and subjective aspects.
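A hedged sketch of the prompting pattern; the roles, prompt template, rating scale, and the ask_llm callable are all illustrative assumptions:

```python
from statistics import mean
from typing import Callable, Dict

ROLES = {
    "grammarian": "You focus on grammar and factual correctness.",
    "general reader": "You focus on informativeness, succinctness, and appeal.",
}

def role_player_eval(document: str, summary: str,
                     ask_llm: Callable[[str], float]) -> Dict[str, float]:
    """Score a summary from several role perspectives and average the results."""
    scores = {}
    for role, persona in ROLES.items():
        prompt = (f"{persona}\n\nDocument:\n{document}\n\nSummary:\n{summary}\n\n"
                  f"As a {role}, rate the summary from 1 (poor) to 5 (excellent). "
                  "Reply with a single number.")
        scores[role] = ask_llm(prompt)
    scores["overall"] = mean(scores.values())
    return scores
```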
- TRUE: Re-evaluating Factual Consistency Evaluation [29.888885917330327] (arXiv 2022-04-11)
We introduce TRUE: a comprehensive study of factual consistency metrics on a standardized collection of existing texts from diverse tasks.
Our standardization enables an example-level meta-evaluation protocol that is more actionable and interpretable than previously reported correlations.
Across diverse state-of-the-art metrics and 11 datasets, we find that large-scale NLI and question generation-and-answering-based approaches achieve strong and complementary results.
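The example-level meta-evaluation can be pictured as treating each metric as a binary consistency detector and reporting ROC AUC against human labels; a toy sketch with made-up scores and labels:

```python
from sklearn.metrics import roc_auc_score

# Toy per-example metric scores and binary human consistency labels.
metric_scores = [0.91, 0.12, 0.78, 0.33, 0.85, 0.05]
human_labels  = [1,    0,    1,    0,    1,    1]  # 1 = factually consistent

# Threshold-free, example-level view of how well the metric separates
# consistent from inconsistent examples.
print("ROC AUC:", roc_auc_score(human_labels, metric_scores))
```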
- A Benchmark Corpus for the Detection of Automatically Generated Text in Academic Publications [0.02578242050187029] (arXiv 2022-02-04)
This paper presents two datasets comprised of artificially generated research content.
In the first case, the content is completely generated by the GPT-2 model after a short prompt extracted from original papers.
The partial or hybrid dataset is created by replacing several sentences of abstracts with sentences that are generated by the Arxiv-NLP model.
We evaluate the quality of the datasets by comparing the generated texts to aligned original texts using fluency metrics such as BLEU and ROUGE.
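For example, a generated abstract can be compared against the aligned original with sentence-level BLEU via NLTK (the texts below are placeholders):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

original  = "we present a benchmark corpus for detecting generated text".split()
generated = "we introduce a benchmark dataset for detecting generated text".split()

# Smoothing avoids zero scores when higher-order n-grams have no overlap.
smooth = SmoothingFunction().method1
print("BLEU:", sentence_bleu([original], generated, smoothing_function=smooth))
```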
- Evaluating Factuality in Generation with Dependency-level Entailment [57.5316011554622] (arXiv 2020-10-12)
We propose a new formulation of entailment that decomposes it at the level of dependency arcs.
We show that our dependency arc entailment model trained on this data can identify factual inconsistencies in paraphrasing and summarization better than sentence-level methods.
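A rough sketch of the decomposition step only, using spaCy for parsing; the per-arc entailment check is replaced here by a naive lexical test rather than the paper's trained model:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def dependency_arcs(sentence: str):
    """Decompose a sentence into (head, relation, dependent) arcs."""
    return [(tok.head.text, tok.dep_, tok.text)
            for tok in nlp(sentence) if tok.dep_ != "ROOT"]

def arc_supported(arc, source: str) -> bool:
    """Naive stand-in for an arc-level entailment model: both words occur in the source."""
    head, _, dependent = arc
    return head.lower() in source.lower() and dependent.lower() in source.lower()

source = "The committee approved the new budget on Friday."
claim = "The committee rejected the budget."
print([(arc, arc_supported(arc, source)) for arc in dependency_arcs(claim)])
# Arcs headed by "rejected" are unsupported, flagging the factual inconsistency.
```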
- UNION: An Unreferenced Metric for Evaluating Open-ended Story Generation [92.42032403795879] (arXiv 2020-09-16)
UNION is a learnable unreferenced metric for evaluating open-ended story generation.
It is trained to distinguish human-written stories from negative samples and recover the perturbation in negative stories.
Experiments on two story datasets demonstrate that UNION is a reliable measure for evaluating the quality of generated stories.
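A minimal sketch of how such negative samples might be constructed from human-written stories; the specific perturbation operations are illustrative assumptions, not UNION's exact set:

```python
import random

def perturb_story(sentences, rng=None):
    """Build a negative sample by repeating, deleting, or reordering sentences,
    returning the perturbed story and the operation (so a model can be trained
    both to detect the perturbation and to recover from it)."""
    rng = rng or random.Random(0)
    sentences = list(sentences)
    op = rng.choice(["repeat", "delete", "shuffle"])
    if op == "repeat":
        i = rng.randrange(len(sentences))
        sentences.insert(i + 1, sentences[i])
    elif op == "delete" and len(sentences) > 1:
        sentences.pop(rng.randrange(len(sentences)))
    else:
        rng.shuffle(sentences)
    return sentences, op

story = ["Mia found a stray kitten.", "She took it home.",
         "It became her best friend."]
print(perturb_story(story))
```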