DeltaScore: Fine-Grained Story Evaluation with Perturbations
- URL: http://arxiv.org/abs/2303.08991v5
- Date: Thu, 2 Nov 2023 06:08:44 GMT
- Title: DeltaScore: Fine-Grained Story Evaluation with Perturbations
- Authors: Zhuohan Xie, Miao Li, Trevor Cohn and Jey Han Lau
- Abstract summary: We introduce DELTASCORE, a novel methodology that employs perturbation techniques for the evaluation of nuanced story aspects.
Our central proposition posits that the extent to which a story excels in a specific aspect (e.g., fluency) correlates with the magnitude of its susceptibility to particular perturbations.
We measure the quality of an aspect by calculating the likelihood difference between pre- and post-perturbation states using pre-trained language models.
- Score: 69.33536214124878
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Numerous evaluation metrics have been developed for natural language
generation tasks, but their effectiveness in evaluating stories is limited as
they are not specifically tailored to assess intricate aspects of storytelling,
such as fluency and interestingness. In this paper, we introduce DELTASCORE, a
novel methodology that employs perturbation techniques for the evaluation of
nuanced story aspects. Our central proposition posits that the extent to which
a story excels in a specific aspect (e.g., fluency) correlates with the
magnitude of its susceptibility to particular perturbations (e.g., the
introduction of typos). Given this, we measure the quality of an aspect by
calculating the likelihood difference between pre- and post-perturbation states
using pre-trained language models. We compare DELTASCORE with existing metrics
on storytelling datasets from two domains in five fine-grained story aspects:
fluency, coherence, relatedness, logicality, and interestingness. DELTASCORE
demonstrates remarkable performance, revealing a surprising finding that a
specific perturbation proves highly effective in capturing multiple aspects.
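To make the scoring procedure concrete, below is a minimal sketch under stated assumptions (not the authors' released code): it applies a character-level typo perturbation aimed at fluency and measures the resulting drop in log-likelihood under an off-the-shelf GPT-2. The model choice, the perturbation function, and the normalization convention are illustrative assumptions.

```python
# Minimal sketch of the DELTASCORE idea (not the authors' implementation):
# score a story aspect as the drop in log-likelihood under a pretrained LM
# after applying an aspect-targeted perturbation (here, character typos for
# fluency). Model choice, perturbation, and normalization are assumptions.
import random

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

MODEL_NAME = "gpt2"  # any causal LM could stand in here
tokenizer = GPT2TokenizerFast.from_pretrained(MODEL_NAME)
model = GPT2LMHeadModel.from_pretrained(MODEL_NAME)
model.eval()


def log_likelihood(text: str) -> float:
    """Total log-likelihood of `text` under the LM (summed over tokens)."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    # out.loss is the mean negative log-likelihood per predicted token;
    # multiply back by the number of predicted positions to get a sum.
    num_predicted = enc["input_ids"].shape[1] - 1
    return -out.loss.item() * num_predicted


def add_typos(text: str, rate: float = 0.15, seed: int = 0) -> str:
    """Toy fluency perturbation: swap adjacent characters in some words."""
    rng = random.Random(seed)
    words = text.split()
    for i, w in enumerate(words):
        if len(w) > 3 and rng.random() < rate:
            j = rng.randrange(len(w) - 1)
            words[i] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    return " ".join(words)


def delta_score(story: str) -> float:
    """Likelihood drop caused by the perturbation; a larger drop suggests
    the story scores higher on the targeted aspect (here, fluency)."""
    return log_likelihood(story) - log_likelihood(add_typos(story))


if __name__ == "__main__":
    story = "The old lighthouse keeper lit the lamp as the storm rolled in."
    print(f"Fluency DeltaScore (typo perturbation): {delta_score(story):.2f}")
```

In the paper's framing, a larger likelihood change under an aspect-targeted perturbation indicates a story that excels in that aspect; other perturbations could be designed to target aspects such as coherence or logicality.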
Related papers
- Generating Visual Stories with Grounded and Coreferent Characters [63.07511918366848]
We present the first model capable of predicting visual stories with consistently grounded and coreferent character mentions.
Our model is finetuned on a new dataset which we build on top of the widely used VIST benchmark.
We also propose new evaluation metrics to measure the richness of characters and coreference in stories.
arXiv Detail & Related papers (2024-09-20T14:56:33Z) - What Makes a Good Story and How Can We Measure It? A Comprehensive Survey of Story Evaluation [57.550045763103334]
Evaluating a story can be more challenging than other generation evaluation tasks.
We first summarize existing storytelling tasks, including text-to-text, visual-to-text, and text-to-visual.
We propose a taxonomy to organize evaluation metrics that have been developed or can be adopted for story evaluation.
arXiv Detail & Related papers (2024-08-26T20:35:42Z) - Not (yet) the whole story: Evaluating Visual Storytelling Requires More than Measuring Coherence, Grounding, and Repetition [8.058451580903123]
We introduce a novel method that measures story quality in terms of human likeness.
We then use this method to evaluate the stories generated by several models.
Upgrading the visual and language components of TAPM results in a model that yields competitive performance.
arXiv Detail & Related papers (2024-07-05T14:48:15Z) - Modeling Emotional Trajectories in Written Stories Utilizing Transformers and Weakly-Supervised Learning [47.02027575768659]
We introduce continuous valence and arousal labels for an existing dataset of children's stories originally annotated with discrete emotion categories.
To predict the resulting emotionality signals, we fine-tune a DeBERTa model and improve upon this baseline via a weakly supervised learning approach.
A detailed analysis shows the extent to which the results vary depending on factors such as the author, the individual story, or the section within the story.
arXiv Detail & Related papers (2024-06-04T12:17:16Z) - Narrative Action Evaluation with Prompt-Guided Multimodal Interaction [60.281405999483]
Narrative action evaluation (NAE) aims to generate professional commentary that evaluates the execution of an action.
NAE is a more challenging task because it requires both narrative flexibility and evaluation rigor.
We propose a prompt-guided multimodal interaction framework to facilitate the interaction between different modalities of information.
arXiv Detail & Related papers (2024-04-22T17:55:07Z) - RoViST: Learning Robust Metrics for Visual Storytelling [2.7124743347047033]
We propose three sets of evaluation metrics that analyse the aspects one would look for in a good story.
We measure the reliability of our metric sets by analysing their correlation with human judgement scores on a sample of machine-generated stories.
arXiv Detail & Related papers (2022-05-08T03:51:22Z) - A Temporal Variational Model for Story Generation [21.99104738567138]
Recent language models can generate interesting and grammatically correct text in story generation but often lack plot development and long-term coherence.
This paper experiments with a latent vector planning approach based on a TD-VAE (Temporal Difference Variational Autoencoder).
The results demonstrate strong performance in automatic cloze and swapping evaluations.
arXiv Detail & Related papers (2021-09-14T16:36:12Z) - UNION: An Unreferenced Metric for Evaluating Open-ended Story Generation [92.42032403795879]
UNION is a learnable unreferenced metric for evaluating open-ended story generation.
It is trained to distinguish human-written stories from negative samples and to recover the perturbations in negative stories (a simplified training sketch follows this list).
Experiments on two story datasets demonstrate that UNION is a reliable measure for evaluating the quality of generated stories.
arXiv Detail & Related papers (2020-09-16T11:01:46Z)
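Since the UNION entry above is a close precursor to perturbation-based story evaluation, here is a simplified, hypothetical sketch of that kind of learnable unreferenced metric: a classifier trained on human-written stories versus automatically perturbed negatives, whose "human-written" probability serves as the quality score. The encoder, the sentence-shuffling perturbation, and the training loop below are assumptions for illustration, not the UNION implementation.

```python
# Simplified sketch in the spirit of UNION (not the authors' code): train a
# classifier to separate human-written stories from perturbed negatives, then
# use its "human-written" probability as an unreferenced quality score.
import random

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

ENCODER = "bert-base-uncased"  # assumed backbone, chosen for illustration
tokenizer = AutoTokenizer.from_pretrained(ENCODER)
model = AutoModelForSequenceClassification.from_pretrained(ENCODER, num_labels=2)


def make_negative(story: str, seed: int = 0) -> str:
    """Toy negative sample: shuffle sentence order to break coherence."""
    rng = random.Random(seed)
    sentences = story.split(". ")
    rng.shuffle(sentences)
    return ". ".join(sentences)


def train_step(stories: list, optimizer: torch.optim.Optimizer) -> float:
    """One gradient step on paired (human story, perturbed negative) examples."""
    model.train()
    texts = stories + [make_negative(s, i) for i, s in enumerate(stories)]
    labels = torch.tensor([1] * len(stories) + [0] * len(stories))
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    out = model(**batch, labels=labels)
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()


def union_like_score(story: str) -> float:
    """Probability the classifier assigns to the 'human-written' class."""
    model.eval()
    batch = tokenizer(story, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**batch).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()


if __name__ == "__main__":
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
    loss = train_step(["A knight rode out at dawn. He reached the castle by dusk."],
                      optimizer)
    print(f"loss={loss:.3f}  score={union_like_score('A knight rode out at dawn.'):.3f}")
```

DELTASCORE differs from this learned-metric setup in that it needs no metric training: it reads the aspect score directly off a pretrained language model's likelihood change under perturbation.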