RoViST: Learning Robust Metrics for Visual Storytelling
- URL: http://arxiv.org/abs/2205.03774v1
- Date: Sun, 8 May 2022 03:51:22 GMT
- Title: RoViST: Learning Robust Metrics for Visual Storytelling
- Authors: Eileen Wang, Caren Han, Josiah Poon
- Abstract summary: We propose 3 evaluation metric sets that analyse the aspects we would look for in a good story.
We measure the reliability of our metric sets by analysing their correlation with human judgement scores on a sample of machine stories.
- Score: 2.7124743347047033
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual storytelling (VST) is the task of generating a story paragraph that
describes a given image sequence. Most existing storytelling approaches have
evaluated their models using traditional natural language generation metrics
like BLEU or CIDEr. However, such metrics based on n-gram matching tend to have
poor correlation with human evaluation scores and do not explicitly consider
other criteria necessary for storytelling such as sentence structure or topic
coherence. Moreover, a single score is not enough to assess a story as it does
not inform us about what specific errors were made by the model. In this paper,
we propose 3 evaluation metric sets that analyse the aspects we would look
for in a good story: 1) visual grounding, 2) coherence, and 3) non-redundancy.
We measure the reliability of our metric sets by analysing their correlation with
human judgement scores on a sample of machine stories obtained from 4
state-of-the-art models trained on the Visual Storytelling Dataset (VIST). Our
metric sets outperform other metrics on human correlation, and can serve
as a learning-based evaluation metric set that is complementary to existing
rule-based metrics.
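The reliability analysis described above boils down to correlating per-story metric scores with human judgement scores. The following is a minimal sketch of such a correlation check using SciPy; it is not the paper's implementation, and the score lists are hypothetical placeholders rather than values from the paper.

```python
# Hypothetical example: correlate an automatic metric's per-story scores with
# mean human ratings for the same stories (the core of the reliability analysis).
from scipy.stats import pearsonr, spearmanr

metric_scores = [0.72, 0.55, 0.81, 0.43, 0.66]  # e.g., a visual-grounding score per story
human_scores = [4.0, 3.5, 4.5, 2.5, 3.0]        # e.g., mean human judgement per story

pearson_r, pearson_p = pearsonr(metric_scores, human_scores)
spearman_rho, spearman_p = spearmanr(metric_scores, human_scores)

print(f"Pearson r = {pearson_r:.3f} (p = {pearson_p:.3f})")
print(f"Spearman rho = {spearman_rho:.3f} (p = {spearman_p:.3f})")
```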
Related papers
- Not (yet) the whole story: Evaluating Visual Storytelling Requires More than Measuring Coherence, Grounding, and Repetition [8.058451580903123]
We introduce a novel method that measures story quality in terms of human likeness.
We then use this method to evaluate the stories generated by several models.
Upgrading the visual and language components of TAPM results in a model that yields competitive performance.
arXiv Detail & Related papers (2024-07-05T14:48:15Z) - Cobra Effect in Reference-Free Image Captioning Metrics [58.438648377314436]
A proliferation of reference-free methods, leveraging visual-language pre-trained models (VLMs), has emerged.
In this paper, we study if there are any deficiencies in reference-free metrics.
We employ GPT-4V as an evaluative tool to assess generated sentences and the result reveals that our approach achieves state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2024-02-18T12:36:23Z) - JaSPICE: Automatic Evaluation Metric Using Predicate-Argument Structures
for Image Captioning Models [1.534667887016089]
We propose an automatic evaluation metric called JaSPICE, which evaluates Japanese captions based on scene graphs.
We conducted experiments employing 10 image captioning models trained on STAIR Captions and PFN-PIC and constructed the Shichimi dataset, which contains 103,170 human evaluations.
arXiv Detail & Related papers (2023-11-07T18:33:34Z) - DeltaScore: Fine-Grained Story Evaluation with Perturbations [69.33536214124878]
We introduce DELTASCORE, a novel methodology that employs perturbation techniques for the evaluation of nuanced story aspects.
Our central proposition is that the extent to which a story excels in a specific aspect (e.g., fluency) correlates with the magnitude of its susceptibility to particular perturbations.
We measure the quality of an aspect by calculating the likelihood difference between pre- and post-perturbation states using pre-trained language models (see the sketch after this list).
arXiv Detail & Related papers (2023-03-15T23:45:54Z) - On the Blind Spots of Model-Based Evaluation Metrics for Text Generation [79.01422521024834]
We explore a useful but often neglected methodology for robustness analysis of text generation evaluation metrics.
We design and synthesize a wide range of potential errors and check whether they result in a commensurate drop in the metric scores.
Our experiments reveal interesting insensitivities, biases, or even loopholes in existing metrics.
arXiv Detail & Related papers (2022-12-20T06:24:25Z) - SMART: Sentences as Basic Units for Text Evaluation [48.5999587529085]
In this paper, we introduce a new metric called SMART to mitigate such limitations.
We treat sentences as basic units of matching instead of tokens, and use a sentence matching function to soft-match candidate and reference sentences.
Our results show that the system-level correlations of our proposed metric with a model-based matching function outperform those of all competing metrics.
arXiv Detail & Related papers (2022-08-01T17:58:05Z) - On the Intrinsic and Extrinsic Fairness Evaluation Metrics for
Contextualized Language Representations [74.70957445600936]
Multiple metrics have been introduced to measure fairness in various natural language processing tasks.
These metrics can be roughly categorized into two categories: 1) extrinsic metrics for evaluating fairness in downstream applications and 2) intrinsic metrics for estimating fairness in upstream language representation models.
arXiv Detail & Related papers (2022-03-25T22:17:43Z) - UNION: An Unreferenced Metric for Evaluating Open-ended Story Generation [92.42032403795879]
UNION is a learnable unreferenced metric for evaluating open-ended story generation.
It is trained to distinguish human-written stories from negative samples and recover the perturbation in negative stories.
Experiments on two story datasets demonstrate that UNION is a reliable measure for evaluating the quality of generated stories.
arXiv Detail & Related papers (2020-09-16T11:01:46Z)
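As referenced in the DELTASCORE entry above, the likelihood-difference recipe can be illustrated with a pre-trained language model. The sketch below is only an illustration of that general idea, not the authors' implementation; the GPT-2 model, the example story, and the sentence-shuffling perturbation (targeting coherence) are hypothetical choices.

```python
# Sketch of a likelihood-difference score: perturb a story along one aspect
# (here, coherence via sentence reordering) and measure how much the
# language-model likelihood drops. Not the DELTASCORE implementation.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def log_likelihood(text: str) -> float:
    """Approximate total log-likelihood of `text` under the pre-trained LM."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean negative log-likelihood per predicted token
    return -loss.item() * (ids.size(1) - 1)

story = "They arrived at the beach. The kids built a sandcastle. Everyone watched the sunset."
sentences = story.rstrip(".").split(". ")
perturbed = ". ".join(reversed(sentences)) + "."  # coherence-targeted perturbation

delta = log_likelihood(story) - log_likelihood(perturbed)
print(f"Likelihood difference for the coherence perturbation: {delta:.2f}")
```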