Cut the CARP: Fishing for zero-shot story evaluation
- URL: http://arxiv.org/abs/2110.03111v2
- Date: Fri, 8 Oct 2021 17:27:35 GMT
- Title: Cut the CARP: Fishing for zero-shot story evaluation
- Authors: Shahbuland Matiana, JR Smith, Ryan Teehan, Louis Castricato, Stella
Biderman, Leo Gao, Spencer Frazier
- Abstract summary: Contrastive Authoring and Reviewing Pairing (CARP) is a scalable, efficient method for qualitatively superior, zero-shot evaluation of stories.
We show a strong correlation between human evaluation of stories and those of CARP.
We also present and analyze the Story-Critique Dataset, a new corpus of 1.3 million aligned story-critique pairs derived from over 80,000 stories.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in large-scale language models (Raffel et al., 2019; Brown et
al., 2020) have brought significant qualitative and quantitative improvements
in machine-driven text generation. Despite this, generation and evaluation of
machine-generated narrative text remains a challenging problem. Objective
evaluation of computationally-generated stories may be prohibitively expensive,
require meticulously annotated datasets, or may not adequately measure the
logical coherence of a generated story's narratological structure.
Informed by recent advances in contrastive learning (Radford et al., 2021),
we present Contrastive Authoring and Reviewing Pairing (CARP): a scalable,
efficient method for performing qualitatively superior, zero-shot evaluation of
stories. We show a strong correlation between human evaluation of stories and
those of CARP. CARP's outputs correlate more strongly with the corresponding
human judgments than those of language-model-based methods that rely on
finetuning or prompt engineering. We also present and analyze the Story-Critique
Dataset, a new corpus of 1.3 million aligned story-critique pairs
derived from over 80,000 stories. We expect this corpus to be of interest to
NLP researchers.
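Given the abstract's framing of CARP as a contrastive pairing of stories with their reviews, informed by CLIP-style contrastive learning (Radford et al., 2021), the sketch below illustrates what such a symmetric contrastive objective might look like. It is a minimal illustration under stated assumptions: the function name, encoder outputs, batch layout, and temperature are hypothetical, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_story_critique_loss(story_emb, critique_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of aligned story/critique
    embeddings (a hypothetical sketch, not the authors' released code).

    story_emb, critique_emb: (batch, dim) tensors from two text encoders,
    where row i of each tensor comes from the same story-critique pair.
    """
    # Normalize so dot products are cosine similarities.
    story_emb = F.normalize(story_emb, dim=-1)
    critique_emb = F.normalize(critique_emb, dim=-1)

    # Pairwise similarities; the diagonal holds the true (aligned) pairs.
    logits = story_emb @ critique_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: story->critique and critique->story.
    loss_s = F.cross_entropy(logits, targets)
    loss_c = F.cross_entropy(logits.t(), targets)
    return (loss_s + loss_c) / 2
```

Under this reading, zero-shot evaluation would amount to embedding a candidate story and a critique-style statement (e.g. "The plot of this story is coherent.") and reading off their cosine similarity, with higher similarity indicating the critique applies.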
Related papers
- What Makes a Good Story and How Can We Measure It? A Comprehensive Survey of Story Evaluation [57.550045763103334]
Evaluating a story can be more challenging than other generation evaluation tasks.
We first summarize existing storytelling tasks, including text-to-text, visual-to-text, and text-to-visual.
We propose a taxonomy to organize evaluation metrics that have been developed or can be adopted for story evaluation.
arXiv Detail & Related papers (2024-08-26T20:35:42Z)
- Not (yet) the whole story: Evaluating Visual Storytelling Requires More than Measuring Coherence, Grounding, and Repetition [8.058451580903123]
We introduce a novel method that measures story quality in terms of human likeness.
We then use this method to evaluate the stories generated by several models.
Upgrading the visual and language components of TAPM results in a model that yields competitive performance.
arXiv Detail & Related papers (2024-07-05T14:48:15Z)
- Long-Span Question-Answering: Automatic Question Generation and QA-System Ranking via Side-by-Side Evaluation [65.16137964758612]
We explore the use of long-context capabilities in large language models to create synthetic reading comprehension data from entire books.
Our objective is to test the capabilities of LLMs to analyze, understand, and reason over problems that require a detailed comprehension of long spans of text.
arXiv Detail & Related papers (2024-05-31T20:15:10Z)
- StoryAnalogy: Deriving Story-level Analogies from Large Language Models to Unlock Analogical Understanding [72.38872974837462]
We evaluate the ability to identify and generate analogies by constructing a first-of-its-kind large-scale story-level analogy corpus.
StoryAnalogy contains 24K story pairs from diverse domains with human annotations on two similarities from the extended Structure-Mapping Theory.
We observe that the data in StoryAnalogy can improve the quality of analogy generation in large language models.
arXiv Detail & Related papers (2023-10-19T16:29:23Z)
- FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation [176.56131810249602]
Evaluating the factuality of long-form text generated by large language models (LMs) is non-trivial.
We introduce FACTSCORE, a new evaluation that breaks a generation into a series of atomic facts and computes the percentage of atomic facts supported by a reliable knowledge source.
arXiv Detail & Related papers (2023-05-23T17:06:00Z)
- Large Language Models are Diverse Role-Players for Summarization Evaluation [82.31575622685902]
A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal.
Most automatic evaluation methods such as BLEU/ROUGE may not be able to adequately capture the above dimensions.
We propose a new LLM-based evaluation framework that compares generated text and reference text from both objective and subjective aspects.
arXiv Detail & Related papers (2023-03-27T10:40:59Z)
- MOCHA: A Multi-Task Training Approach for Coherent Text Generation from Cognitive Perspective [22.69509556890676]
We propose a novel multi-task training strategy for coherent text generation grounded on the cognitive theory of writing.
We extensively evaluate our model on three open-ended generation tasks including story generation, news article writing and argument generation.
arXiv Detail & Related papers (2022-10-26T11:55:41Z)
- TopNet: Learning from Neural Topic Model to Generate Long Stories [43.5564336855688]
Long story generation (LSG) is one of the coveted goals in natural language processing.
We propose TopNet to obtain high-quality skeleton words to complement the short input.
Our proposed framework is highly effective in skeleton word selection and significantly outperforms state-of-the-art models in both automatic evaluation and human evaluation.
arXiv Detail & Related papers (2021-12-14T09:47:53Z)
- STORIUM: A Dataset and Evaluation Platform for Machine-in-the-Loop Story Generation [48.56586847883825]
We introduce a dataset and evaluation platform built from STORIUM, an online collaborative storytelling community.
Our dataset contains 6K lengthy stories with fine-grained natural language annotations interspersed throughout each narrative.
We evaluate language models fine-tuned on our dataset by integrating them onto STORIUM, where real authors can query a model for suggested story continuations and then edit them.
arXiv Detail & Related papers (2020-10-04T23:26:09Z)