CoKe: Customizable Fine-Grained Story Evaluation via Chain-of-Keyword Rationalization
- URL: http://arxiv.org/abs/2503.17136v1
- Date: Fri, 21 Mar 2025 13:37:46 GMT
- Title: CoKe: Customizable Fine-Grained Story Evaluation via Chain-of-Keyword Rationalization
- Authors: Brihi Joshi, Sriram Venkatapathy, Mohit Bansal, Nanyun Peng, Haw-Shiuan Chang
- Abstract summary: Chain of thought (CoT) generates free-text explanations that help guide a model's predictions. Self-Consistency (SC) marginalizes predictions over multiple generated explanations. We propose $\textbf{C}$hain-$\textbf{o}$f-$\textbf{Ke}$ywords (CoKe).
- Score: 90.15027447565427
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Evaluating creative text such as human-written stories using language models has always been a challenging task -- owing to the subjectivity of multi-annotator ratings. To mimic the thinking process of humans, chain of thought (CoT) generates free-text explanations that help guide a model's predictions, and Self-Consistency (SC) marginalizes predictions over multiple generated explanations. In this study, we discover that the widely-used self-consistency reasoning methods cause suboptimal results due to an objective mismatch between generating 'fluent-looking' explanations and actually leading to a good rating prediction for an aspect of a story. To overcome this challenge, we propose $\textbf{C}$hain-$\textbf{o}$f-$\textbf{Ke}$ywords (CoKe), which generates a sequence of keywords $\textit{before}$ the free-text rationale; these keywords guide the rating prediction of our evaluation language model. We then generate a diverse set of such keywords and aggregate the scores corresponding to these generations. On the StoryER dataset, CoKe, based on our small fine-tuned evaluation models, not only reaches human-level performance and significantly outperforms GPT-4, with a 2x boost in correlation with human annotators, but also requires drastically fewer parameters.
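The keyword-first decoding and score aggregation can be pictured with a short sketch. This is a minimal illustration under stated assumptions, not the paper's implementation: the prompt format, the 1-5 rating scale, and the `generate` wrapper around a language model are all hypothetical.

```python
import re
import statistics
from typing import Callable

def coke_score(
    story: str,
    aspect: str,
    generate: Callable[[str, float], str],  # hypothetical LM wrapper: (prompt, temperature) -> text
    num_samples: int = 8,
) -> float:
    """Chain-of-Keywords evaluation sketch: sample keyword-first rationales,
    then aggregate the parsed ratings across the diverse samples."""
    prompt = (
        f"Story:\n{story}\n\n"
        f"Evaluate the aspect '{aspect}' on a 1-5 scale.\n"
        "First list the keywords that drive your rating, then give a short "
        "rationale, then the score as 'Score: <n>'.\n"
        "Keywords:"
    )
    scores = []
    for _ in range(num_samples):
        # Temperature > 0 yields a *diverse* set of keyword chains.
        output = generate(prompt, 0.9)
        match = re.search(r"Score:\s*([1-5])", output)
        if match:
            scores.append(int(match.group(1)))
    # Aggregate over the sampled generations, as in self-consistency.
    return statistics.mean(scores) if scores else float("nan")
```

Averaging the parsed scores across diverse keyword chains plays the role of Self-Consistency's marginalization, with the diversity now driven by the sampled keyword sequences rather than by full free-text explanations alone.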
Related papers
- NeKo: Toward Post Recognition Generative Correction Large Language Models with Task-Oriented Experts [57.53692236201343]
We propose a Multi-Task Correction MoE, where we train the experts to become an "expert" of speech-to-text, language-to-text and vision-to-text datasets.
NeKo performs competitively on grammar and post-OCR correction as a multi-task model.
arXiv Detail & Related papers (2024-11-08T20:11:24Z) - Long-Span Question-Answering: Automatic Question Generation and QA-System Ranking via Side-by-Side Evaluation [65.16137964758612]
We explore the use of long-context capabilities in large language models to create synthetic reading comprehension data from entire books.
Our objective is to test the capabilities of LLMs to analyze, understand, and reason over problems that require a detailed comprehension of long spans of text.
arXiv Detail & Related papers (2024-05-31T20:15:10Z) - Hierarchical Indexing for Retrieval-Augmented Opinion Summarization [60.5923941324953]
We propose a method for unsupervised abstractive opinion summarization that combines the attributability and scalability of extractive approaches with the coherence and fluency of Large Language Models (LLMs).
Our method, HIRO, learns an index structure that maps sentences to a path through a semantically organized discrete hierarchy.
At inference time, we populate the index and use it to identify and retrieve clusters of sentences containing popular opinions from input reviews.
arXiv Detail & Related papers (2024-03-01T10:38:07Z) - CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation [87.44350003888646]
Eval-Instruct can acquire pointwise grading critiques with pseudo references and revise these critiques via multi-path prompting.
CritiqueLLM is empirically shown to outperform ChatGPT and all the open-source baselines.
arXiv Detail & Related papers (2023-11-30T16:52:42Z) - Concept-Guided Chain-of-Thought Prompting for Pairwise Comparison Scoring of Texts with Large Language Models [3.656114607436271]
Existing text scoring methods require a large corpus, struggle with short texts, or require hand-labeled data. We develop a text scoring framework that leverages generative large language models (LLMs). We apply this approach to better understand speech reflecting aversion to specific political parties on Twitter.
arXiv Detail & Related papers (2023-10-18T15:34:37Z) - Automatic Counterfactual Augmentation for Robust Text Classification Based on Word-Group Search [12.894936637198471]
In general, a keyword is regarded as a shortcut if it creates a superficial association with the label, resulting in a false prediction.
We propose a new Word-Group mining approach, which captures the causal effect of any keyword combination and orders the combinations that most affect the prediction.
Our approach is based on efficient post-hoc analysis and beam search, which preserves mining effectiveness while reducing complexity; a minimal beam-search sketch appears after this list.
arXiv Detail & Related papers (2023-07-01T02:26:34Z) - Large Language Models are Diverse Role-Players for Summarization Evaluation [82.31575622685902]
A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal.
Most automatic evaluation methods, such as BLEU/ROUGE, may not be able to adequately capture these dimensions.
We propose a new evaluation framework based on LLMs, which compares generated text and reference text from both objective and subjective aspects for a comprehensive assessment.
arXiv Detail & Related papers (2023-03-27T10:40:59Z) - Towards Document-Level Paraphrase Generation with Sentence Rewriting and Reordering [88.08581016329398]
We propose CoRPG (Coherence Relationship guided Paraphrase Generation) for document-level paraphrase generation.
We use a graph GRU to encode the coherence relationship graph and obtain a coherence-aware representation for each sentence.
Our model can generate document paraphrases with greater diversity and semantic preservation.
arXiv Detail & Related papers (2021-09-15T05:53:40Z) - Keyphrase Generation with Fine-Grained Evaluation-Guided Reinforcement Learning [30.09715149060206]
Keyphrase Generation (KG) is a classical task for capturing the central idea from a given document.
In this paper, we propose a new fine-grained evaluation metric that considers different granularity.
To learn more latent linguistic patterns, we use a pre-trained model (e.g., BERT) to compute a continuous similarity score between predicted and target keyphrases; a minimal similarity sketch appears after this list.
arXiv Detail & Related papers (2021-04-18T10:13:46Z) - Plot-guided Adversarial Example Construction for Evaluating Open-domain Story Generation [23.646133241521614]
Learnable evaluation metrics have promised more accurate assessments by having higher correlations with human judgments.
Previous works relied on heuristically manipulated plausible examples to mimic possible system drawbacks.
We propose to tackle these issues by generating a more comprehensive set of implausible stories using plots, which are structured representations of controllable factors used to generate stories.
arXiv Detail & Related papers (2021-04-12T20:19:24Z)
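As referenced above, here is a minimal sketch of the Word-Group mining step. It assumes only a black-box `label_prob` function returning the classifier's confidence in its predicted label; the drop-in-confidence effect measure and the greedy group-growing strategy are illustrative assumptions, not the paper's exact causal-effect estimator.

```python
from typing import Callable, List, Tuple

def word_group_search(
    words: List[str],
    label_prob: Callable[[List[str]], float],
    beam_width: int = 5,
    max_group_size: int = 3,
) -> List[Tuple[Tuple[int, ...], float]]:
    """Beam search for word groups whose removal most reduces the
    classifier's confidence in its prediction (a post-hoc proxy for
    the group's effect on the prediction)."""
    base = label_prob(words)

    def effect(group: Tuple[int, ...]) -> float:
        # Effect of a group = confidence drop when its words are masked out.
        masked = [w for i, w in enumerate(words) if i not in group]
        return base - label_prob(masked)

    # Initialize the beam with the highest-effect single words.
    beam = sorted(
        (((i,), effect((i,))) for i in range(len(words))),
        key=lambda x: -x[1],
    )[:beam_width]
    # Grow the best groups one word at a time, keeping the top candidates.
    for _ in range(max_group_size - 1):
        candidates = dict(beam)
        for group, _ in beam:
            for i in range(len(words)):
                if i not in group:
                    new = tuple(sorted(group + (i,)))
                    if new not in candidates:
                        candidates[new] = effect(new)
        beam = sorted(candidates.items(), key=lambda x: -x[1])[:beam_width]
    # Word groups ordered by how strongly they drive the prediction.
    return beam
```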
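Similarly, for the fine-grained keyphrase evaluation paper above, a sketch of a continuous similarity score between predicted and target keyphrases. The paper uses a pre-trained model such as BERT; a sentence-transformers encoder stands in here, and the best-match-then-average aggregation is an assumption of this sketch.

```python
from typing import List

import numpy as np
from sentence_transformers import SentenceTransformer

# Stand-in encoder for the pre-trained model (e.g., BERT) in the paper.
model = SentenceTransformer("all-MiniLM-L6-v2")

def continuous_keyphrase_score(predicted: List[str], target: List[str]) -> float:
    """Soft reward: credit each predicted keyphrase with its best cosine
    similarity to any target keyphrase, instead of exact string match."""
    pred_emb = model.encode(predicted, normalize_embeddings=True)
    targ_emb = model.encode(target, normalize_embeddings=True)
    sim = pred_emb @ targ_emb.T              # pairwise cosine similarities
    return float(np.mean(sim.max(axis=1)))   # average of best matches
```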
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.