Help Me Write a Story: Evaluating LLMs' Ability to Generate Writing Feedback
- URL: http://arxiv.org/abs/2507.16007v1
- Date: Mon, 21 Jul 2025 18:56:50 GMT
- Title: Help Me Write a Story: Evaluating LLMs' Ability to Generate Writing Feedback
- Authors: Hannah Rashkin, Elizabeth Clark, Fantine Huot, Mirella Lapata
- Abstract summary: We present a novel test set of 1,300 stories that we corrupted to intentionally introduce writing issues. We study the performance of commonly used LLMs in this task with both automatic and human evaluation metrics.
- Score: 57.200668979963694
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Can LLMs provide support to creative writers by giving meaningful writing feedback? In this paper, we explore the challenges and limitations of model-generated writing feedback by defining a new task, dataset, and evaluation frameworks. To study model performance in a controlled manner, we present a novel test set of 1,300 stories that we corrupted to intentionally introduce writing issues. We study the performance of commonly used LLMs in this task with both automatic and human evaluation metrics. Our analysis shows that current models have strong out-of-the-box behavior in many respects -- providing specific and mostly accurate writing feedback. However, models often fail to identify the biggest writing issue in the story and to correctly decide when to offer critical vs. positive feedback.
Related papers
- "I Wrote, I Paused, I Rewrote" Teaching LLMs to Read Between the Lines of Student Writing [0.0]
Large language models like Gemini are becoming common tools for supporting student writing. Most of their feedback is based only on the final essay, missing important context about how that text was written. We built a digital writing tool that captures both what students type and how their essays evolve over time.
arXiv Detail & Related papers (2025-06-09T20:42:02Z) - Do LLMs Understand Why We Write Diaries? A Method for Purpose Extraction and Clustering [41.94295877935867]
This study introduces a novel method based on Large Language Models (LLMs) to identify and cluster the various purposes of diary writing. Our approach is applied to Soviet-era diaries (1922-1929) from the Prozhito digital archive.
arXiv Detail & Related papers (2025-06-01T12:38:01Z) - Large Language Models for Automated Literature Review: An Evaluation of Reference Generation, Abstract Writing, and Review Composition [2.048226951354646]
Large language models (LLMs) have emerged as a potential solution to automate the complex processes involved in writing literature reviews. This study introduces a framework to automatically evaluate the performance of LLMs in three key tasks of literature writing.
arXiv Detail & Related papers (2024-12-18T08:42:25Z) - LLMs Assist NLP Researchers: Critique Paper (Meta-)Reviewing [106.45895712717612]
Large language models (LLMs) have shown remarkable versatility in various generative tasks.
This study focuses on how LLMs can assist NLP researchers.
To our knowledge, this is the first work to provide such a comprehensive analysis.
arXiv Detail & Related papers (2024-06-24T01:30:22Z) - LFED: A Literary Fiction Evaluation Dataset for Large Language Models [58.85989777743013]
We collect 95 literary fictions that are either originally written in Chinese or translated into Chinese, covering a wide range of topics across several centuries.
We define a question taxonomy with 8 question categories to guide the creation of 1,304 questions.
We conduct an in-depth analysis to ascertain how specific attributes of literary fictions (e.g., novel types, character numbers, the year of publication) impact LLM performance in evaluations.
arXiv Detail & Related papers (2024-05-16T15:02:24Z) - KIWI: A Dataset of Knowledge-Intensive Writing Instructions for Answering Research Questions [63.307317584926146]
Large language models (LLMs) adapted to follow user instructions are now widely deployed as conversational agents.
In this work, we examine one increasingly common instruction-following task: providing writing assistance to compose a long-form answer.
We construct KIWI, a dataset of knowledge-intensive writing instructions in the scientific domain.
arXiv Detail & Related papers (2024-03-06T17:16:44Z) - Reading Subtext: Evaluating Large Language Models on Short Story Summarization with Writers [25.268709339109893]
We evaluate recent Large Language Models (LLMs) on the challenging task of summarizing short stories.
We work directly with authors to ensure that the stories have not been shared online (and therefore are unseen by the models).
We compare GPT-4, Claude-2.1, and Llama-2-70B and find that all three models make faithfulness mistakes in over 50% of summaries.
arXiv Detail & Related papers (2024-03-02T01:52:14Z) - Critique Ability of Large Language Models [38.34144195927209]
This study explores the ability of large language models (LLMs) to deliver accurate critiques across various tasks.
We develop a benchmark called CriticBench, which comprises 3K high-quality natural language queries and corresponding model responses.
arXiv Detail & Related papers (2023-10-07T14:12:15Z) - Art or Artifice? Large Language Models and the False Promise of Creativity [53.04834589006685]
We propose the Torrance Test of Creative Writing (TTCW) to evaluate creativity as a product.
TTCW consists of 14 binary tests organized into the original dimensions of Fluency, Flexibility, Originality, and Elaboration.
Our analysis shows that LLM-generated stories pass 3-10X fewer TTCW tests than stories written by professionals.
arXiv Detail & Related papers (2023-09-25T22:02:46Z) - Benchmarking Large Language Models for News Summarization [79.37850439866938]
Large language models (LLMs) have shown promise for automatic summarization, but the reasons behind their successes are poorly understood.
We find that instruction tuning, not model size, is the key to LLMs' zero-shot summarization capability.
arXiv Detail & Related papers (2023-01-31T18:46:19Z) - Self-critiquing models for assisting human evaluators [11.1006983438712]
We fine-tune large language models to write natural language critiques (natural language critical comments) using behavioral cloning.
On a topic-based summarization task, critiques written by our models help humans find flaws in summaries that they would have otherwise missed.
Larger models write more helpful critiques and, on most tasks, are better at self-critiquing, despite having harder-to-critique outputs.
arXiv Detail & Related papers (2022-06-12T17:40:53Z)