EssayBench: Evaluating Large Language Models in Multi-Genre Chinese Essay Writing
- URL: http://arxiv.org/abs/2506.02596v1
- Date: Tue, 03 Jun 2025 08:14:46 GMT
- Title: EssayBench: Evaluating Large Language Models in Multi-Genre Chinese Essay Writing
- Authors: Fan Gao, Dongyuan Li, Ding Xia, Fei Mi, Yasheng Wang, Lifeng Shang, Baojun Wang,
- Abstract summary: EssayBench is a multi-genre benchmark specifically designed for Chinese essay writing across four major genres: Argumentative, Narrative, Descriptive, and Expository. We develop a fine-grained, genre-specific scoring framework that hierarchically aggregates scores. We benchmark 15 large-sized LLMs, analyzing their strengths and limitations across genres and instruction types.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Chinese essay writing and its evaluation are critical in educational contexts, yet the capabilities of Large Language Models (LLMs) in this domain remain largely underexplored. Existing benchmarks often rely on coarse-grained text quality metrics, largely overlooking the structural and rhetorical complexities of Chinese essays, particularly across diverse genres. To address this gap, we propose EssayBench, a multi-genre benchmark specifically designed for Chinese essay writing across four major genres: Argumentative, Narrative, Descriptive, and Expository. We curate and refine a total of 728 real-world prompts to ensure authenticity and meticulously categorize them into the *Open-Ended* and *Constrained* sets to capture diverse writing scenarios. To reliably evaluate generated essays, we develop a fine-grained, genre-specific scoring framework that hierarchically aggregates scores. We further validate our evaluation protocol through a comprehensive human agreement study. Finally, we benchmark 15 large-sized LLMs, analyzing their strengths and limitations across genres and instruction types. With EssayBench, we aim to advance LLM-based Chinese essay evaluation and inspire future research on improving essay generation in educational settings.
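To make the hierarchical aggregation concrete, here is a minimal Python sketch of genre-specific scoring in the spirit of the abstract. The genre names come from the paper; the dimensions, sub-criteria, and weights are hypothetical placeholders, not EssayBench's actual rubric.

```python
# Minimal sketch of hierarchical, genre-specific score aggregation.
# Genre names follow the paper; the dimensions, weights, and sub-criterion
# scores below are invented placeholders, not EssayBench's actual rubric.

RUBRICS = {
    "Argumentative": {"thesis_clarity": 0.4, "evidence": 0.35, "language": 0.25},
    "Narrative": {"plot_coherence": 0.4, "vividness": 0.35, "language": 0.25},
}

def dimension_score(sub_scores: list[float]) -> float:
    """Level 1: average the sub-criterion scores within one dimension."""
    return sum(sub_scores) / len(sub_scores)

def essay_score(genre: str, sub_scores_by_dim: dict[str, list[float]]) -> float:
    """Level 2: weighted sum of dimension scores under the genre's rubric."""
    weights = RUBRICS[genre]
    return sum(weights[dim] * dimension_score(scores)
               for dim, scores in sub_scores_by_dim.items())

# Example: sub-criterion judgments (e.g., from an LLM judge on a 1-5 scale).
print(essay_score("Argumentative", {
    "thesis_clarity": [4, 5],
    "evidence": [3, 4],
    "language": [4, 4],
}))
```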
Related papers
- Towards Multi-dimensional Evaluation of LLM Summarization across Domains and Languages [17.028968054304947]
MSumBench is a multi-dimensional, multi-domain evaluation of summarization in English and Chinese. By evaluating eight modern summarization models, we discover distinct performance patterns across domains and languages.
arXiv Detail & Related papers (2025-05-31T13:12:35Z)
- Fùxì: A Benchmark for Evaluating Language Models on Ancient Chinese Text Understanding and Generation [20.87296508045343]
We introduce Fùxì, a comprehensive benchmark that evaluates both understanding and generation capabilities across 21 diverse tasks. We reveal significant performance gaps between understanding and generation, with models achieving promising results in comprehension but struggling considerably in generation. Our findings highlight the current limitations in ancient Chinese text processing and provide insights for future model development.
arXiv Detail & Related papers (2025-03-20T04:26:40Z)
- WritingBench: A Comprehensive Benchmark for Generative Writing [87.48445972563631]
We present WritingBench, a benchmark designed to evaluate large language models (LLMs) across 6 core writing domains and 100 subdomains, encompassing creative, persuasive, informative, and technical writing. We propose a query-dependent evaluation framework that empowers LLMs to dynamically generate instance-specific assessment criteria. This framework is complemented by a fine-tuned critic model for criteria-aware scoring, enabling evaluations of style, format, and length.
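As a rough illustration of what a query-dependent evaluation loop might look like, here is a hedged Python sketch: a judge LLM first drafts criteria for the specific query, then scores the response against each one. The `call_llm` helper and the prompt wording are hypothetical stand-ins, not WritingBench's actual pipeline or critic model.

```python
# Hypothetical sketch of query-dependent evaluation: derive criteria from
# the query itself, then score against each. Not WritingBench's actual code.
import json

def call_llm(prompt: str) -> str:
    """Stand-in for any chat-completion client; plug in a real one."""
    raise NotImplementedError

def evaluate(query: str, response: str, n_criteria: int = 5) -> float:
    # Step 1: generate instance-specific assessment criteria for this query.
    criteria = json.loads(call_llm(
        f"Return a JSON array of {n_criteria} evaluation criteria for the "
        f"following writing task, covering style, format, and length:\n{query}"
    ))
    # Step 2: score the response 1-10 against each criterion and average.
    scores = [
        int(call_llm(
            f"Criterion: {c}\nTask: {query}\nResponse: {response}\n"
            "Return only an integer score from 1 to 10."
        ))
        for c in criteria
    ]
    return sum(scores) / len(scores)
```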
arXiv Detail & Related papers (2025-03-07T08:56:20Z)
- Beyond Profile: From Surface-Level Facts to Deep Persona Simulation in LLMs [50.0874045899661]
We introduce CharacterBot, a model designed to replicate both the linguistic patterns and the distinctive thought patterns manifested in a character's textual works. Using Lu Xun, a renowned Chinese writer, as a case study, we propose four training tasks derived from his 17 essay collections. These include a pre-training task focused on mastering external linguistic structures and knowledge, as well as three fine-tuning tasks. We evaluate CharacterBot on three tasks for linguistic accuracy and opinion comprehension, demonstrating that it significantly outperforms the baselines on our adapted metrics.
arXiv Detail & Related papers (2025-02-18T16:11:54Z)
- Can Language Models Evaluate Human Written Text? Case Study on Korean Student Writing for Education [1.6340559025561785]
Large language model (LLM)-based evaluation pipelines have demonstrated their capability to robustly evaluate machine-generated text.
We investigate whether LLMs can effectively assess human-written text for educational purposes.
arXiv Detail & Related papers (2024-07-24T06:02:57Z)
- A School Student Essay Corpus for Analyzing Interactions of Argumentative Structure and Quality [12.187586364960758]
We present a German corpus of 1,320 essays from school students of two age groups.
Each essay has been manually annotated for argumentative structure and quality on multiple levels of granularity.
We propose baseline approaches to argument mining and essay scoring, and we analyze interactions between both tasks.
arXiv Detail & Related papers (2024-04-03T07:31:53Z)
- Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language Modelling [70.23876429382969]
We propose a benchmark that can evaluate intra-sentence discourse properties across a diverse set of NLP tasks.
Disco-Bench consists of 9 document-level testsets in the literature domain, which contain rich discourse phenomena.
For linguistic analysis, we also design a diagnostic test suite that can examine whether the target models learn discourse knowledge.
arXiv Detail & Related papers (2023-07-16T15:18:25Z)
- Advancing Topic Segmentation and Outline Generation in Chinese Texts: The Paragraph-level Topic Representation, Corpus, and Benchmark [44.06803331843307]
Paragraph-level topic structure makes it possible to grasp the overall context of a document from a higher level.
The lack of large-scale, high-quality Chinese paragraph-level topic structure corpora has restrained research and applications.
We propose a hierarchical paragraph-level topic structure representation with three layers to guide corpus construction.
We employ a two-stage human-machine collaborative annotation method to construct the largest Chinese paragraph-level Topic Structure corpus.
arXiv Detail & Related papers (2023-05-24T06:43:23Z)
- OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models [122.27878464009181]
We conducted a comprehensive evaluation of Large Multimodal Models, such as GPT4V and Gemini, in various text-related visual tasks.
OCRBench contains 29 datasets, making it the most comprehensive OCR evaluation benchmark available.
arXiv Detail & Related papers (2023-05-13T11:28:37Z)
- Large Language Models are Diverse Role-Players for Summarization Evaluation [82.31575622685902]
A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal.
Most automatic evaluation methods, like BLEU/ROUGE, may not adequately capture these dimensions.
We propose a new LLM-based framework that provides comprehensive evaluation by comparing generated text with reference text from both objective and subjective aspects.
arXiv Detail & Related papers (2023-03-27T10:40:59Z)
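To illustrate the general idea of role-based LLM evaluation, here is a minimal sketch assuming a generic `call_llm` chat client; the role personas and prompt wording are invented for illustration and are not the paper's actual protocol.

```python
# Minimal sketch of role-based summary evaluation: one judge persona per
# aspect. Personas and prompts are illustrative, not the paper's protocol.
from typing import Callable

ROLES = {
    "objective": "You are a strict reviewer of grammar and factual correctness.",
    "subjective": "You are a general reader judging informativeness, "
                  "succinctness, and appeal.",
}

def judge(call_llm: Callable[[str], str],
          reference: str, candidate: str) -> dict[str, int]:
    """Return one 1-10 score per evaluation role."""
    return {
        role: int(call_llm(
            f"{persona}\nReference summary:\n{reference}\n"
            f"Candidate summary:\n{candidate}\n"
            "Score the candidate from 1 to 10. Return only the integer."
        ))
        for role, persona in ROLES.items()
    }
```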