WritingBench: A Comprehensive Benchmark for Generative Writing
- URL: http://arxiv.org/abs/2503.05244v3
- Date: Thu, 20 Mar 2025 05:13:53 GMT
- Title: WritingBench: A Comprehensive Benchmark for Generative Writing
- Authors: Yuning Wu, Jiahao Mei, Ming Yan, Chenliang Li, Shaopeng Lai, Yuran Ren, Zijia Wang, Ji Zhang, Mengyue Wu, Qin Jin, Fei Huang,
- Abstract summary: We present WritingBench, a benchmark designed to evaluate large language models (LLMs) across 6 core writing domains and 100, encompassing creative, persuasive, informative, and technical writing.<n>We propose a query-dependent evaluation framework that empowers LLMs to dynamically generate instance-specific assessment criteria.<n>This framework is complemented by a fine-tuned critic model for criteria-aware scoring, enabling evaluations in style, format and length.
- Score: 87.48445972563631
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in large language models (LLMs) have significantly enhanced text generation capabilities, yet evaluating their performance in generative writing remains a challenge. Existing benchmarks primarily focus on generic text generation or limited in writing tasks, failing to capture the diverse requirements of high-quality written contents across various domains. To bridge this gap, we present WritingBench, a comprehensive benchmark designed to evaluate LLMs across 6 core writing domains and 100 subdomains, encompassing creative, persuasive, informative, and technical writing. We further propose a query-dependent evaluation framework that empowers LLMs to dynamically generate instance-specific assessment criteria. This framework is complemented by a fine-tuned critic model for criteria-aware scoring, enabling evaluations in style, format and length. The framework's validity is further demonstrated by its data curation capability, which enables 7B-parameter models to approach state-of-the-art (SOTA) performance. We open-source the benchmark, along with evaluation tools and modular framework components, to advance the development of LLMs in writing.
Related papers
- StructTest: Benchmarking LLMs' Reasoning through Compositional Structured Outputs [78.84060166851805]
StructTest is a novel benchmark that evaluates large language models (LLMs) on their ability to follow compositional instructions and generate structured outputs.
Assessments are conducted deterministically using a rule-based evaluator, which can be easily extended to new tasks and datasets.
We demonstrate that StructTest remains challenging even for top-performing models like Deepseek-V3/R1 and GPT-4o.
arXiv Detail & Related papers (2024-12-23T22:08:40Z) - Towards Better Open-Ended Text Generation: A Multicriteria Evaluation Framework [0.1979158763744267]
Open-ended text generation has become a prominent task in natural language processing.<n>Decoding methods often excel in some metrics while underperforming in others.<n>We present novel ranking strategies within this multicriteria framework.
arXiv Detail & Related papers (2024-10-24T11:32:01Z) - Beyond Metrics: A Critical Analysis of the Variability in Large Language Model Evaluation Frameworks [3.773596042872403]
Large language models (LLMs) continue to evolve, the need for robust and standardized evaluation benchmarks becomes paramount.
Various frameworks have emerged as noteworthy contributions to the field, offering comprehensive evaluation tests and benchmarks.
This paper provides an exploration and critical analysis of some of these evaluation methodologies, shedding light on their strengths, limitations, and impact on advancing the state-of-the-art in natural language processing.
arXiv Detail & Related papers (2024-07-29T03:37:14Z) - Navigating the Path of Writing: Outline-guided Text Generation with Large Language Models [8.920436030483872]
Large Language Models (LLMs) have impacted the writing process, enhancing productivity by collaborating with humans in content creation platforms.<n>We propose WritingPath, a framework that uses explicit outlines to guide LLMs in generating goal-oriented, high-quality text.
arXiv Detail & Related papers (2024-04-22T06:57:43Z) - MATEval: A Multi-Agent Discussion Framework for Advancing Open-Ended Text Evaluation [22.19073789961769]
generative Large Language Models (LLMs) have been remarkable, however, the quality of the text generated by these models often reveals persistent issues.
We propose the MATEval: A "Multi-Agent Text Evaluation framework"
Our framework incorporates self-reflection and Chain-of-Thought strategies, along with feedback mechanisms, to enhance the depth and breadth of the evaluation process.
arXiv Detail & Related papers (2024-03-28T10:41:47Z) - Exploring Precision and Recall to assess the quality and diversity of LLMs [82.21278402856079]
We introduce a novel evaluation framework for Large Language Models (LLMs) such as textscLlama-2 and textscMistral.
This approach allows for a nuanced assessment of the quality and diversity of generated text without the need for aligned corpora.
arXiv Detail & Related papers (2024-02-16T13:53:26Z) - PROXYQA: An Alternative Framework for Evaluating Long-Form Text Generation with Large Language Models [72.57329554067195]
ProxyQA is an innovative framework dedicated to assessing longtext generation.
It comprises in-depth human-curated meta-questions spanning various domains, each accompanied by specific proxy-questions with pre-annotated answers.
It assesses the generated content's quality through the evaluator's accuracy in addressing the proxy-questions.
arXiv Detail & Related papers (2024-01-26T18:12:25Z) - Large Language Models are Diverse Role-Players for Summarization
Evaluation [82.31575622685902]
A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal.
Most of the automatic evaluation methods like BLUE/ROUGE may be not able to adequately capture the above dimensions.
We propose a new evaluation framework based on LLMs, which provides a comprehensive evaluation framework by comparing generated text and reference text from both objective and subjective aspects.
arXiv Detail & Related papers (2023-03-27T10:40:59Z) - EditEval: An Instruction-Based Benchmark for Text Improvements [73.5918084416016]
This work presents EditEval: An instruction-based, benchmark and evaluation suite for automatic evaluation of editing capabilities.
We evaluate several pre-trained models, which shows that InstructGPT and PEER perform the best, but that most baselines fall below the supervised SOTA.
Our analysis shows that commonly used metrics for editing tasks do not always correlate well, and that optimization for prompts with the highest performance does not necessarily entail the strongest robustness to different models.
arXiv Detail & Related papers (2022-09-27T12:26:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.