AI-Slop to AI-Polish? Aligning Language Models through Edit-Based Writing Rewards and Test-time Computation
- URL: http://arxiv.org/abs/2504.07532v2
- Date: Sun, 20 Apr 2025 02:42:05 GMT
- Title: AI-Slop to AI-Polish? Aligning Language Models through Edit-Based Writing Rewards and Test-time Computation
- Authors: Tuhin Chakrabarty, Philippe Laban, Chien-Sheng Wu
- Abstract summary: Writing quality assessment has received less attention from the community, in part because it is fundamentally subjective and requires expertise. We first introduce the Writing Quality Benchmark (WQ) by consolidating five writing-preference datasets into 4,729 writing quality judgments. Our experiments show that most of the competitive baselines, including state-of-the-art LLMs that excel at reasoning tasks, barely outperform random baselines on WQ. We then train specialized Writing Quality Reward Models (WQRM) of various sizes for writing quality assessment that demonstrate strong generalization on four out-of-distribution test sets and 74% accuracy on the WQ benchmark.
- Score: 39.00434175773803
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: AI-generated text is proliferating across domains, from creative writing and journalism to marketing content and scientific articles. Models can follow user-provided instructions to generate coherent and grammatically correct outputs, but in this work we study a more fundamental question: how do we evaluate and improve the writing quality of AI-generated text? Writing quality assessment has received less attention from the community, in part because it is fundamentally subjective and requires expertise. We first introduce the Writing Quality Benchmark (WQ) by consolidating five writing-preference datasets into 4,729 writing quality judgments. Our experiments show that most of the competitive baselines, including state-of-the-art LLMs that excel at reasoning tasks, barely outperform random baselines on WQ. We then train specialized Writing Quality Reward Models (WQRM) of various sizes for writing quality assessment that demonstrate strong generalization on four out-of-distribution test sets and 74% accuracy on the WQ benchmark. To further show WQRM's practical benefits during inference, we leverage additional test-time compute to generate and rank multiple candidate revisions, allowing us to select higher-quality outputs from an initial draft. Human evaluation with 9 experienced writers confirms that WQRM-based selection produces writing samples preferred by experts 66% of the time overall, and 72.2% of the time when the reward gap is larger than 1 point. We release our datasets and models to encourage community engagement with writing quality assessment and development of AI writing systems better aligned with human preferences.
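The test-time computation described in the abstract is, at its core, best-of-N reranking with a reward model: generate several candidate revisions of a draft, score each one, and keep the top candidate. A minimal sketch in Python, where `generate_revision` and `score_quality` are hypothetical stubs standing in for an LLM reviser and the trained WQRM (not the paper's released code):

```python
import random

def generate_revision(draft: str, rng: random.Random) -> str:
    # Stub: in practice, prompt an LLM to produce an edited revision of the draft.
    return f"{draft} (revision variant {rng.randint(0, 9999)})"

def score_quality(text: str) -> float:
    # Stub: in practice, a trained writing-quality reward model scores the text.
    return (sum(ord(c) for c in text) % 100) / 10.0

def best_of_n(draft: str, n: int = 8, seed: int = 0) -> tuple[str, float]:
    """Generate n candidate revisions and keep the one the reward model ranks highest."""
    rng = random.Random(seed)
    candidates = [generate_revision(draft, rng) for _ in range(n)]
    best = max(candidates, key=score_quality)
    return best, score_quality(best)

if __name__ == "__main__":
    text, score = best_of_n("An initial draft paragraph.")
    print(f"reward {score:.1f}: {text}")
```

Spending more test-time compute here just means raising n; the reward gap between the selected candidate and the initial draft is the quantity the paper's human evaluation conditions on.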
Related papers
- Automated Creativity Evaluation for Large Language Models: A Reference-Based Approach [32.654673913638426]
We propose an automated evaluation method based on the Torrance Test of Creative Writing (TTCW), which evaluates creativity as a product.
Our method employs a reference-based Likert-style approach, scoring generated creative texts relative to high-quality reference texts.
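As a rough illustration of reference-based Likert scoring, the sketch below prompts a judge model to rate a candidate against high-quality reference texts on a 1-5 scale; `call_llm`, the prompt wording, and the rubric are assumptions, not the paper's exact method:

```python
import re

def call_llm(prompt: str) -> str:
    # Stub: wire this to any chat-completion client.
    return "Rating: 4"

def likert_score(candidate: str, references: list[str]) -> int:
    refs = "\n\n".join(f"Reference {i + 1}:\n{r}" for i, r in enumerate(references))
    prompt = (
        "You are judging creative writing quality.\n\n"
        f"{refs}\n\n"
        f"Candidate:\n{candidate}\n\n"
        "Relative to the references, rate the candidate on a 1-5 Likert scale. "
        "Answer exactly as 'Rating: <n>'."
    )
    reply = call_llm(prompt)
    match = re.search(r"Rating:\s*([1-5])", reply)
    return int(match.group(1)) if match else 3  # fall back to the scale midpoint
```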
arXiv Detail & Related papers (2025-04-22T10:52:23Z)
- WritingBench: A Comprehensive Benchmark for Generative Writing [87.48445972563631]
We present WritingBench, a benchmark designed to evaluate large language models (LLMs) across 6 core writing domains and 100 subdomains, encompassing creative, persuasive, informative, and technical writing.
We propose a query-dependent evaluation framework that empowers LLMs to dynamically generate instance-specific assessment criteria.
This framework is complemented by a fine-tuned critic model for criteria-aware scoring, enabling evaluations in style, format and length.
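A hypothetical two-stage sketch of query-dependent evaluation: first derive instance-specific criteria from the query, then score the response per criterion with a critic. Both stubs below stand in for the paper's criteria generator and fine-tuned critic model:

```python
def propose_criteria(query: str) -> list[str]:
    # Stub: in practice, prompt an LLM to emit assessment criteria tailored to the query.
    return ["fulfills the requested format",
            "matches the requested style",
            "stays within the requested length"]

def critic_score(response: str, criterion: str) -> float:
    # Stub: in practice, a fine-tuned critic returns a criterion-aware score.
    return 7.5

def evaluate(query: str, response: str) -> dict[str, float]:
    criteria = propose_criteria(query)
    scores = {c: critic_score(response, c) for c in criteria}
    scores["overall"] = sum(scores.values()) / len(criteria)
    return scores
```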
arXiv Detail & Related papers (2025-03-07T08:56:20Z)
- CritiQ: Mining Data Quality Criteria from Human Preferences [70.35346554179036]
We introduce CritiQ, a novel data selection method that automatically mines criteria from human preferences for data quality.
CritiQ Flow employs a manager agent to evolve quality criteria and worker agents to make pairwise judgments.
We demonstrate the effectiveness of our method in the code, math, and logic domains.
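A highly simplified sketch of the manager/worker pattern: worker judgments under each candidate criterion are checked against human pairwise preferences, and the manager keeps the best-agreeing criteria for the next round. The function names and the judgment stub are illustrative, not CritiQ's implementation:

```python
def worker_judgment(criterion: str, text_a: str, text_b: str) -> str:
    # Stub: prompt an LLM worker to pick "a" or "b" according to `criterion`.
    return "a" if len(text_a) >= len(text_b) else "b"

def agreement(criterion: str, human_prefs: list[tuple[str, str, str]]) -> float:
    """Fraction of human pairwise labels ('a' or 'b') the worker reproduces."""
    hits = sum(worker_judgment(criterion, a, b) == label for a, b, label in human_prefs)
    return hits / len(human_prefs)

def manager_select(criteria: list[str], human_prefs: list[tuple[str, str, str]],
                   keep: int = 3) -> list[str]:
    ranked = sorted(criteria, key=lambda c: agreement(c, human_prefs), reverse=True)
    return ranked[:keep]  # survivors seed the next round of criterion evolution
```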
arXiv Detail & Related papers (2025-02-26T16:33:41Z)
- NewsBench: A Systematic Evaluation Framework for Assessing Editorial Capabilities of Large Language Models in Chinese Journalism [28.443004656952343]
We present NewsBench, a novel evaluation framework to systematically assess the editorial capabilities of Large Language Models (LLMs) in Chinese journalism.
Our benchmark dataset focuses on four facets of writing proficiency and six facets of safety adherence.
We propose different GPT-4 based automatic evaluation protocols to assess LLM generations for short answer questions in terms of writing proficiency and safety adherence.
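One way such a protocol can look (the rubric text and JSON schema below are assumptions, not NewsBench's exact prompts): a single judge call scores a short answer on both facet groups:

```python
import json

def call_judge(prompt: str) -> str:
    # Stub: wire this to GPT-4 or another strong judge model.
    return '{"writing_proficiency": 4, "safety_adherence": 5}'

def judge_short_answer(question: str, answer: str) -> dict[str, int]:
    prompt = (
        "Score the answer on two facets, each on a 1-5 scale.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        'Reply only with JSON: {"writing_proficiency": n, "safety_adherence": n}'
    )
    return json.loads(call_judge(prompt))
```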
arXiv Detail & Related papers (2024-02-29T21:05:14Z)
- QuRating: Selecting High-Quality Data for Training Language Models [64.83332850645074]
We introduce QuRating, a method for selecting pre-training data that can capture human intuitions about data quality.
In this paper, we investigate four qualities: writing style, required expertise, facts & trivia, and educational value.
We train a QuRater model to learn scalar ratings from pairwise judgments, and use it to annotate a 260B-token training corpus with quality ratings for each of the four criteria.
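Learning scalar ratings from pairwise judgments is typically done with a Bradley-Terry objective: the winner's score should exceed the loser's. A toy PyTorch sketch under that assumption (the paper's actual architecture and loss details may differ), using random tensors in place of text embeddings:

```python
import torch
import torch.nn as nn

class Rater(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.head = nn.Linear(dim, 1)  # maps a text embedding to one scalar rating

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(x).squeeze(-1)

def bt_loss(rater: Rater, winners: torch.Tensor, losers: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry: -log sigmoid(score(winner) - score(loser)), batch-averaged."""
    margin = rater(winners) - rater(losers)
    return nn.functional.softplus(-margin).mean()

# One toy optimization step on random "embeddings" standing in for encoded text pairs.
rater = Rater(dim=16)
opt = torch.optim.Adam(rater.parameters(), lr=1e-3)
winners, losers = torch.randn(32, 16), torch.randn(32, 16)
opt.zero_grad()
loss = bt_loss(rater, winners, losers)
loss.backward()
opt.step()
```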
arXiv Detail & Related papers (2024-02-15T06:36:07Z)
- CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation [87.44350003888646]
The proposed Eval-Instruct method first acquires pointwise grading critiques with pseudo references, then revises these critiques via multi-path prompting.
CritiqueLLM is empirically shown to outperform ChatGPT and all the open-source baselines.
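A speculative sketch of that pipeline: draft a pointwise critique against a pseudo reference, revise it along a few prompting paths, and keep the consensus score. The stub reply, path instructions, and score format are all assumptions:

```python
import re
from statistics import median

def call_llm(prompt: str) -> str:
    # Stub: replace with a real chat-completion call.
    return "Score: 6. The argument is coherent but the prose is flat."

def draft_critique(text: str, pseudo_reference: str) -> str:
    return call_llm(f"Reference:\n{pseudo_reference}\n\n"
                    f"Grade this text 1-10 and write a critique:\n{text}")

def revise_paths(critique: str, paths: list[str]) -> list[str]:
    return [call_llm(f"Revise this critique to be {p}:\n{critique}") for p in paths]

def parse_score(critique: str) -> int:
    m = re.search(r"Score:\s*(\d+)", critique)
    return int(m.group(1)) if m else 5  # fall back to the scale midpoint

def grade(text: str, pseudo_reference: str) -> int:
    critique = draft_critique(text, pseudo_reference)
    revisions = revise_paths(critique, ["more specific", "better calibrated"])
    return int(median(parse_score(c) for c in [critique, *revisions]))
```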
arXiv Detail & Related papers (2023-11-30T16:52:42Z)
- Quality-Diversity through AI Feedback [10.423093353553217]
Quality-diversity (QD) search algorithms aim at continually improving and diversifying a population of candidates.
Recent developments in language models (LMs) have enabled guiding search through AI feedback.
QDAIF is a step towards AI systems that can independently search, diversify, evaluate, and improve.
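A minimal MAP-Elites-style sketch of quality-diversity search with AI feedback: an archive keeps the best candidate per diversity bin, where both the quality score and the diversity descriptor come from (stubbed) model feedback. Everything below is illustrative rather than QDAIF's implementation:

```python
import random

def mutate(text: str) -> str:
    # Stub: in practice, an LM rewrites or extends the candidate.
    return text + random.choice([" Brighter.", " Darker.", " Stranger."])

def ai_quality(text: str) -> float:
    # Stub: in practice, ask an LM to score the candidate's quality.
    return random.random()

def ai_descriptor(text: str) -> str:
    # Stub: in practice, ask an LM which diversity bin (e.g., tone) the text falls in.
    return random.choice(["happy", "sad", "weird"])

def qd_search(seed_text: str, iters: int = 200) -> dict[str, tuple[float, str]]:
    archive: dict[str, tuple[float, str]] = {}   # diversity bin -> (quality, elite text)
    pool = [seed_text]
    for _ in range(iters):
        child = mutate(random.choice(pool))
        quality, bin_key = ai_quality(child), ai_descriptor(child)
        if bin_key not in archive or quality > archive[bin_key][0]:
            archive[bin_key] = (quality, child)  # new elite for this bin
            pool.append(child)
    return archive
```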
arXiv Detail & Related papers (2023-10-19T12:13:58Z)
- Calibrating LLM-Based Evaluator [92.17397504834825]
We propose AutoCalibrate, a multi-stage, gradient-free approach to calibrate and align an LLM-based evaluator toward human preference.
Instead of explicitly modeling human preferences, we first implicitly encompass them within a set of human labels.
Our experiments on multiple text quality evaluation datasets illustrate a significant improvement in correlation with expert evaluation through calibration.
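In the same gradient-free spirit, a toy calibration loop might draft several candidate scoring prompts and keep the one whose judge scores correlate best with a small pool of human labels; the judge stub below is an assumption, and Spearman correlation is just one reasonable alignment metric:

```python
from scipy.stats import spearmanr

def judge_score(prompt: str, text: str) -> float:
    # Stub: replace with an LLM judge call using `prompt` as the scoring instruction.
    return float(len(text) % 5)

def calibrate(prompts: list[str], texts: list[str], human: list[float]) -> str:
    def fitness(p: str) -> float:
        scores = [judge_score(p, t) for t in texts]
        rho, _ = spearmanr(scores, human)  # NaN if a judge gives constant scores
        return rho
    return max(prompts, key=fitness)
```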
arXiv Detail & Related papers (2023-09-23T08:46:11Z)