Beyond Correctness: Evaluating Subjective Writing Preferences Across Cultures
- URL: http://arxiv.org/abs/2510.14616v1
- Date: Thu, 16 Oct 2025 12:23:13 GMT
- Title: Beyond Correctness: Evaluating Subjective Writing Preferences Across Cultures
- Authors: Shuangshuang Ying, Yunwen Li, Xingwei Qu, Xin Li, Sheng Jin, Minghao Liu, Zhoufutu Wen, Xeron Du, Tianyu Zheng, Yichi Zhang, Letian Ni, Yuyang Cheng, Qiguang Chen, Jingzhe Ding, Shengda Long, Wangchunshu Zhou, Jiazhan Feng, Wanjun Zhong, Libo Qin, Ge Zhang, Wenhao Huang, Wanxiang Che, Chenghua Lin
- Abstract summary: Current preference learning methods achieve high accuracy on standard benchmarks but exhibit significant performance degradation when objective quality signals are removed. We introduce WritingPreferenceBench, a dataset of 1,800 human-annotated preference pairs (1,200 English, 600 Chinese) across 8 creative writing genres.
- Score: 87.75098311090642
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Current preference learning methods achieve high accuracy on standard benchmarks but exhibit significant performance degradation when objective quality signals are removed. We introduce WritingPreferenceBench, a dataset of 1,800 human-annotated preference pairs (1,200 English, 600 Chinese) across 8 creative writing genres, where responses are matched for objective correctness, factual accuracy, and length. On this benchmark, sequence-based reward models (the standard architecture for RLHF) achieve only 52.7% mean accuracy, while zero-shot language model judges perform at 53.9%. In contrast, generative reward models that produce explicit reasoning chains achieve 81.8% accuracy. We observe high within-model variance across genres: individual models range from 18.2% to 81.8% accuracy across different writing categories, with standard deviations averaging 10.1%. This variance persists regardless of model scale, with 27B parameter models showing no consistent improvement over 8B variants. Our results suggest that current RLHF methods primarily learn to detect objective errors rather than capture subjective quality preferences (e.g., creativity, stylistic flair, and emotional resonance), and that successful preference modeling may require intermediate reasoning representations rather than direct classification.
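To make the headline numbers concrete, here is a minimal sketch of how pairwise accuracy on a benchmark like this is typically computed, assuming a reward model exposed as a scalar-scoring callable; the names are illustrative, not the paper's code:

```python
import random

def pairwise_accuracy(reward_model, pairs):
    """Fraction of preference pairs where the model scores the
    human-preferred response above the rejected one.

    pairs: list of (prompt, chosen, rejected) strings.
    reward_model(prompt, response) -> float scalar score.
    Ties are broken at random, so chance performance is 50%.
    """
    correct = 0
    for prompt, chosen, rejected in pairs:
        s_chosen = reward_model(prompt, chosen)
        s_rejected = reward_model(prompt, rejected)
        if s_chosen == s_rejected:
            correct += random.random() < 0.5
        else:
            correct += int(s_chosen > s_rejected)
    return correct / len(pairs)
```

Under this metric, the 52.7% reported for sequence-based reward models sits barely above the 50% chance floor, while the 81.8% for reasoning-chain judges is a large margin over it.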
Related papers
- Outcome Accuracy is Not Enough: Aligning the Reasoning Process of Reward Models [108.26461635308796]
We introduce Rationale Consistency, a fine-grained metric that quantifies the alignment between the model's reasoning process and human judgment.
Our evaluation of frontier models reveals that rationale consistency effectively discriminates among state-of-the-art models.
We introduce a hybrid signal that combines rationale consistency with outcome accuracy for GenRM training.
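The abstract does not spell out how the two signals are combined; a plausible minimal form, sketched here purely as an assumption, is a convex combination (the function and weighting are hypothetical):

```python
def hybrid_signal(rationale_consistency: float, outcome_accuracy: float,
                  alpha: float = 0.5) -> float:
    """Hypothetical hybrid training signal for a generative reward model
    (GenRM): a weighted blend of rationale-human alignment and
    final-verdict correctness. Both inputs are assumed to lie in [0, 1];
    the paper's actual combination rule may differ.
    """
    return alpha * rationale_consistency + (1 - alpha) * outcome_accuracy
```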
arXiv Detail & Related papers (2026-02-04T15:24:52Z)
- WISE: Web Information Satire and Fakeness Evaluation [0.9694940903078657]
MiniLM, a lightweight model, achieves the highest accuracy (87.58%) among all models.
DistilBERT offers an excellent efficiency-accuracy trade-off with 86.28% accuracy and 93.90% ROC-AUC.
arXiv Detail & Related papers (2025-12-30T05:44:32Z)
- Decoding the Past: Explainable Machine Learning Models for Dating Historical Texts [0.08749675983608168]
This article addresses temporal text classification using interpretable, feature-engineered tree-based machine learning models.
We integrate five feature categories (compression-based, lexical structure, readability, neologism detection, and distance features) to predict the temporal origin of English texts spanning five centuries.
On a large-scale corpus, we achieve 76.7% accuracy for century-scale prediction and 26.1% for decade-scale classification, substantially above random baselines.
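As a flavor of the feature engineering involved, here is a small sketch of a compression-based feature and one crude lexical-structure feature feeding a tree ensemble; the names and the choice of RandomForestClassifier are illustrative stand-ins, not the paper's exact pipeline:

```python
import zlib
from sklearn.ensemble import RandomForestClassifier

def compression_ratio(text: str) -> float:
    """Compression-based feature: compressed-to-raw byte ratio.
    More repetitive, predictable prose compresses further."""
    raw = text.encode("utf-8")
    return len(zlib.compress(raw)) / len(raw)

def featurize(text: str) -> list:
    """Two stand-in features out of the five categories the paper uses:
    compression ratio and mean word length (a lexical-structure proxy)."""
    words = text.split()
    mean_word_len = sum(len(w) for w in words) / max(len(words), 1)
    return [compression_ratio(text), mean_word_len]

# Hypothetical usage, given texts: list[str] and centuries: list[int]:
# clf = RandomForestClassifier(n_estimators=300)
# clf.fit([featurize(t) for t in texts], centuries)
```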
arXiv Detail & Related papers (2025-11-28T10:27:48Z)
- Large language models for automated PRISMA 2020 adherence checking [0.01588808390680495]
We constructed a copyright-aware benchmark of 108 Creative Commons-licensed systematic reviews.
We evaluated ten large language models (LLMs) across five input formats.
arXiv Detail & Related papers (2025-11-20T02:08:13Z)
- Flattery, Fluff, and Fog: Diagnosing and Mitigating Idiosyncratic Biases in Preference Models [12.445845925904466]
Language models serve as proxies for human preference judgements in alignment and evaluation.
They exhibit systematic miscalibration, prioritizing superficial patterns over substantive qualities.
This bias manifests as overreliance on features like length, structure, and style, leading to issues like reward hacking and unreliable evaluations.
arXiv Detail & Related papers (2025-06-05T17:59:32Z)
- WorldPM: Scaling Human Preference Modeling [130.23230492612214]
We propose World Preference Modeling (WorldPM) to emphasize this scaling potential.
We collect preference data from public forums covering diverse user communities.
We conduct extensive training using 15M-scale data across models ranging from 1.5B to 72B parameters.
arXiv Detail & Related papers (2025-05-15T17:38:37Z)
- Beyond One-Size-Fits-All: Tailored Benchmarks for Efficient Evaluation [19.673388630963807]
We present TailoredBench, a method that conducts customized evaluation tailored to each target model.
A Global-coreset is first constructed as a probe to identify the most consistent source models for each target model.
A scalable K-Medoids clustering algorithm is proposed to extend the Global-coreset to a tailored Native-coreset for each target model.
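For intuition about the clustering step, here is a plain K-Medoids heuristic over a precomputed distance matrix; this is a generic alternating algorithm for illustration, not the scalable variant the paper proposes:

```python
import numpy as np

def k_medoids(dist: np.ndarray, k: int, iters: int = 50, seed: int = 0):
    """Minimal K-Medoids on an (n, n) distance matrix.
    Returns indices of k medoid examples, which would serve as the
    coreset of representative benchmark items."""
    rng = np.random.default_rng(seed)
    medoids = rng.choice(dist.shape[0], size=k, replace=False)
    for _ in range(iters):
        labels = np.argmin(dist[:, medoids], axis=1)  # nearest medoid
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members) == 0:
                continue
            # The new medoid minimizes total distance within its cluster.
            within = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids[j] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return medoids
```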
arXiv Detail & Related papers (2025-02-19T09:31:50Z)
- Stacking-Enhanced Bagging Ensemble Learning for Breast Cancer Classification with CNN [0.24578723416255752]
This paper proposes a CNN classification network based on Bagging and stacking ensemble learning methods for breast cancer classification.
The model classifies input images quickly and accurately.
For binary classification (presence or absence of breast cancer), the accuracy reached 98.84%, and for five-class classification, the accuracy reached 98.34%.
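The paper's models are CNNs over mammography images; as a self-contained illustration of the bagging-plus-stacking pattern itself, here is a sketch on scikit-learn's tabular breast-cancer dataset, with simple base learners standing in for the CNNs:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Several bagged ensembles act as base learners; a meta-learner
# (logistic regression) stacks their predictions.
base_learners = [
    (f"bag{i}",
     BaggingClassifier(DecisionTreeClassifier(max_depth=5),
                       n_estimators=25, random_state=i))
    for i in range(3)
]
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression(max_iter=1000))

X, y = load_breast_cancer(return_X_y=True)
print("CV accuracy:", cross_val_score(stack, X, y, cv=5).mean())
```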
arXiv Detail & Related papers (2024-07-15T09:44:43Z)
- Common 7B Language Models Already Possess Strong Math Capabilities [61.61442513067561]
This paper shows that the LLaMA-2 7B model with common pre-training already exhibits strong mathematical abilities.
The potential for extensive scaling is constrained by the scarcity of publicly available math questions.
arXiv Detail & Related papers (2024-03-07T18:00:40Z)
- (Certified!!) Adversarial Robustness for Free! [116.6052628829344]
We certify 71% accuracy on ImageNet under adversarial perturbations constrained to be within a 2-norm of 0.5.
We obtain these results using only pretrained diffusion models and image classifiers, without requiring any fine-tuning or retraining of model parameters.
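The recipe is denoised smoothing: Gaussian-noise a test image, denoise it with a pretrained diffusion model, classify, and aggregate votes, with the certificate coming from randomized smoothing. A prediction-side sketch, assuming `denoise` and `classify` are pretrained callables; the statistical certification test over the vote counts is omitted:

```python
import torch

def smoothed_predict(denoise, classify, x, sigma=0.5, n=1000,
                     num_classes=1000):
    """Denoised-smoothing prediction sketch: perturb with Gaussian noise
    at the certification level sigma, denoise with a pretrained diffusion
    model, classify, and take a majority vote over n draws."""
    votes = torch.zeros(num_classes, dtype=torch.long)
    with torch.no_grad():
        for _ in range(n):
            noisy = x + sigma * torch.randn_like(x)
            logits = classify(denoise(noisy, sigma))
            votes[logits.argmax()] += 1
    return votes.argmax().item()
```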
arXiv Detail & Related papers (2022-06-21T17:27:27Z) - Self-Normalized Importance Sampling for Neural Language Modeling [97.96857871187052]
In this work, we propose self-normalized importance sampling. Compared to our previous work, the criteria considered here are self-normalized, so no further correction step is needed.
We show that our proposed self-normalized importance sampling is competitive in both research-oriented and production-oriented automatic speech recognition tasks.
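For context, the classical importance-sampled softmax that this line of work builds on estimates the normalizer from a handful of sampled negatives and corrects by the proposal probability; the paper's self-normalized variant removes exactly that correction step. A sketch of the corrected baseline estimator (shapes and names are illustrative):

```python
import math
import torch

def sampled_softmax_nll(scores_target, scores_sampled, log_q_sampled):
    """Importance-sampled NLL: the softmax normalizer
        Z = sum_v exp(s_v)
    is estimated from K negatives v_k ~ q as
        Z ~= (1/K) * sum_k exp(s_k) / q(v_k).

    scores_target: (B,) logits of the true words.
    scores_sampled: (B, K) logits of sampled negatives.
    log_q_sampled:  (B, K) log-proposal probabilities of those negatives.
    """
    k = scores_sampled.shape[1]
    log_z = torch.logsumexp(scores_sampled - log_q_sampled, dim=1) - math.log(k)
    return (log_z - scores_target).mean()
```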
arXiv Detail & Related papers (2021-11-11T16:57:53Z)