Persona-Augmented Benchmarking: Evaluating LLMs Across Diverse Writing Styles
- URL: http://arxiv.org/abs/2507.22168v1
- Date: Tue, 29 Jul 2025 18:59:09 GMT
- Title: Persona-Augmented Benchmarking: Evaluating LLMs Across Diverse Writing Styles
- Authors: Kimberly Le Truong, Riccardo Fogliato, Hoda Heidari, Zhiwei Steven Wu
- Abstract summary: We identify distinct writing styles that consistently trigger either low or high performance across a range of models and tasks. Our work offers a scalable approach to augment existing benchmarks, improving the external validity of the assessments they provide for measuring LLM performance.
- Score: 32.121191446326876
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current benchmarks for evaluating Large Language Models (LLMs) often do not exhibit enough writing style diversity, with many adhering primarily to standardized conventions. Such benchmarks do not fully capture the rich variety of communication patterns exhibited by humans. Thus, it is possible that LLMs, which are optimized on these benchmarks, may demonstrate brittle performance when faced with "non-standard" input. In this work, we test this hypothesis by rewriting evaluation prompts using persona-based LLM prompting, a low-cost method to emulate diverse writing styles. Our results show that, even with identical semantic content, variations in writing style and prompt formatting significantly impact the estimated performance of the LLM under evaluation. Notably, we identify distinct writing styles that consistently trigger either low or high performance across a range of models and tasks, irrespective of model family, size, and recency. Our work offers a scalable approach to augment existing benchmarks, improving the external validity of the assessments they provide for measuring LLM performance across linguistic variations.
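As a rough illustration of the persona-based rewriting step described in the abstract, the sketch below rewrites a benchmark prompt in several writing styles while asking the model to preserve the semantic content of the task. It assumes an OpenAI-compatible chat API; the persona descriptions, model name, and rewriting instruction are illustrative stand-ins, not the prompts used in the paper.

```python
# Minimal sketch of persona-based prompt rewriting for benchmark augmentation.
# Assumes an OpenAI-compatible chat API; personas, model name, and the exact
# rewriting instruction are hypothetical and not taken from the paper.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical persona descriptions used to emulate diverse writing styles.
PERSONAS = [
    "a teenager who writes in casual, abbreviated text-message style",
    "a non-native English speaker with simple vocabulary and occasional grammar slips",
    "a formal academic who writes long, hedged, passive-voice sentences",
]

def rewrite_prompt(original_prompt: str, persona: str, model: str = "gpt-4o-mini") -> str:
    """Rewrite an evaluation prompt in the writing style of the given persona,
    keeping the semantic content (the task being asked) unchanged."""
    instruction = (
        f"Rewrite the following task prompt as if written by {persona}. "
        "Preserve the meaning and all information needed to solve the task; "
        "change only the writing style. Return the rewritten prompt only.\n\n"
        f"Prompt:\n{original_prompt}"
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": instruction}],
        temperature=0.7,
    )
    return resp.choices[0].message.content.strip()

if __name__ == "__main__":
    # Usage: augment one benchmark item with a styled variant per persona.
    item = "A train travels 120 km in 1.5 hours. What is its average speed in km/h?"
    for persona in PERSONAS:
        print(f"--- {persona} ---")
        print(rewrite_prompt(item, persona))
```

In a setup like the one the abstract describes, the rewritten variants would then be posed to the models under evaluation and scored against the original reference answers, so that any performance gap reflects writing style rather than a change in content.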
Related papers
- Re-Evaluating Code LLM Benchmarks Under Semantic Mutation [8.58692613099365]
We present an empirical study to investigate prompt sensitivity in code benchmarks.
We propose a general framework that modifies prompt templates in a manner that preserves both their semantics and their structure.
Our findings suggest that even slight prompt variations can lead to significant shifts in performance.
arXiv Detail & Related papers (2025-06-20T15:30:36Z)
- Benchmarking and Revisiting Code Generation Assessment: A Mutation-Based Approach [20.27214998822657]
Code Large Language Models (CLLMs) have exhibited outstanding performance in program synthesis.
Existing benchmarks typically provide only a single input prompt for the evaluation of each problem.
We propose 10 mutation strategies and introduce three new metrics to evaluate their impact on code generation.
arXiv Detail & Related papers (2025-05-11T07:14:30Z)
- Improve LLM-based Automatic Essay Scoring with Linguistic Features [46.41475844992872]
This paper develops a scoring system capable of handling essays across diverse prompts.
Existing methods typically fall into two categories: supervised feature-based approaches and large language model (LLM)-based methods.
arXiv Detail & Related papers (2025-02-13T17:09:52Z)
- Toward the Evaluation of Large Language Models Considering Score Variance across Instruction Templates [24.46103924394483]
The natural language understanding (NLU) performance of large language models (LLMs) has been evaluated across various tasks and datasets.
The existing evaluation methods, however, do not take into account the variance in scores due to differences in prompts.
It is therefore necessary to find a way to measure NLU performance in a fair manner, considering score variance between different instruction templates.
arXiv Detail & Related papers (2024-08-22T10:00:20Z)
- RepEval: Effective Text Evaluation with LLM Representation [55.26340302485898]
RepEval is a metric that leverages the projection of Large Language Model (LLM) representations for evaluation.
Our work underscores the richness of information regarding text quality embedded within LLM representations, offering insights for the development of new metrics.
arXiv Detail & Related papers (2024-04-30T13:50:55Z)
- CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark.
In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship.
We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z)
- Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization [132.25202059478065]
We benchmark large language models (LLMs) on instruction controllable text summarization.
Our study reveals that instruction controllable text summarization remains a challenging task for LLMs.
arXiv Detail & Related papers (2023-11-15T18:25:26Z)
- Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting [68.19544657508509]
Large language models (LLMs) are adopted as a fundamental component of language technologies.
We find that several widely used open-source LLMs are extremely sensitive to subtle changes in prompt format in few-shot settings.
We propose an algorithm that rapidly evaluates a sampled set of plausible prompt formats for a given task, and reports the interval of expected performance without accessing model weights.
arXiv Detail & Related papers (2023-10-17T15:03:30Z)
- Evaluating Large Language Models at Evaluating Instruction Following [54.49567482594617]
We introduce a challenging meta-evaluation benchmark, LLMBar, designed to test the ability of an LLM evaluator to discern instruction-following outputs.
We discover that different evaluators exhibit distinct performance on LLMBar and even the highest-scoring ones have substantial room for improvement.
arXiv Detail & Related papers (2023-10-11T16:38:11Z)
- An Examination of the Compositionality of Large Generative Vision-Language Models [7.639748270719836]
Generative Vision-Language Models (GVLMs) have been constructed via multimodal instruction tuning.
In this paper, we examine both the evaluation metrics (VisualGPTScore, etc.) and current benchmarks for evaluating the compositionality of GVLMs.
We identify a syntactic bias in current benchmarks that is exploited by the linguistic capability of GVLMs.
arXiv Detail & Related papers (2023-08-21T06:50:29Z)
- Benchmarking Large Language Models for News Summarization [79.37850439866938]
Large language models (LLMs) have shown promise for automatic summarization, but the reasons behind their success are poorly understood.
We find that instruction tuning, not model size, is the key to LLMs' zero-shot summarization capability.
arXiv Detail & Related papers (2023-01-31T18:46:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.