Evaluating the Evaluators: Are readability metrics good measures of readability?
- URL: http://arxiv.org/abs/2508.19221v1
- Date: Tue, 26 Aug 2025 17:38:42 GMT
- Title: Evaluating the Evaluators: Are readability metrics good measures of readability?
- Authors: Isabel Cachola, Daniel Khashabi, Mark Dredze
- Abstract summary: Plain Language Summarization (PLS) aims to distill complex documents into accessible summaries for non-expert audiences. Traditional readability metrics, such as Flesch-Kincaid Grade Level (FKGL), have not been compared to human readability judgments in PLS. We show that Language Models (LMs) are better judges of readability, with the best-performing model achieving a Pearson correlation of 0.56 with human judgments.
- Score: 36.138020084479784
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Plain Language Summarization (PLS) aims to distill complex documents into accessible summaries for non-expert audiences. In this paper, we conduct a thorough survey of the PLS literature and identify that the current standard practice for readability evaluation is to use traditional readability metrics, such as Flesch-Kincaid Grade Level (FKGL). However, despite their proven utility in other fields, these metrics have not been compared to human readability judgments in PLS. We evaluate 8 readability metrics and show that most correlate poorly with human judgments, including the most popular metric, FKGL. We then show that Language Models (LMs) are better judges of readability, with the best-performing model achieving a Pearson correlation of 0.56 with human judgments. Extending our analysis to PLS datasets, which contain summaries aimed at non-expert audiences, we find that LMs better capture deeper measures of readability, such as required background knowledge, and lead to different conclusions than the traditional metrics. Based on these findings, we offer recommendations for best practices in the evaluation of plain language summaries. We release our analysis code and survey data.
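To make the comparison concrete, here is a minimal Python sketch, not the authors' released code, of how FKGL is typically computed and how a metric's scores can be correlated with human readability judgments via Pearson's r. The vowel-run syllable counter and the example summaries and ratings are illustrative assumptions; real implementations (e.g., the textstat package) use more careful syllable rules.

```python
# Minimal sketch: compute FKGL for a few summaries and correlate the
# scores with (hypothetical) human readability ratings via Pearson's r.
import re
from scipy.stats import pearsonr

def count_syllables(word: str) -> int:
    # Rough heuristic: approximate syllables as runs of vowels, min 1.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fkgl(text: str) -> float:
    # Flesch-Kincaid Grade Level:
    #   0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    n_syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (n_words / sentences) + 11.8 * (n_syllables / n_words) - 15.59

# Hypothetical paired data: one metric score and one human ease rating
# (1 = very hard, 5 = very easy) per summary.
summaries = [
    "The cell makes energy from food.",
    "Cells turn nutrients into usable energy through several steps.",
    "Cellular respiration converts biochemical energy from nutrients into ATP.",
    "Mitochondrial oxidative phosphorylation couples electron transport to ATP synthesis.",
]
human_ratings = [4.8, 4.1, 2.5, 1.3]

metric_scores = [fkgl(s) for s in summaries]
r, p = pearsonr(metric_scores, human_ratings)
print(f"Pearson r = {r:.2f} (p = {p:.3f})")
```

Note the sign convention: FKGL is a grade level, so higher means harder, and a useful metric would correlate strongly and negatively with ease ratings; the paper's finding is that most traditional metrics show only weak correlations of either sign.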
Related papers
- When LLM Judges Inflate Scores: Exploring Overrating in Relevance Assessment [29.603396943658428]
Large language models (LLMs) can be used as proxies for human judges. We show that models consistently assign inflated relevance scores to passages that do not genuinely satisfy the underlying information need. Experiments show that LLM-based relevance judgments can be highly sensitive to passage length and surface-level lexical cues.
arXiv Detail & Related papers (2026-02-19T08:37:21Z) - Human-Aligned Code Readability Assessment with Large Language Models [15.17270025276759]
We introduce CoReEval, the first large-scale benchmark for evaluating Large Language Model (LLM)-based code readability assessment. LLMs offer a scalable alternative, but their behavior as readability evaluators remains underexplored. Our findings show that developer-guided prompting grounded in human-defined readability dimensions improves alignment in structured contexts.
arXiv Detail & Related papers (2025-10-18T17:00:52Z) - Readability Reconsidered: A Cross-Dataset Analysis of Reference-Free Metrics [4.729984735375468]
This work investigates the factors shaping human perceptions of readability through the analysis of 897 judgments. We evaluate 15 popular readability metrics across five English datasets, contrasting them with six more nuanced, model-based metrics. Four model-based metrics consistently place among the top four in rank correlations with human judgments, while the best-performing traditional metric achieves an average rank of 8.6.
arXiv Detail & Related papers (2025-10-17T06:17:21Z) - Neither Valid nor Reliable? Investigating the Use of LLMs as Judges [23.16086453334644]
Large language models as judges (LLJs) have emerged as a promising alternative to traditional metrics, but their validity remains underexplored. This paper argues that the current enthusiasm around LLJs may be premature, as their adoption has outpaced rigorous scrutiny of their reliability and validity as evaluators.
arXiv Detail & Related papers (2025-08-25T14:43:10Z) - Beyond "Not Novel Enough": Enriching Scholarly Critique with LLM-Assisted Feedback [81.0031690510116]
We present a structured approach for automated novelty evaluation that models expert reviewer behavior through three stages. Our method is informed by a large-scale analysis of human-written novelty reviews. Evaluated on 182 ICLR 2025 submissions, the approach achieves 86.5% alignment with human reasoning and 75.3% agreement on novelty conclusions.
arXiv Detail & Related papers (2025-08-14T16:18:37Z) - Reranking-based Generation for Unbiased Perspective Summarization [10.71668103641552]
We develop a test set for benchmarking metric reliability using human annotations. We show that traditional metrics underperform compared to language model-based metrics, which prove to be strong evaluators. Our findings aim to contribute to the reliable evaluation and development of perspective summarization methods.
arXiv Detail & Related papers (2025-06-19T00:01:43Z) - Leveraging LLMs to Evaluate Usefulness of Document [25.976948104719746]
We introduce a new user-centric evaluation framework that integrates users' search context and behavioral data into large language models. Our study demonstrates that when well-guided with context and behavioral information, LLMs can accurately evaluate usefulness. We also apply the labels produced by our method to predict user satisfaction, with real-world experiments indicating that these labels substantially improve the performance of satisfaction prediction models.
arXiv Detail & Related papers (2025-06-10T09:44:03Z) - Can Large Language Models Outperform Non-Experts in Poetry Evaluation? A Comparative Study Using the Consensual Assessment Technique [0.0]
This study adapts the Consensual Assessment Technique (CAT) for Large Language Models (LLMs). We demonstrate that this approach allows LLMs to significantly surpass the performance of non-expert human judges.
arXiv Detail & Related papers (2025-02-26T11:43:25Z) - Towards Understanding the Robustness of LLM-based Evaluations under Perturbations [9.944512689015998]
Large Language Models (LLMs) can serve as automatic evaluators for non-standardized metrics in summarization and dialog-based tasks. We conduct experiments across multiple prompting strategies to examine how LLMs fare as quality evaluators when compared with human judgments.
arXiv Detail & Related papers (2024-12-12T13:31:58Z) - DnA-Eval: Enhancing Large Language Model Evaluation through Decomposition and Aggregation [75.81096662788254]
Large Language Models (LLMs) are scalable and economical evaluators. How reliable these evaluators are has emerged as a crucial research question. We propose Decompose and Aggregate, which breaks down the evaluation process into different stages based on pedagogical practices.
arXiv Detail & Related papers (2024-05-24T08:12:30Z) - Don't Make Your LLM an Evaluation Benchmark Cheater [142.24553056600627]
Large language models (LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvement in model capacity.
To assess model performance, a typical approach is to construct evaluation benchmarks for measuring the ability level of LLMs.
We discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results.
arXiv Detail & Related papers (2023-11-03T14:59:54Z) - Evaluation of Faithfulness Using the Longest Supported Subsequence [52.27522262537075]
We introduce a novel approach to evaluate the faithfulness of machine-generated text by computing the longest noncontinuous subsequence of the claim that is supported by the context.
Using a new human-annotated dataset, we finetune a model to generate the Longest Supported Subsequence (LSS).
Our proposed metric demonstrates an 18% enhancement over the prevailing state-of-the-art metric for faithfulness on our dataset.
arXiv Detail & Related papers (2023-08-23T14:18:44Z) - FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets [69.91340332545094]
We introduce FLASK, a fine-grained evaluation protocol for both human-based and model-based evaluation.
We experimentally observe that fine-grained evaluation is crucial for attaining a holistic view of model performance.
arXiv Detail & Related papers (2023-07-20T14:56:35Z)
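Several of the related papers above, like the main paper, use an LM as a readability or quality judge. Below is a minimal, hypothetical sketch of that pattern using the openai Python client; the prompt wording, rating scale, and model name are illustrative assumptions, not any specific paper's protocol.

```python
# Hypothetical sketch of the LM-as-judge pattern: prompt a chat model to
# rate readability on a fixed scale, then correlate its ratings with
# human judgments (e.g., via pearsonr as in the earlier sketch).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def lm_readability_score(summary: str) -> int:
    prompt = (
        "Rate the readability of the following summary for a non-expert "
        "reader on a scale from 1 (very hard) to 5 (very easy). "
        "Reply with the number only.\n\n" + summary
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip())
```

As several of the papers above caution, such judges can be sensitive to passage length and surface-level lexical cues, so their ratings should be validated against human judgments before being relied on.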