Readability Reconsidered: A Cross-Dataset Analysis of Reference-Free Metrics
- URL: http://arxiv.org/abs/2510.15345v1
- Date: Fri, 17 Oct 2025 06:17:21 GMT
- Title: Readability Reconsidered: A Cross-Dataset Analysis of Reference-Free Metrics
- Authors: Catarina G Belem, Parker Glenn, Alfy Samuel, Anoop Kumar, Daben Liu
- Abstract summary: This work investigates the factors shaping human perceptions of readability through the analysis of 897 judgments. We evaluate 15 popular readability metrics across five English datasets, contrasting them with six more nuanced, model-based metrics. Four model-based metrics consistently place among the top four in rank correlations with human judgments, while the best performing traditional metric achieves an average rank of 8.6.
- Score: 4.729984735375468
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic readability assessment plays a key role in ensuring effective and accessible written communication. Despite significant progress, the field is hindered by inconsistent definitions of readability and measurements that rely on surface-level text properties. In this work, we investigate the factors shaping human perceptions of readability through the analysis of 897 judgments, finding that, beyond surface-level cues, information content and topic strongly shape text comprehensibility. Furthermore, we evaluate 15 popular readability metrics across five English datasets, contrasting them with six more nuanced, model-based metrics. Our results show that four model-based metrics consistently place among the top four in rank correlations with human judgments, while the best performing traditional metric achieves an average rank of 8.6. These findings highlight a mismatch between current readability metrics and human perceptions, pointing to model-based approaches as a more promising direction.
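As a concrete illustration of the evaluation setup, the sketch below scores a handful of texts with a traditional surface-level metric and rank-correlates the scores with human judgments. The texts, ratings, and metric choice (Flesch Reading Ease via the `textstat` package) are illustrative assumptions, not the paper's data.
```python
# Sketch: comparing a traditional readability metric against human judgments
# via rank correlation, in the spirit of the paper's evaluation.
import textstat  # pip install textstat
from scipy.stats import spearmanr

# Hypothetical texts paired with human readability ratings (higher = easier).
texts = [
    "The cat sat on the mat.",
    "Quantum decoherence constrains the feasibility of macroscopic superpositions.",
    "We went to the store and bought some bread.",
]
human_scores = [4.8, 1.5, 4.5]

# Flesch Reading Ease: a surface-level metric (higher = easier to read).
metric_scores = [textstat.flesch_reading_ease(t) for t in texts]

# Rank correlation is what the paper uses to compare metrics with humans.
rho, p_value = spearmanr(metric_scores, human_scores)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")
```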
Related papers
- Evaluating the Evaluators: Are readability metrics good measures of readability? [36.138020084479784]
Plain Language Summarization (PLS) aims to distill complex documents into accessible summaries for non-expert audiences. Traditional readability metrics, such as Flesch-Kincaid Grade Level (FKGL), have not been compared to human readability judgments in PLS. We show that Language Models (LMs) are better judges of readability, with the best-performing model achieving a Pearson correlation of 0.56 with human judgments.
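For reference, FKGL is a fixed linear formula over sentence length and syllable counts; a minimal self-contained version is sketched below. The syllable counter is a crude vowel-group heuristic, so expect small deviations from library implementations.
```python
# Minimal Flesch-Kincaid Grade Level computation (approximate syllable counts).
import re

def count_syllables(word: str) -> int:
    # Count groups of consecutive vowels as syllables (crude approximation).
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fkgl(text: str) -> float:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    # Standard Flesch-Kincaid Grade Level formula.
    return 0.39 * (len(words) / len(sentences)) + 11.8 * (syllables / len(words)) - 15.59

print(fkgl("The cat sat on the mat. It was happy."))  # very low grade level
```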
arXiv Detail & Related papers (2025-08-26T17:38:42Z) - Reranking-based Generation for Unbiased Perspective Summarization [10.71668103641552]
We develop a test set for benchmarking metric reliability using human annotations. We show that traditional metrics underperform compared to language model-based metrics, which prove to be strong evaluators. Our findings aim to contribute to the reliable evaluation and development of perspective summarization methods.
arXiv Detail & Related papers (2025-06-19T00:01:43Z) - Measuring and Modifying the Readability of English Texts with GPT-4 [2.532202013576547]
We find readability estimates from GPT-4 Turbo and GPT-4o mini exhibit relatively high correlation with human judgments.
In a pre-registered human experiment, we ask whether GPT-4 Turbo can reliably make text easier or harder to read.
We find evidence to support this hypothesis, though considerable variance in human judgments remains unexplained.
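A minimal sketch of eliciting a readability rating from an LLM is shown below; the prompt wording, rating scale, and model name are assumptions rather than the authors' exact protocol. It requires the `openai` package and an API key.
```python
# Sketch: asking an LLM for a readability rating (assumed prompt and model).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def llm_readability(text: str, model: str = "gpt-4-turbo") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": (
                "On a scale from 1 (very hard to read) to 7 (very easy to read), "
                f"rate the readability of the following text. Reply with a single number.\n\n{text}"
            ),
        }],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

print(llm_readability("Quantum decoherence constrains macroscopic superpositions."))
```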
arXiv Detail & Related papers (2024-10-17T21:04:28Z) - Evaluation of Faithfulness Using the Longest Supported Subsequence [52.27522262537075]
We introduce a novel approach to evaluate faithfulness of machine-generated text by computing the longest noncontiguous subsequence of the claim that is supported by the context.
Using a new human-annotated dataset, we fine-tune a model to generate the Longest Supported Subsequence (LSS).
Our proposed metric demonstrates an 18% enhancement over the prevailing state-of-the-art metric for faithfulness on our dataset.
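The LSS itself is produced by a fine-tuned model; as a rough intuition for the "noncontiguous subsequence" notion it builds on, the sketch below computes a token-level longest common subsequence between a claim and its context. This is an illustrative simplification, not the paper's method.
```python
# Illustrative only: token-level longest common subsequence (LCS) between a
# claim and its context; the paper instead fine-tunes a model to produce
# the Longest Supported Subsequence.
def lcs_tokens(claim: str, context: str) -> list[str]:
    a, b = claim.split(), context.split()
    # dp[i][j] = length of the LCS of a[:i] and b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    # Backtrack to recover the subsequence itself.
    out, i, j = [], len(a), len(b)
    while i and j:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1]); i -= 1; j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]

claim = "the drug reduced symptoms in all patients"
context = "in the trial the drug reduced severe symptoms in most patients"
print(lcs_tokens(claim, context))  # claim tokens supported verbatim by the context
```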
arXiv Detail & Related papers (2023-08-23T14:18:44Z) - INSTRUCTSCORE: Explainable Text Generation Evaluation with Finegrained Feedback [80.57617091714448]
We present InstructScore, an explainable evaluation metric for text generation.
We fine-tune a text evaluation metric based on LLaMA, producing a score for generated text and a human-readable diagnostic report.
arXiv Detail & Related papers (2023-05-23T17:27:22Z) - ROSCOE: A Suite of Metrics for Scoring Step-by-Step Reasoning [63.77667876176978]
Large language models show improved downstream task interpretability when prompted to generate step-by-step reasoning to justify their final answers.
These reasoning steps greatly improve model interpretability and verification, but objectively studying their correctness is difficult.
We present ROSCOE, a suite of interpretable, unsupervised automatic scores that improve and extend previous text generation evaluation metrics.
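One ingredient of such unsupervised scores is embedding-based alignment between reasoning steps and the source; the sketch below illustrates that single idea with an assumed model and toy inputs, not ROSCOE's full suite.
```python
# Sketch of one ROSCOE-style signal: score each reasoning step by its
# embedding similarity to the source problem, flagging steps that drift.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

source = "If Ann has 3 apples and buys 2 more, how many apples does she have?"
steps = [
    "Ann starts with 3 apples.",
    "She buys 2 more apples.",
    "3 + 2 = 5, so Ann has 5 apples.",
]

src_emb = model.encode(source, convert_to_tensor=True)
step_embs = model.encode(steps, convert_to_tensor=True)
# Cosine similarity between each step and the source (one grounding signal).
sims = util.cos_sim(step_embs, src_emb).squeeze(1)
for step, sim in zip(steps, sims):
    print(f"{sim.item():.2f}  {step}")
```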
arXiv Detail & Related papers (2022-12-15T15:52:39Z) - Measuring the Measuring Tools: An Automatic Evaluation of Semantic Metrics for Text Corpora [5.254054636427663]
The ability to compare the semantic similarity between text corpora is important in a variety of natural language processing applications.
We propose a set of automatic and interpretable measures for assessing the characteristics of corpus-level semantic similarity metrics.
arXiv Detail & Related papers (2022-11-29T14:47:07Z) - REAM$\sharp$: An Enhancement Approach to Reference-based Evaluation Metrics for Open-domain Dialog Generation [63.46331073232526]
We present an enhancement approach to Reference-based EvAluation Metrics for open-domain dialogue systems.
A prediction model is designed to estimate the reliability of the given reference set.
We show how its predicted results can be helpful to augment the reference set, and thus improve the reliability of the metric.
arXiv Detail & Related papers (2021-05-30T10:04:13Z) - On the Interpretability and Significance of Bias Metrics in Texts: a PMI-based Approach [3.2326259807823026]
We analyze an alternative PMI-based metric to quantify biases in texts.
It can be expressed as a function of conditional probabilities, which provides a simple interpretation in terms of word co-occurrences.
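Concretely, for a target word w and two context word sets A and B, a bias score of this form can be written as log P(w|A) - log P(w|B). The sketch below estimates it from sentence-level co-occurrence counts on a toy corpus; the smoothing and estimator details are assumptions, not the paper's exact formulation.
```python
# Minimal sketch of a PMI-style bias score from co-occurrence counts.
import math

def pmi_bias(word: str, corpus: list[list[str]], ctx_a: set[str], ctx_b: set[str]) -> float:
    """log P(word | context A) - log P(word | context B), estimated from
    sentence-level co-occurrence counts (add-one smoothed)."""
    co_a = co_b = n_a = n_b = 0
    for sent in corpus:
        tokens = set(sent)
        if tokens & ctx_a:
            n_a += 1
            co_a += word in tokens
        if tokens & ctx_b:
            n_b += 1
            co_b += word in tokens
    p_w_given_a = (co_a + 1) / (n_a + 2)  # add-one smoothing
    p_w_given_b = (co_b + 1) / (n_b + 2)
    return math.log(p_w_given_a / p_w_given_b)

corpus = [["she", "is", "a", "nurse"], ["he", "is", "an", "engineer"],
          ["she", "is", "an", "engineer"], ["he", "is", "a", "nurse"],
          ["she", "is", "a", "nurse"]]
# Positive => "nurse" co-occurs more with the female context words.
print(pmi_bias("nurse", corpus, ctx_a={"she"}, ctx_b={"he"}))
```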
arXiv Detail & Related papers (2021-04-13T19:34:17Z) - LCEval: Learned Composite Metric for Caption Evaluation [37.2313913156926]
We propose a neural network-based learned metric to improve caption evaluation at the caption level.
This paper investigates the relationship between different linguistic features and the caption-level correlation of the learned metrics.
Our proposed metric not only outperforms the existing metrics in terms of caption-level correlation but it also shows a strong system-level correlation against human assessments.
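The core idea, a learned model that maps a vector of base metric scores to a human judgment, can be sketched in a few lines; the features, data, and regressor below are synthetic placeholders, not LCEval's actual architecture.
```python
# Sketch of a learned composite metric: regress human judgments on base metrics.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
# Each row: scores from several base metrics for one caption (e.g. BLEU, CIDEr, ...).
X = rng.random((200, 4))
# Toy "human judgment": an unknown weighted mix of the base metrics plus noise.
y = X @ np.array([0.5, 0.1, 0.3, 0.1]) + 0.05 * rng.standard_normal(200)

model = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
model.fit(X[:150], y[:150])
print("held-out R^2:", model.score(X[150:], y[150:]))
```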
arXiv Detail & Related papers (2020-12-24T06:38:24Z) - GO FIGURE: A Meta Evaluation of Factuality in Summarization [131.1087461486504]
We introduce GO FIGURE, a meta-evaluation framework for evaluating factuality evaluation metrics.
Our benchmark analysis on ten factuality metrics reveals that our framework provides a robust and efficient evaluation.
It also reveals that while QA metrics generally improve over standard metrics that measure factuality across domains, performance is highly dependent on the way in which questions are generated.
arXiv Detail & Related papers (2020-10-24T08:30:20Z) - Towards Question-Answering as an Automatic Metric for Evaluating the Content Quality of a Summary [65.37544133256499]
We propose a metric to evaluate the content quality of a summary using question-answering (QA).
We demonstrate the experimental benefits of QA-based metrics through an analysis of our proposed metric, QAEval.
arXiv Detail & Related papers (2020-10-01T15:33:09Z)
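The QA-based idea can be sketched as: pose questions answerable from the reference, then check whether a QA model recovers those answers from the candidate summary. The sketch below uses hand-written probe questions and a lenient match; QAEval generates questions automatically and scores answers more carefully.
```python
# Sketch of a QA-based content check for a summary (probes are hand-written).
from transformers import pipeline

qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

summary = "The company reported record revenue of $5 billion in 2023."
probes = [
    {"question": "How much revenue was reported?", "answer": "$5 billion"},
    {"question": "In what year was the revenue reported?", "answer": "2023"},
]

hits = 0
for probe in probes:
    pred = qa(question=probe["question"], context=summary)["answer"]
    # Lenient containment match in either direction.
    hits += probe["answer"].lower() in pred.lower() or pred.lower() in probe["answer"].lower()
print(f"content score: {hits / len(probes):.2f}")
```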