Measuring and Modifying the Readability of English Texts with GPT-4
- URL: http://arxiv.org/abs/2410.14028v1
- Date: Thu, 17 Oct 2024 21:04:28 GMT
- Title: Measuring and Modifying the Readability of English Texts with GPT-4
- Authors: Sean Trott, Pamela D. Rivière
- Abstract summary: We find that readability estimates from GPT-4 Turbo and GPT-4o mini exhibit relatively high correlation with human judgments.
In a pre-registered human experiment, we ask whether Turbo can reliably make text easier or harder to read.
We find evidence to support this hypothesis, though considerable variance in human judgments remains unexplained.
- Score: 2.532202013576547
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The success of Large Language Models (LLMs) in other domains has raised the question of whether LLMs can reliably assess and manipulate the readability of text. We approach this question empirically. First, using a published corpus of 4,724 English text excerpts, we find that readability estimates produced "zero-shot" from GPT-4 Turbo and GPT-4o mini exhibit relatively high correlation with human judgments (r = 0.76 and r = 0.74, respectively), outperforming estimates derived from traditional readability formulas and various psycholinguistic indices. Then, in a pre-registered human experiment (N = 59), we ask whether Turbo can reliably make text easier or harder to read. We find evidence to support this hypothesis, though considerable variance in human judgments remains unexplained. We conclude by discussing the limitations of this approach, including limited scope, as well as the validity of the "readability" construct and its dependence on context, audience, and goal.
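As a rough illustration of the zero-shot setup, the sketch below asks a GPT-4-class model for a per-excerpt readability rating and correlates the ratings with human judgments. The prompt wording, the 1-7 rating scale, and the data variables are illustrative assumptions, not the authors' exact protocol.

```python
# Minimal sketch: elicit a zero-shot readability rating per excerpt, then
# compute the Pearson correlation with mean human judgments.
from openai import OpenAI
from scipy.stats import pearsonr

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def rate_readability(text: str, model: str = "gpt-4-turbo") -> float:
    """Ask the model for a 1-7 readability rating (hypothetical scale)."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{
            "role": "user",
            "content": (
                "Rate the readability of the following text on a scale from "
                "1 (very hard to read) to 7 (very easy to read). "
                f"Respond with a single number.\n\n{text}"
            ),
        }],
    )
    return float(response.choices[0].message.content.strip())

# excerpts: list[str]; human_means: list[float] -- e.g., the 4,724-excerpt corpus
# model_scores = [rate_readability(t) for t in excerpts]
# r, _ = pearsonr(model_scores, human_means)  # the paper reports r = 0.76 for Turbo
```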
Related papers
- Iterative Prompt Refinement for Dyslexia-Friendly Text Summarization Using GPT-4o [1.4401311275746886]
This paper presents an empirical study on dyslexia-friendly text summarization using an iterative prompt-based refinement pipeline built on GPT-4o.
We evaluate the pipeline on approximately 2,000 news article samples, applying a readability target of Flesch Reading Ease >= 90.
Results show that the majority of summaries meet the readability threshold within four attempts, with many succeeding on the first try.
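A minimal sketch of the iterative loop described above, assuming a placeholder simplify_with_llm callable standing in for the GPT-4o call; the pipeline's actual prompts and feedback format are not reproduced here. Readability is checked with the textstat package.

```python
# Iteratively regenerate a summary until it hits the Flesch Reading Ease target.
import textstat

def summarize_until_readable(article: str, simplify_with_llm, max_attempts: int = 4) -> str:
    """Retry the LLM summary until FRE >= 90 or the attempt budget runs out."""
    feedback = None
    for _ in range(max_attempts):
        summary = simplify_with_llm(article, feedback=feedback)
        score = textstat.flesch_reading_ease(summary)
        if score >= 90:  # readability target from the abstract
            return summary
        # Pass the score back so the next attempt knows to simplify further.
        feedback = f"Previous summary scored FRE {score:.1f}; rewrite to reach >= 90."
    return summary  # best effort after max_attempts
```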
arXiv Detail & Related papers (2026-02-26T01:46:40Z) - Readability Reconsidered: A Cross-Dataset Analysis of Reference-Free Metrics [4.729984735375468]
This work investigates the factors shaping human perceptions of readability through the analysis of 897 judgments.
We evaluate 15 popular readability metrics across five English datasets, contrasting them with six more nuanced, model-based metrics.
Four model-based metrics consistently place among the top four in rank correlations with human judgments, while the best-performing traditional metric achieves an average rank of 8.6.
arXiv Detail & Related papers (2025-10-17T06:17:21Z) - Evaluating the Evaluators: Are readability metrics good measures of readability? [36.138020084479784]
Plain Language Summarization (PLS) aims to distill complex documents into accessible summaries for non-expert audiences.
Traditional readability metrics, such as Flesch-Kincaid Grade Level (FKGL), have not been compared to human readability judgments in PLS.
We show that Language Models (LMs) are better judges of readability, with the best-performing model achieving a Pearson correlation of 0.56 with human judgments.
arXiv Detail & Related papers (2025-08-26T17:38:42Z) - When Punctuation Matters: A Large-Scale Comparison of Prompt Robustness Methods for LLMs [55.20230501807337]
We present the first systematic evaluation of 5 methods for improving prompt robustness within a unified experimental framework.
We benchmark these techniques on 8 models from the Llama, Qwen, and Gemma families across 52 tasks from the Natural Instructions dataset.
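The summary does not spell out the five methods, but the underlying measurement can be sketched as follows: run the same task under small punctuation and formatting perturbations of a prompt and report the accuracy spread. The specific perturbations and the evaluate callback are illustrative assumptions.

```python
# Sketch of prompt-sensitivity measurement via formatting perturbations.
def perturb(prompt: str) -> list[str]:
    """A few punctuation/formatting variants of the same prompt."""
    return [
        prompt,
        prompt.rstrip("."),           # drop the final period
        prompt + "\n",                # add a trailing newline
        prompt.replace(", ", " , "),  # space out commas
        prompt.replace(".", "!"),     # swap end punctuation
    ]

def robustness_gap(prompt: str, evaluate) -> float:
    """evaluate(p) -> task accuracy under prompt p; the spread measures sensitivity."""
    accuracies = [evaluate(p) for p in perturb(prompt)]
    return max(accuracies) - min(accuracies)
```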
arXiv Detail & Related papers (2025-08-15T10:32:50Z) - Beyond Chains of Thought: Benchmarking Latent-Space Reasoning Abilities in Large Language Models [0.0]
Large language models (LLMs) can perform reasoning computations both internally within their latent space and externally.
This study introduces a benchmark designed to quantify model-internal reasoning in different domains.
arXiv Detail & Related papers (2025-04-14T18:15:27Z) - ExaGPT: Example-Based Machine-Generated Text Detection for Human Interpretability [62.285407189502216]
Incorrect decisions when detecting texts generated by Large Language Models (LLMs) can cause grave mistakes.
We introduce ExaGPT, an interpretable detection approach grounded in the human decision-making process.
We show that ExaGPT outperforms prior strong detectors by up to +40.9 accuracy points at a false positive rate of 1%.
arXiv Detail & Related papers (2025-02-17T01:15:07Z) - Benchmark on Peer Review Toxic Detection: A Challenging Task with a New Dataset [6.106100820330045]
This work explores an important but underexplored area: detecting toxicity in peer reviews.
We first define toxicity in peer reviews across four distinct categories and curate a dataset of peer reviews from the OpenReview platform.
We benchmark a variety of models, including a dedicated toxicity detection model and a sentiment analysis model.
arXiv Detail & Related papers (2025-02-01T23:01:39Z) - Beyond Turing Test: Can GPT-4 Sway Experts' Decisions? [14.964922012236498]
This paper explores how generated text impacts readers' decisions, focusing on both amateur and expert audiences.
Our findings indicate that GPT-4 can generate persuasive analyses affecting the decisions of both amateurs and professionals.
The results highlight a high correlation between real-world evaluation through audience reactions and the current multi-dimensional evaluators commonly used for generative models.
arXiv Detail & Related papers (2024-09-25T07:55:36Z) - Says Who? Effective Zero-Shot Annotation of Focalization [0.0]
Focalization, the perspective through which narrative is presented, is encoded via a wide range of lexico-grammatical features.
Even trained annotators frequently disagree on correct labels, suggesting this task is both qualitatively and computationally challenging.
Despite the challenging nature of the task, we find that LLMs show comparable performance to trained human annotators, with GPT-4o achieving an average F1 of 84.79%.
arXiv Detail & Related papers (2024-09-17T17:50:15Z) - One Thousand and One Pairs: A "novel" challenge for long-context language models [56.60667988954638]
NoCha is a dataset of 1,001 pairs of true and false claims about 67 fictional books.
Our annotators confirm that the largest share of pairs in NoCha require global reasoning over the entire book to verify.
On average, models perform much better on pairs that require only sentence-level retrieval than on those that require global reasoning.
arXiv Detail & Related papers (2024-06-24T02:03:57Z) - Supporting Human Raters with the Detection of Harmful Content using Large Language Models [8.580258386804282]
We demonstrate that large language models (LLMs) can achieve 90% accuracy when compared to human verdicts.
We propose five design patterns that integrate LLMs with human rating.
We share how piloting our proposed techniques in a real-world review queue yielded a 41.5% improvement in the use of available human rater capacity.
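One plausible shape for such an integration, sketched under assumptions: an LLM pre-screens every item and only low-confidence cases consume human rater capacity. This illustrates the general triage idea, not any specific one of the paper's five patterns.

```python
# Confidence-based triage: auto-resolve confident LLM verdicts, escalate the rest.
from dataclasses import dataclass

@dataclass
class Verdict:
    label: str         # e.g. "harmful" / "benign"
    confidence: float  # model-reported or calibrated confidence in [0, 1]

def route(item, llm_classify, human_queue: list, auto_threshold: float = 0.95):
    """Return a label for confident cases; queue uncertain ones for human raters."""
    verdict: Verdict = llm_classify(item)
    if verdict.confidence >= auto_threshold:
        return verdict.label   # resolved without spending human capacity
    human_queue.append(item)   # human raters review only the uncertain slice
    return None
```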
arXiv Detail & Related papers (2024-06-18T17:12:50Z) - An Evaluation of Estimative Uncertainty in Large Language Models [3.04503073434724]
Estimative uncertainty has long been an area of study -- including by intelligence agencies like the CIA.
This study compares estimative uncertainty in commonly used large language models (LLMs) to that of humans, and to each other.
We show that LLMs like GPT-3.5 and GPT-4 align with human estimates for some, but not all, words of estimative probability (WEPs) presented in English.
arXiv Detail & Related papers (2024-05-24T03:39:31Z) - Language Models can Evaluate Themselves via Probability Discrepancy [38.54454263880133]
We propose a new self-evaluation method, ProbDiff, for assessing the efficacy of various Large Language Models (LLMs).
It uniquely utilizes the LLMs being tested to compute the probability discrepancy between the initial response and its revised versions.
Our findings reveal that ProbDiff achieves results on par with those obtained from evaluations based on GPT-4.
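A rough sketch of the probability-discrepancy idea, assuming the core quantity is the log-likelihood a model assigns to its initial response versus a revised one; the exact ProbDiff formulation is not reproduced here, and GPT-2 stands in for the LLM under test.

```python
# Compare the log-likelihood an LM assigns to two candidate responses.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")               # stand-in model
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def sequence_logprob(prompt: str, response: str) -> float:
    """log p(response | prompt) under the scoring model."""
    ids = tok(prompt + response, return_tensors="pt").input_ids
    n_prompt = tok(prompt, return_tensors="pt").input_ids.shape[1]
    logprobs = torch.log_softmax(lm(ids).logits[:, :-1], dim=-1)
    token_lp = logprobs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp[:, n_prompt - 1:].sum().item()  # response tokens only

def prob_discrepancy(prompt: str, initial: str, revised: str) -> float:
    # Positive values mean the model prefers its initial answer to the revision.
    return sequence_logprob(prompt, initial) - sequence_logprob(prompt, revised)
```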
arXiv Detail & Related papers (2024-05-17T03:50:28Z) - Monitoring AI-Modified Content at Scale: A Case Study on the Impact of ChatGPT on AI Conference Peer Reviews [51.453135368388686]
We present an approach for estimating the fraction of text in a large corpus which is likely to be substantially modified or produced by a large language model (LLM).
Our maximum likelihood model leverages expert-written and AI-generated reference texts to accurately and efficiently examine real-world LLM-use at the corpus level.
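A toy version of the corpus-level maximum-likelihood idea: model observed word counts as a mixture of an AI reference distribution and a human reference distribution, then pick the mixture weight that maximizes the corpus log-likelihood. The word-level parameterization and grid search are simplifications relative to the paper.

```python
# Estimate the LLM-modified fraction alpha by maximum likelihood over a grid.
import numpy as np

def estimate_ai_fraction(counts: np.ndarray, p_ai: np.ndarray, p_human: np.ndarray) -> float:
    """counts[w]: corpus occurrences of word w; p_ai/p_human: reference word probs."""
    alphas = np.linspace(0.0, 1.0, 1001)
    # Corpus log-likelihood under each candidate mixture alpha*p_ai + (1-alpha)*p_human.
    ll = [np.dot(counts, np.log(a * p_ai + (1 - a) * p_human + 1e-12)) for a in alphas]
    return float(alphas[int(np.argmax(ll))])
```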
arXiv Detail & Related papers (2024-03-11T21:51:39Z) - An In-depth Evaluation of GPT-4 in Sentence Simplification with Error-based Human Assessment [10.816677544269782]
We design an error-based human annotation framework to assess GPT-4's simplification capabilities.
Results show that GPT-4 generally generates fewer erroneous simplification outputs compared to the current state-of-the-art.
arXiv Detail & Related papers (2024-03-08T00:19:24Z) - How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts [54.07541591018305]
We present MAD-Bench, a benchmark that contains 1000 test samples divided into 5 categories, such as non-existent objects, count of objects, and spatial relationships.
We provide a comprehensive analysis of popular MLLMs, ranging from GPT-4v, Reka, and Gemini-Pro to open-source models such as LLaVA-NeXT and MiniCPM-Llama3.
While GPT-4o achieves 82.82% accuracy on MAD-Bench, the accuracy of every other model in our experiments falls between 9% and 50%.
arXiv Detail & Related papers (2024-02-20T18:31:27Z) - Goodhart's Law Applies to NLP's Explanation Benchmarks [57.26445915212884]
We critically examine two sets of metrics: the ERASER metrics (comprehensiveness and sufficiency) and the EVAL-X metrics.
We show that we can inflate a model's comprehensiveness and sufficiency scores dramatically without altering its predictions or explanations on in-distribution test inputs.
Our results raise doubts about the ability of current metrics to guide explainability research, underscoring the need for a broader reassessment of what precisely these metrics are intended to capture.
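For reference, the two ERASER metrics in question are standardly defined from the model's predicted-class probability: comprehensiveness drops the rationale and measures how much the prediction suffers, while sufficiency keeps only the rationale and measures how much is preserved. The predict_proba helper below is a placeholder.

```python
# Standard ERASER rationale metrics; predict_proba(tokens) -> probability of the
# model's originally predicted class given only those tokens.
def comprehensiveness(predict_proba, tokens, rationale) -> float:
    # Remove the rationale: a faithful explanation should hurt the prediction.
    remainder = [t for t in tokens if t not in rationale]
    return predict_proba(tokens) - predict_proba(remainder)

def sufficiency(predict_proba, tokens, rationale) -> float:
    # Keep only the rationale: a faithful explanation should preserve the prediction.
    return predict_proba(tokens) - predict_proba(list(rationale))
```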
arXiv Detail & Related papers (2023-08-28T03:03:03Z) - Evaluation of Faithfulness Using the Longest Supported Subsequence [52.27522262537075]
We introduce a novel approach to evaluating the faithfulness of machine-generated text by computing the longest noncontinuous substring of the claim that is supported by the context.
Using a new human-annotated dataset, we finetune a model to generate the Longest Supported Subsequence (LSS).
Our proposed metric demonstrates an 18% enhancement over the prevailing state-of-the-art metric for faithfulness on our dataset.
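As a conceptual approximation only: a longest noncontinuous substring of the claim supported by the context can be computed as a longest common subsequence between claim and context tokens. The paper instead finetunes a model to generate the LSS; the dynamic-programming sketch below merely illustrates the underlying quantity.

```python
# Approximate an LSS-style score as normalized longest common subsequence (LCS).
def lss_score(claim: str, context: str) -> float:
    a, b = claim.split(), context.split()
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]  # dp[i][j] = LCS of a[:i], b[:j]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][n] / max(m, 1)  # fraction of claim tokens supported, in order
```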
arXiv Detail & Related papers (2023-08-23T14:18:44Z) - INSTRUCTSCORE: Explainable Text Generation Evaluation with Finegrained Feedback [80.57617091714448]
We present InstructScore, an explainable evaluation metric for text generation.
We fine-tune a text evaluation metric based on LLaMA, producing a score for generated text and a human-readable diagnostic report.
arXiv Detail & Related papers (2023-05-23T17:27:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.