Human-Aligned Code Readability Assessment with Large Language Models
- URL: http://arxiv.org/abs/2510.16579v1
- Date: Sat, 18 Oct 2025 17:00:52 GMT
- Title: Human-Aligned Code Readability Assessment with Large Language Models
- Authors: Wendkûuni C. Ouédraogo, Yinghua Li, Xueqi Dang, Pawel Borsukiewicz, Xin Zhou, Anil Koyuncu, Jacques Klein, David Lo, Tegawendé F. Bissyandé
- Abstract summary: We introduce CoReEval, the first large-scale benchmark for evaluating LLM-based code readability assessment. LLMs offer a scalable alternative to static metrics, but their behavior as readability evaluators remains underexplored. Our findings show that developer-guided prompting grounded in human-defined readability dimensions improves alignment in structured contexts.
- Score: 15.17270025276759
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Code readability is crucial for software comprehension and maintenance, yet difficult to assess at scale. Traditional static metrics often fail to capture the subjective, context-sensitive nature of human judgments. Large Language Models (LLMs) offer a scalable alternative, but their behavior as readability evaluators remains underexplored. We introduce CoReEval, the first large-scale benchmark for evaluating LLM-based code readability assessment, comprising over 1.4 million model-snippet-prompt evaluations across 10 state-of-the-art LLMs. The benchmark spans 3 programming languages (Java, Python, CUDA), 2 code types (functional code and unit tests), 4 prompting strategies (ZSL, FSL, CoT, ToT), 9 decoding settings, and developer-guided prompts tailored to junior and senior personas. We compare LLM outputs against human annotations and a validated static model, analyzing numerical alignment (MAE, Pearson's r, Spearman's ρ) and justification quality (sentiment, aspect coverage, semantic clustering). Our findings show that developer-guided prompting grounded in human-defined readability dimensions improves alignment in structured contexts, enhances explanation quality, and enables lightweight personalization through persona framing. However, increased score variability highlights trade-offs between alignment, stability, and interpretability. CoReEval provides a robust foundation for prompt engineering, model alignment studies, and human-in-the-loop evaluation, with applications in education, onboarding, and CI/CD pipelines where LLMs can serve as explainable, adaptable reviewers.
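The abstract measures numerical alignment between LLM readability scores and human annotations via MAE, Pearson's r, and Spearman's ρ. The snippet below is a minimal illustrative sketch of how such alignment metrics can be computed; the score values are hypothetical and this is not CoReEval's actual evaluation code.

```python
# Illustrative sketch (not CoReEval's code): alignment metrics between
# hypothetical LLM readability scores and human annotations for the
# same code snippets, on a 1-5 readability scale.
import numpy as np
from scipy.stats import pearsonr, spearmanr

human_scores = np.array([4.2, 3.1, 2.5, 4.8, 3.9, 1.7])  # hypothetical annotations
llm_scores = np.array([4.0, 3.5, 2.2, 4.5, 4.1, 2.0])    # hypothetical LLM outputs

# Mean Absolute Error: average magnitude of the score gap per snippet.
mae = np.mean(np.abs(llm_scores - human_scores))

# Pearson's r: linear correlation between the two score vectors.
pearson_r, _ = pearsonr(llm_scores, human_scores)

# Spearman's rho: rank correlation, insensitive to monotonic rescaling.
spearman_rho, _ = spearmanr(llm_scores, human_scores)

print(f"MAE={mae:.2f}  Pearson r={pearson_r:.2f}  Spearman rho={spearman_rho:.2f}")
```

Lower MAE and higher correlations would indicate closer numerical alignment with human judgments; in the benchmark these metrics are computed per model, prompting strategy, and decoding setting.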
Related papers
- Learning to Judge: LLMs Designing and Applying Evaluation Rubrics [18.936553687978087]
Large language models (LLMs) are increasingly used as evaluators for natural language generation. We introduce GER-Eval to investigate whether LLMs can design and apply their own evaluation rubrics.
arXiv Detail & Related papers (2026-02-09T13:56:06Z) - LexInstructEval: Lexical Instruction Following Evaluation for Large Language Models [7.921987175359344]
LexInstructEval is a new benchmark and evaluation framework for fine-grained lexical instruction following. Our framework is built upon a formal, rule-based grammar that deconstructs complex instructions into a canonical <Procedure, Relation, Value> triplet. This grammar enables the systematic generation of a diverse dataset through a multi-stage, human-in-the-loop pipeline.
arXiv Detail & Related papers (2025-11-13T08:04:30Z) - evalSmarT: An LLM-Based Framework for Evaluating Smart Contract Generated Comments [0.0]
We present evalSmarT, a modular framework that leverages large language models (LLMs) as evaluators. We demonstrate its application in benchmarking comment generation tools and selecting the most informative outputs.
arXiv Detail & Related papers (2025-07-28T12:37:43Z) - Can Large Language Models Serve as Evaluators for Code Summarization? [47.21347974031545]
Large Language Models (LLMs) can serve as effective evaluators for code summarization methods. The proposed approach, CODERPE, prompts LLM agents to play diverse roles, such as code reviewer, code author, code editor, and system analyst. CODERPE achieves an 81.59% Spearman correlation with human evaluations, outperforming the existing BERTScore metric by 17.27%.
arXiv Detail & Related papers (2024-12-02T09:56:18Z) - Human-Like Code Quality Evaluation through LLM-based Recursive Semantic Comprehension [39.277408536940825]
Code quality evaluation involves scoring generated code quality based on a reference code for a specific problem statement. Currently, there are two main forms of evaluating code quality: match-based evaluation and execution-based evaluation.
arXiv Detail & Related papers (2024-11-30T01:49:25Z) - FineSurE: Fine-grained Summarization Evaluation using LLMs [22.62504593575933]
FineSurE is a fine-grained evaluator specifically tailored for the summarization task using large language models (LLMs).
It also employs completeness and conciseness criteria, in addition to faithfulness, enabling multi-dimensional assessment.
arXiv Detail & Related papers (2024-07-01T02:20:28Z) - CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark.
In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship.
We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z) - Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization [132.25202059478065]
We benchmark large language models (LLMs) on instruction controllable text summarization.
Our study reveals that instruction controllable text summarization remains a challenging task for LLMs.
arXiv Detail & Related papers (2023-11-15T18:25:26Z) - Exploring the Reliability of Large Language Models as Customized Evaluators for Diverse NLP Tasks [65.69651759036535]
We analyze whether large language models (LLMs) can serve as reliable alternatives to humans. This paper explores both conventional tasks (e.g., story generation) and alignment tasks (e.g., math reasoning). We find that LLM evaluators can generate unnecessary criteria or omit crucial criteria, resulting in a slight deviation from the experts.
arXiv Detail & Related papers (2023-10-30T17:04:35Z) - CoAnnotating: Uncertainty-Guided Work Allocation between Human and Large Language Models for Data Annotation [94.59630161324013]
We propose CoAnnotating, a novel paradigm for Human-LLM co-annotation of unstructured texts at scale.
Our empirical study shows CoAnnotating to be an effective means of allocating work, achieving up to 21% performance improvement over a random baseline across different datasets.
arXiv Detail & Related papers (2023-10-24T08:56:49Z) - Evaluating Large Language Models at Evaluating Instruction Following [54.49567482594617]
We introduce a challenging meta-evaluation benchmark, LLMBar, designed to test the ability of an LLM evaluator in discerning instruction-following outputs.
We discover that different evaluators exhibit distinct performance on LLMBar and even the highest-scoring ones have substantial room for improvement.
arXiv Detail & Related papers (2023-10-11T16:38:11Z) - FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets [69.91340332545094]
We introduce FLASK, a fine-grained evaluation protocol for both human-based and model-based evaluation.
We experimentally observe that the granularity of evaluation is crucial for attaining a holistic view of model performance.
arXiv Detail & Related papers (2023-07-20T14:56:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.