Measuring how changes in code readability attributes affect code quality evaluation by Large Language Models
- URL: http://arxiv.org/abs/2507.05289v2
- Date: Wed, 09 Jul 2025 18:24:41 GMT
- Title: Measuring how changes in code readability attributes affect code quality evaluation by Large Language Models
- Authors: Igor Regis da Silva Simoes, Elaine Venson
- Abstract summary: Code readability is one of the main aspects of code quality, influenced by various properties like identifier names, comments, code structure, and adherence to standards. This paper explores using Large Language Models (LLMs) to evaluate code quality attributes related to its readability in a standardized, reproducible, and consistent manner.
- Score: 2.3204178451683264
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Code readability is one of the main aspects of code quality, influenced by various properties like identifier names, comments, code structure, and adherence to standards. However, measuring this attribute poses challenges in both industry and academia. While static analysis tools assess attributes such as code smells and comment percentage, code reviews introduce an element of subjectivity. This paper explores using Large Language Models (LLMs) to evaluate code quality attributes related to its readability in a standardized, reproducible, and consistent manner. We conducted a quasi-experiment study to measure the effects of code changes on LLMs' interpretation of the code's readability quality attribute. Nine LLMs were tested, undergoing three interventions: removing comments, replacing identifier names with obscure names, and refactoring to remove code smells. Each intervention involved 10 batch analyses per LLM, collecting data on response variability. We compared the results with a known reference model and tool. The results showed that all LLMs were sensitive to the interventions, with agreement with the reference classifier being high for the original and refactored code scenarios. The LLMs demonstrated a strong semantic sensitivity that the reference model did not fully capture. A thematic analysis of the LLMs' reasoning confirmed that their evaluations directly reflected the nature of each intervention. The models also exhibited response variability, with 9.37% to 14.58% of executions showing a standard deviation greater than zero, indicating response oscillation, though this did not always compromise the statistical significance of the results. LLMs demonstrated potential for evaluating semantic quality aspects, such as the coherence of identifier names, comments, and documentation with the code's purpose.
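To make the experimental design concrete, the sketch below is a minimal, hypothetical Python illustration (not the authors' implementation) of the batch-analysis loop the abstract describes: each code variant, the original plus the three intervened versions, is scored repeatedly, and the standard deviation of the scores serves as the response-variability measure. The prompt wording, the 1-to-5 scale, and the `query_llm`, `strip_comments`, `obfuscate_names`, and `remove_code_smells` helpers are all assumptions introduced for illustration.

```python
import statistics

# Hypothetical prompt; the paper's actual prompt and rating scale are not given in this summary.
PROMPT = (
    "Rate the readability of the following class on a scale from 1 (poor) "
    "to 5 (excellent) and briefly justify your score.\n\n{code}"
)

def query_llm(model: str, prompt: str) -> int:
    """Placeholder for a call to one of the evaluated LLMs; swap in the actual
    client library for the model under test."""
    raise NotImplementedError

def run_batches(model: str, variants: dict[str, str], runs: int = 10) -> dict[str, float]:
    """Score each code variant `runs` times and return the standard deviation of
    the scores per variant; a value greater than zero indicates response oscillation."""
    variability = {}
    for name, code in variants.items():
        scores = [query_llm(model, PROMPT.format(code=code)) for _ in range(runs)]
        variability[name] = statistics.pstdev(scores)
    return variability

# The four scenarios from the abstract: the original code plus the three interventions
# (the transformation helpers below are hypothetical).
# variants = {
#     "original": source,
#     "no_comments": strip_comments(source),            # intervention 1: remove comments
#     "obscured_identifiers": obfuscate_names(source),  # intervention 2: obscure identifier names
#     "refactored": remove_code_smells(source),         # intervention 3: refactor away code smells
# }
# print(run_batches("model-under-test", variants))
```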
Related papers
- CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward [50.97588334916863]
We develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward. It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types. We introduce the VerifierBench benchmark, comprising model outputs collected from multiple data sources and augmented through manual analysis of meta-error patterns to enhance CompassVerifier.
arXiv Detail & Related papers (2025-08-05T17:55:24Z) - Re-Evaluating Code LLM Benchmarks Under Semantic Mutation [8.58692613099365]
We present an empirical study to investigate prompt sensitivity in code benchmarks. We propose a general framework that modifies prompt templates in a manner that preserves both their semantics and their structure. Our findings suggest that even slight prompt variations can lead to significant shifts in performance.
arXiv Detail & Related papers (2025-06-20T15:30:36Z) - What You See Is Not Always What You Get: An Empirical Study of Code Comprehension by Large Language Models [0.5735035463793009]
We investigate the vulnerability of large language models (LLMs) to imperceptible attacks, where hidden character manipulation in source code misleads LLMs' behaviour while remaining undetectable to human reviewers. These attacks include coding reordering, invisible coding characters, code deletions, and code homoglyphs. Our findings confirm the susceptibility of LLMs to imperceptible coding character attacks, while different LLMs present different negative correlations between perturbation magnitude and performance.
arXiv Detail & Related papers (2024-12-11T04:52:41Z) - Human-Like Code Quality Evaluation through LLM-based Recursive Semantic Comprehension [39.277408536940825]
Code quality evaluation involves scoring generated code quality based on a reference code for a specific problem statement. Currently, there are two main forms of evaluating code quality: match-based evaluation and execution-based evaluation.
arXiv Detail & Related papers (2024-11-30T01:49:25Z) - Evaluating Source Code Quality with Large Language Models: a comparative study [2.3204178451683264]
This paper describes and analyzes the results obtained using a Large Language Model (LLM) as a static analysis tool.
A total of 1,641 classes were analyzed, comparing the results in two versions of models: GPT 3.5 Turbo and GPT 4o.
The GPT 4o version did not present the same results, diverging from the previous model and Sonar by assigning a high classification to code that was assessed as lower quality.
arXiv Detail & Related papers (2024-08-07T18:44:46Z) - Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
Inspired by the principles in subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that SQC-Score is more preferred by human annotators than the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z) - Self-Evaluation Improves Selective Generation in Large Language Models [54.003992911447696]
We reformulate open-ended generation tasks into token-level prediction tasks.
We instruct an LLM to self-evaluate its answers.
We benchmark a range of scoring methods based on self-evaluation.
arXiv Detail & Related papers (2023-12-14T19:09:22Z) - CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark.
In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship.
We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z) - Semantic Consistency for Assuring Reliability of Large Language Models [9.040736633675136]
Large Language Models (LLMs) exhibit remarkable fluency and competence across various natural language tasks. We introduce a general measure of semantic consistency, and formulate multiple versions of this metric to evaluate the performance of various LLMs. We propose a novel prompting strategy, called Ask-to-Choose (A2C), to enhance semantic consistency.
arXiv Detail & Related papers (2023-08-17T18:11:33Z) - Bring Your Own Data! Self-Supervised Evaluation for Large Language Models [52.15056231665816]
We propose a framework for self-supervised evaluation of Large Language Models (LLMs).
We demonstrate self-supervised evaluation strategies for measuring closed-book knowledge, toxicity, and long-range context dependence.
We find strong correlations between self-supervised and human-supervised evaluations.
arXiv Detail & Related papers (2023-06-23T17:59:09Z) - CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing [139.77117915309023]
CRITIC allows large language models to validate and amend their own outputs in a manner similar to human interaction with tools.
Comprehensive evaluations involving free-form question answering, mathematical program synthesis, and toxicity reduction demonstrate that CRITIC consistently enhances the performance of LLMs.
arXiv Detail & Related papers (2023-05-19T15:19:44Z)