On Iterative Evaluation and Enhancement of Code Quality Using GPT-4o
- URL: http://arxiv.org/abs/2502.07399v1
- Date: Tue, 11 Feb 2025 09:27:00 GMT
- Title: On Iterative Evaluation and Enhancement of Code Quality Using GPT-4o
- Authors: Rundong Liu, Andre Frade, Amal Vaidya, Maxime Labonne, Marcus Kaiser, Bismayan Chakrabarti, Jonathan Budd, Sean Moran
- Abstract summary: This paper introduces CodeQUEST, a novel framework leveraging Large Language Models (LLMs) to iteratively evaluate and enhance code quality. The framework is divided into two main components: an Evaluator that assesses code quality across ten dimensions, providing both quantitative scores and qualitative summaries, and an Optimizer that iteratively improves the code based on the Evaluator's feedback. Our study demonstrates that CodeQUEST can effectively and robustly evaluate code quality, with its assessments aligning with established code quality metrics.
- Score: 1.5960340244043023
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This paper introduces CodeQUEST, a novel framework leveraging Large Language Models (LLMs) to iteratively evaluate and enhance code quality across multiple dimensions, including readability, maintainability, efficiency, and security. The framework is divided into two main components: an Evaluator that assesses code quality across ten dimensions, providing both quantitative scores and qualitative summaries, and an Optimizer that iteratively improves the code based on the Evaluator's feedback. Our study demonstrates that CodeQUEST can effectively and robustly evaluate code quality, with its assessments aligning closely with established code quality metrics. Through a series of experiments using a curated dataset of Python and JavaScript examples, CodeQUEST demonstrated significant improvements in code quality, achieving a mean relative percentage improvement of 52.6%. The framework's evaluations were validated against a set of proxy metrics comprising the Pylint Score, Radon Maintainability Index, and Bandit output logs, showing a meaningful correlation. This highlights the potential of LLMs in automating code quality evaluation and improvement processes, presenting a significant advancement toward enhancing software development practices. The code implementation of the framework is available at: https://github.com/jpmorganchase/CodeQuest.
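To make the evaluate-then-optimize loop concrete, here is a minimal sketch assuming a generic `llm` callable that wraps GPT-4o and returns text. The prompt wording, JSON response format, score aggregation, and the dimension list (only four of the ten dimensions are named in the abstract) are illustrative assumptions, not the CodeQUEST implementation linked above.

```python
# Illustrative sketch of an iterative Evaluator/Optimizer loop in the spirit of CodeQUEST.
# NOT the official implementation (see https://github.com/jpmorganchase/CodeQuest);
# the `llm` callable, prompts, score scale, and dimension names are assumptions.

import json
from typing import Callable

# Four of the ten quality dimensions are named in the abstract; the rest are omitted here.
DIMENSIONS = ["readability", "maintainability", "efficiency", "security"]

def evaluate(code: str, llm: Callable[[str], str]) -> dict:
    """Evaluator: ask the model for per-dimension scores and a qualitative summary."""
    prompt = (
        "Rate the following code on each dimension "
        f"({', '.join(DIMENSIONS)}) and reply as JSON with keys "
        "'scores' (dimension -> number) and 'summary' (text):\n\n" + code
    )
    return json.loads(llm(prompt))

def optimize(code: str, feedback: dict, llm: Callable[[str], str]) -> str:
    """Optimizer: ask the model to rewrite the code, addressing the Evaluator's feedback."""
    prompt = (
        "Improve the following code based on this feedback, preserving its behaviour.\n"
        f"Feedback: {feedback['summary']}\n\nCode:\n{code}"
    )
    return llm(prompt)

def quest(code: str, llm: Callable[[str], str], max_iters: int = 5) -> str:
    """Alternate evaluation and optimization until scores stop improving or the budget runs out."""
    feedback = evaluate(code, llm)
    best = sum(feedback["scores"].values())
    for _ in range(max_iters):
        candidate = optimize(code, feedback, llm)
        cand_feedback = evaluate(candidate, llm)
        cand_score = sum(cand_feedback["scores"].values())
        if cand_score <= best:
            break  # no further improvement; keep the previous version
        code, feedback, best = candidate, cand_feedback, cand_score
    return code
```

In the paper's setup, proxy metrics such as the Pylint score, Radon Maintainability Index, and Bandit findings are computed on the code before and after optimization to cross-check the Evaluator's judgments.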
Related papers
- Leveraging Reward Models for Guiding Code Review Comment Generation [13.306560805316103]
Code review is a crucial component of modern software development, involving the evaluation of code quality, providing feedback on potential issues, and refining the code to address identified problems. Deep learning techniques can tackle the generative aspect of code review by commenting on given code as a human reviewer would. In this paper, we introduce CoRAL, a deep learning framework that automates review comment generation by exploiting reinforcement learning with a reward mechanism.
arXiv Detail & Related papers (2025-06-04T21:31:38Z) - CoQuIR: A Comprehensive Benchmark for Code Quality-Aware Information Retrieval [31.817325318218003]
CoQuIR is the first large-scale, multilingual benchmark designed to evaluate quality-aware code retrieval. CoQuIR provides fine-grained quality annotations for 42,725 queries and 134,907 code snippets in 11 programming languages.
arXiv Detail & Related papers (2025-05-31T13:00:17Z) - Training Language Models to Generate Quality Code with Program Analysis Feedback [66.0854002147103]
Code generation with large language models (LLMs) is increasingly adopted in production but fails to ensure code quality. We propose REAL, a reinforcement learning framework that incentivizes LLMs to generate production-quality code.
arXiv Detail & Related papers (2025-05-28T17:57:47Z) - CoCo-Bench: A Comprehensive Code Benchmark For Multi-task Large Language Model Evaluation [19.071855537400463]
Large language models (LLMs) play a crucial role in software engineering, excelling in tasks like code generation and maintenance.
CoCo-Bench is designed to evaluate LLMs across four critical dimensions: code understanding, code generation, code modification, and code review.
arXiv Detail & Related papers (2025-04-29T11:57:23Z) - Code Summarization Beyond Function Level [0.213063058314067]
This study investigated the effectiveness of code summarization models beyond the function level.
The fine-tuned state-of-the-art CodeT5+ base model excelled in code summarization.
Repository-level summarization exhibited promising potential but requires significant computational resources.
arXiv Detail & Related papers (2025-02-23T20:31:21Z) - Scoring Verifiers: Evaluating Synthetic Verification for Code and Reasoning [59.25951947621526]
We propose an approach which can transform existing coding benchmarks into scoring and ranking datasets to evaluate the effectiveness of synthetic verifiers.
We release four new benchmarks (HE-R, HE-R+, MBPP-R, and MBPP-R+) and analyze synthetic verification methods with standard, reasoning-based, and reward-based LLMs.
Our experiments show that reasoning can significantly improve test case generation and that scaling the number of test cases enhances the verification accuracy.
arXiv Detail & Related papers (2025-02-19T15:32:11Z) - Harnessing Large Language Models for Curated Code Reviews [2.5944208050492183]
In code review, generating structured and relevant comments is crucial for identifying code issues and facilitating accurate code changes. Existing code review datasets are often noisy and unrefined, posing limitations to the learning potential of AI models. We propose a curation pipeline designed to enhance the quality of the largest publicly available code review dataset.
arXiv Detail & Related papers (2025-02-05T18:15:09Z) - CodEv: An Automated Grading Framework Leveraging Large Language Models for Consistent and Constructive Feedback [0.0]
This study presents an automated grading framework, CodEv, which leverages Large Language Models (LLMs) to provide consistent and constructive feedback. Our framework also integrates LLM ensembles to improve the accuracy and consistency of scores, along with agreement tests to deliver reliable feedback and code review comments.
arXiv Detail & Related papers (2025-01-10T03:09:46Z) - Can Large Language Models Serve as Evaluators for Code Summarization? [47.21347974031545]
Large Language Models (LLMs) can serve as effective evaluators for code summarization methods. The proposed CODERPE prompts an LLM agent to play diverse roles, such as code reviewer, code author, code editor, and system analyst. CODERPE achieves an 81.59% Spearman correlation with human evaluations, outperforming the existing BERTScore metric by 17.27%.
arXiv Detail & Related papers (2024-12-02T09:56:18Z) - CodeDPO: Aligning Code Models with Self Generated and Verified Source Code [52.70310361822519]
We propose CodeDPO, a framework that integrates preference learning into code generation to improve two key code preference factors: code correctness and efficiency.
CodeDPO employs a novel dataset construction method, utilizing a self-generation-and-validation mechanism that simultaneously generates and evaluates code and test cases.
arXiv Detail & Related papers (2024-10-08T01:36:15Z) - DOCE: Finding the Sweet Spot for Execution-Based Code Generation [69.5305729627198]
We propose a comprehensive framework that includes candidate generation, n-best reranking, minimum Bayes risk (MBR) decoding, and self-debugging as the core components.
Our findings highlight the importance of execution-based methods and the gap between execution-based and execution-free methods.
arXiv Detail & Related papers (2024-08-25T07:10:36Z) - AI-powered Code Review with LLMs: Early Results [10.37036924997437]
We present a novel approach to improving software quality and efficiency through a Large Language Model (LLM)-based model.
Our proposed LLM-based AI agent model is trained on large code repositories.
It aims to detect code smells, identify potential bugs, provide suggestions for improvement, and optimize the code.
arXiv Detail & Related papers (2024-04-29T08:27:50Z) - Improving the Learning of Code Review Successive Tasks with Cross-Task Knowledge Distillation [1.0878040851638]
We introduce a novel deep-learning architecture, named DISCOREV, which employs cross-task knowledge distillation to address these tasks simultaneously.
We show that our approach generates better review comments, as measured by the BLEU score, as well as more accurate code refinement according to the CodeBLEU score.
arXiv Detail & Related papers (2024-02-03T07:02:22Z) - QAFactEval: Improved QA-Based Factual Consistency Evaluation for Summarization [116.56171113972944]
We show that carefully choosing the components of a QA-based metric is critical to performance.
Our solution improves upon the best-performing entailment-based metric and achieves state-of-the-art performance.
arXiv Detail & Related papers (2021-12-16T00:38:35Z) - CodeBLEU: a Method for Automatic Evaluation of Code Synthesis [57.87741831987889]
In the area of code synthesis, the commonly used evaluation metric is BLEU or perfect accuracy.
We introduce a new automatic evaluation metric, dubbed CodeBLEU.
It absorbs the strength of BLEU in the n-gram match and further injects code syntax via abstract syntax trees (AST) and code semantics via data-flow.
arXiv Detail & Related papers (2020-09-22T03:10:49Z)
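To illustrate how a CodeBLEU-style metric blends these signals, the sketch below mixes a token-level BLEU with a crude AST-overlap term. The syntax and data-flow components here are simplified stand-ins for the paper's subtree and data-flow matching, and the equal weights are an illustrative default; this is not the reference implementation.

```python
# Simplified sketch of a CodeBLEU-style weighted combination (not the reference implementation).
# The syntax and data-flow components below are crude stand-ins for illustration only.

import ast
from collections import Counter
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def ngram_bleu(reference: str, hypothesis: str) -> float:
    """Plain token-level BLEU on whitespace tokens."""
    smooth = SmoothingFunction().method1
    return sentence_bleu([reference.split()], hypothesis.split(), smoothing_function=smooth)

def ast_match(reference: str, hypothesis: str) -> float:
    """Crude syntax signal: overlap of AST node types (the real metric matches subtrees)."""
    ref_nodes = Counter(type(n).__name__ for n in ast.walk(ast.parse(reference)))
    hyp_nodes = Counter(type(n).__name__ for n in ast.walk(ast.parse(hypothesis)))
    overlap = sum((ref_nodes & hyp_nodes).values())
    return overlap / max(sum(ref_nodes.values()), 1)

def code_bleu(reference: str, hypothesis: str,
              weights=(0.25, 0.25, 0.25, 0.25)) -> float:
    """Weighted mix of n-gram, weighted n-gram, syntax, and data-flow components.
    Here the weighted n-gram and data-flow terms reuse simpler proxies."""
    bleu = ngram_bleu(reference, hypothesis)
    weighted_bleu = bleu          # proxy: the real metric up-weights language keywords
    syntax = ast_match(reference, hypothesis)
    dataflow = syntax             # proxy: the real metric matches data-flow graphs
    a, b, c, d = weights
    return a * bleu + b * weighted_bleu + c * syntax + d * dataflow

print(round(code_bleu("def add(a, b):\n    return a + b",
                      "def add(x, y):\n    return x + y"), 3))
```

On the toy example, renaming the parameters lowers the n-gram score while the crude syntax term stays at 1.0, which hints at why syntax and data-flow terms are added on top of plain n-gram matching.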