PELLI: Framework to effectively integrate LLMs for quality software generation
- URL: http://arxiv.org/abs/2602.10808v1
- Date: Wed, 11 Feb 2026 12:51:08 GMT
- Title: PELLI: Framework to effectively integrate LLMs for quality software generation
- Authors: Rasmus Krebs, Somnath Mazumdar
- Abstract summary: This paper proposes a comprehensive code quality assessment framework called Programmatic Excellence via LLM Iteration (PELLI). PELLI is an iterative analysis-based process that upholds high-quality code changes. Overall, based on three nonfunctional requirements, we have found that GPT-4T and Gemini performed slightly better.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Recent studies have revealed that when LLMs are appropriately prompted and configured, they demonstrate mixed results that often meet or exceed baseline performance. However, these comparisons have two primary issues: first, they mostly considered only reliability as a comparison metric; second, they selected only a few LLMs (such as Codex and ChatGPT) for comparison. This paper proposes a comprehensive code quality assessment framework called Programmatic Excellence via LLM Iteration (PELLI). PELLI is an iterative, analysis-based process that upholds high-quality code changes. We extended the state of the art by performing a comprehensive evaluation that generates quantitative metrics for three primary nonfunctional requirements (namely maintainability, performance, and reliability) across five popular LLMs. To demonstrate PELLI's applicability, we selected three application domains while following Python coding standards. Following this framework, practitioners can ensure harmonious integration between LLMs and human developers so that the models' potential is fully realized. PELLI can serve as a practical guide for developers aiming to leverage LLMs while adhering to recognized quality standards. This study's outcomes are crucial for advancing LLM technologies in real-world applications, providing stakeholders with a clear understanding of where these LLMs excel and where they require further refinement. Overall, across the three nonfunctional requirements, we found that GPT-4T and Gemini performed slightly better. We also found that prompt design can influence overall code quality. In addition, each application domain exhibited both high and low scores across various metrics, and even within the same metric across different prompts.
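The abstract stops short of implementation detail, so below is a minimal sketch of what an iterative, metrics-gated generation loop in the spirit of PELLI might look like, written in Python since the study follows Python coding standards. Everything here is an illustrative assumption rather than the authors' method: the `llm` callable, the stand-in metric probes (a parse check for reliability, a function-length heuristic for maintainability; performance is omitted), and the 0.9 acceptance threshold are not taken from the paper.

```python
import ast

def reliability_score(source: str) -> float:
    """Crude reliability proxy (assumption): 1.0 if the code parses, else 0.0."""
    try:
        ast.parse(source)
        return 1.0
    except SyntaxError:
        return 0.0

def maintainability_score(source: str) -> float:
    """Crude maintainability proxy (assumption): penalize long function bodies."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return 0.0
    funcs = [n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)]
    if not funcs:
        return 1.0
    longest = max(len(f.body) for f in funcs)
    return max(0.0, 1.0 - longest / 50.0)

def assess(source: str) -> dict:
    """Score generated code on the stand-in nonfunctional metrics above."""
    return {
        "reliability": reliability_score(source),
        "maintainability": maintainability_score(source),
    }

def pelli_style_loop(prompt: str, llm, rounds: int = 3, target: float = 0.9) -> str:
    """Regenerate until every metric clears `target` or `rounds` is exhausted.

    `llm` is a hypothetical callable mapping a prompt string to source code.
    """
    source = llm(prompt)
    for _ in range(rounds):
        scores = assess(source)
        if min(scores.values()) >= target:
            break  # all quality gates passed
        feedback = ", ".join(f"{k}={v:.2f}" for k, v in scores.items())
        source = llm(f"{prompt}\n# Improve these quality scores: {feedback}")
    return source
```

In the paper's setting, the stand-in probes would be replaced by the framework's actual quantitative metrics for maintainability, performance, and reliability, and the feedback prompt would follow whichever prompt designs the study compares.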
Related papers
- LaQual: A Novel Framework for Automated Evaluation of LLM App Quality
LaQual is an automated framework for evaluating the quality of LLM apps. LaQual consists of three main stages: first, it labels and classifies LLM apps in a hierarchical way to accurately match them to different scenarios. Experiments on a popular LLM app store show that LaQual is effective.
arXiv Detail & Related papers (2025-08-26T03:25:49Z) - Let's Use ChatGPT To Write Our Paper! Benchmarking LLMs To Write the Introduction of a Research Paper
SciIG is a task that evaluates LLMs' ability to produce coherent introductions from titles, abstracts, and related works. We assess five state-of-the-art models, including open-source (DeepSeek-v3, Gemma-3-12B, LLaMA 4-Maverick, MistralAI Small 3.1) and closed-source GPT-4o systems. Results demonstrate LLaMA-4 Maverick's superior performance on most metrics, particularly in semantic similarity and faithfulness.
arXiv Detail & Related papers (2025-08-19T21:11:11Z) - On the Effectiveness of LLM-as-a-judge for Code Generation and Summarization
Large Language Models have recently been exploited as judges for complex natural language processing tasks, such as Q&A. We study the effectiveness of LLMs-as-a-judge for two code-related tasks, namely code generation and code summarization.
arXiv Detail & Related papers (2025-07-22T13:40:26Z) - Teamwork makes the dream work: LLMs-Based Agents for GitHub README.MD Summarization
We propose Metagente as a novel approach to amplify the synergy of various Large Language Models (LLMs). Metagente is a multi-agent framework based on a series of LLMs that self-optimizes the system through evaluation, feedback, and cooperation among specialized agents. The performance gain compared to GitSum, the most relevant benchmark, ranges from 27.63% to 60.43%.
arXiv Detail & Related papers (2025-03-13T20:42:39Z) - A Real-World Benchmark for Evaluating Fine-Grained Issue Solving Capabilities of Large Language Models
FAUN-Eval is a benchmark specifically designed to evaluate the Fine-grAined issUe solviNg capabilities of LLMs. It is constructed using a dataset curated from 30 well-known GitHub repositories. We evaluate ten LLMs with FAUN-Eval, including four closed-source and six open-source models.
arXiv Detail & Related papers (2024-11-27T03:25:44Z) - Beyond Correctness: Benchmarking Multi-dimensional Code Generation for Large Language Models
This paper proposes the RACE benchmark, which comprehensively evaluates the quality of code generated by large language models.
We analyze 28 representative LLMs based on RACE and find that current correctness-centric benchmarks fail to capture the multifaceted requirements of code in real-world scenarios.
arXiv Detail & Related papers (2024-07-16T08:08:48Z) - What's Wrong with Your Code Generated by Large Language Models? An Extensive Study
This study evaluated the performance of three leading closed-source LLMs and six popular open-source LLMs on three commonly used benchmarks. We developed a taxonomy of bugs for incorrect code and analyzed the root causes of common bug types. We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code.
arXiv Detail & Related papers (2024-07-08T17:27:17Z) - Automated Commit Message Generation with Large Language Models: An Empirical Study and Beyond
Commit Message Generation (CMG) approaches aim to automatically generate commit messages based on given code diffs.
This paper conducts the first comprehensive experiment to investigate how far we have been in applying Large Language Models (LLMs) to generate high-quality commit messages.
arXiv Detail & Related papers (2024-04-23T08:24:43Z) - PiCO: Peer Review in LLMs based on the Consistency Optimization
We use peer-review mechanisms to measure large language models (LLMs) automatically. We formalize it as a constrained optimization problem, intending to maximize the consistency of each LLM's capabilities and scores. We propose three metrics, called PEN, CIN, and LIS, to evaluate the gap in aligning human rankings.
arXiv Detail & Related papers (2024-02-02T18:49:26Z) - MAgIC: Investigation of Large Language Model Powered Multi-Agent in Cognition, Adaptability, Rationality and Collaboration
Large Language Models (LLMs) have significantly advanced natural language processing. As their applications expand into multi-agent environments, there arises a need for a comprehensive evaluation framework. This work introduces a novel competition-based benchmark framework to assess LLMs within multi-agent settings.
arXiv Detail & Related papers (2023-11-14T21:46:27Z) - Flames: Benchmarking Value Alignment of LLMs in Chinese
This paper proposes a value alignment benchmark named Flames.
It encompasses both common harmlessness principles and a unique morality dimension that integrates specific Chinese values.
Our findings indicate that all the evaluated LLMs demonstrate relatively poor performance on Flames.
arXiv Detail & Related papers (2023-11-12T17:18:21Z) - LLMRec: Benchmarking Large Language Models on Recommendation Task
The application of Large Language Models (LLMs) in the recommendation domain has not been thoroughly investigated.
We benchmark several popular off-the-shelf LLMs on five recommendation tasks, including rating prediction, sequential recommendation, direct recommendation, explanation generation, and review summarization.
The benchmark results indicate that LLMs displayed only moderate proficiency in accuracy-based tasks such as sequential and direct recommendation.
arXiv Detail & Related papers (2023-08-23T16:32:54Z) - Through the Lens of Core Competency: Survey on Evaluation of Large Language Models
Large language models (LLMs) have excellent performance and wide practical uses.
Existing evaluation tasks struggle to keep up with the wide range of applications in real-world scenarios.
We summarize four core competencies of LLMs: reasoning, knowledge, reliability, and safety.
Under this competency architecture, similar tasks are combined to reflect corresponding abilities, while new tasks can easily be added to the system.
arXiv Detail & Related papers (2023-08-15T17:40:34Z)