Evaluation of large language models for assessing code maintainability
- URL: http://arxiv.org/abs/2401.12714v1
- Date: Tue, 23 Jan 2024 12:29:42 GMT
- Title: Evaluation of large language models for assessing code maintainability
- Authors: Marc Dillmann, Julien Siebert, Adam Trendowicz
- Abstract summary: We investigate the association between the cross-entropy of code generated by ten different models and quality aspects.
Our results show that, controlling for the number of logical lines of codes, cross-entropy computed by LLMs is indeed a predictor of maintainability on a class level.
While the complexity of LLMs affects the range of cross-entropy, this plays no significant role in predicting maintainability aspects.
- Score: 4.2909314120969855
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Increased availability of open-source software repositories and recent
advances in code analysis using large language models (LLMs) have triggered a
wave of new work to automate software engineering tasks that were previously
very difficult to automate. In this paper, we investigate a recent line of work
that hypothesises that comparing the probability of code generated by LLMs with
the probability the current code would have had can indicate potential quality
problems. We investigate the association between the cross-entropy of code
generated by ten different models (based on GPT2 and Llama2) and the following
quality aspects: readability, understandability, complexity, modularisation,
and overall maintainability assessed by experts and available in a benchmark
dataset. Our results show that, controlling for the number of logical lines of
code (LLOC), cross-entropy computed by LLMs is indeed a predictor of
maintainability on a class level (the higher the cross-entropy the lower the
maintainability). However, this relation is reversed when one does not control
for LLOC (e.g., comparing small classes with longer ones). Furthermore, while
the complexity of LLMs affects the range of cross-entropy (smaller models tend
to have a wider range of cross-entropy), this plays no significant role in
predicting maintainability aspects. Our study limits itself to ten different
pretrained models (based on GPT2 and Llama2) and on maintainability aspects
collected by Schnappinger et al. When controlling for logical lines of code
(LLOC), cross-entropy is a predictor of maintainability. However, while related
work has shown the potential usefulness of cross-entropy at the level of tokens
or short sequences, at the class level this criterion alone may prove
insufficient to predict maintainability and further research is needed to make
best use of this information in practice.
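To make the quantity under study concrete, the following is a minimal sketch of computing the mean token-level cross-entropy of a code snippet under a pretrained causal language model. It assumes the Hugging Face transformers and torch packages and the public gpt2 checkpoint; it is not the authors' pipeline, which also covers Llama2-based models and the preprocessing of the Schnappinger et al. benchmark.

```python
# Minimal sketch (not the authors' exact pipeline): mean next-token
# cross-entropy of a code snippet under a pretrained causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def code_cross_entropy(code: str, model_name: str = "gpt2") -> float:
    """Return the mean token-level cross-entropy of `code` under `model_name`."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    # Tokenise the class/snippet; truncate to the model's context window.
    inputs = tokenizer(code, return_tensors="pt", truncation=True,
                       max_length=model.config.max_position_embeddings)

    with torch.no_grad():
        # With labels == input_ids, the returned loss is the mean
        # next-token cross-entropy over the sequence.
        outputs = model(**inputs, labels=inputs["input_ids"])
    return outputs.loss.item()


if __name__ == "__main__":
    snippet = "public int add(int a, int b) { return a + b; }"
    print(f"mean cross-entropy: {code_cross_entropy(snippet):.3f}")
```

Higher values mean the model found the existing code less predictable; the paper relates this per-class value to expert maintainability ratings.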
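The reported sign flip (cross-entropy predicts lower maintainability once LLOC is controlled for, but the association reverses without that control) can be illustrated with a plain regression. The data below is synthetic and the generative story is hypothetical; the sketch only reproduces the direction of the effect described in the abstract, not the paper's estimates, and it uses ordinary least squares via statsmodels rather than the authors' actual statistical models.

```python
# Illustrative only: how controlling for logical lines of code (LLOC) can
# reverse the apparent relation between cross-entropy and maintainability.
# Synthetic data; the paper uses expert ratings from the Schnappinger et al.
# benchmark.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500

lloc = rng.integers(5, 400, size=n)  # class size in logical lines of code
# Hypothetical story: longer classes are harder to maintain but also cheaper
# for the LM to predict per token (lower cross-entropy).
cross_entropy = 4.0 - 0.004 * lloc + rng.normal(0, 0.3, size=n)
maintainability = 5.0 - 0.008 * lloc - 0.5 * cross_entropy + rng.normal(0, 0.3, size=n)

# Model 1: cross-entropy alone (no control for LLOC) -> positive coefficient.
m1 = sm.OLS(maintainability, sm.add_constant(cross_entropy)).fit()
# Model 2: cross-entropy while controlling for LLOC -> negative coefficient.
m2 = sm.OLS(maintainability,
            sm.add_constant(np.column_stack([cross_entropy, lloc]))).fit()

print("CE coefficient without LLOC control:", round(m1.params[1], 3))
print("CE coefficient with LLOC control:   ", round(m2.params[1], 3))
```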
Related papers
- Real-Time Anomaly Detection and Reactive Planning with Large Language Models [18.57162998677491]
Foundation models, e.g., large language models (LLMs), trained on internet-scale data possess zero-shot capabilities.
We present a two-stage reasoning framework that incorporates the judgement regarding potential anomalies into a safe control framework.
This enables our monitor to improve the trustworthiness of dynamic robotic systems, such as quadrotors or autonomous vehicles.
arXiv Detail & Related papers (2024-07-11T17:59:22Z)
- What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complicated than canonical solutions.
We develop a taxonomy of bugs for incorrect codes that includes three categories and 12 sub-categories, and analyze the root causes of common bug types.
We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback.
arXiv Detail & Related papers (2024-07-08T17:27:17Z)
- DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph [70.79413606968814]
We introduce Dynamic Evaluation of LLMs via Adaptive Reasoning Graph Evolvement (DARG) to dynamically extend current benchmarks with controlled complexity and diversity.
Specifically, we first extract the reasoning graphs of data points in current benchmarks and then perturb the reasoning graphs to generate novel testing data.
Such newly generated test samples can have different levels of complexity while maintaining linguistic diversity similar to the original benchmarks.
arXiv Detail & Related papers (2024-06-25T04:27:53Z)
- Unlock the Correlation between Supervised Fine-Tuning and Reinforcement Learning in Training Code Large Language Models [12.656574142412484]
We attempt to understand the correlation between supervised fine-tuning (SFT) and reinforcement learning (RL).
We find that both atomic and synthetic functions are indispensable for SFT's generalization.
arXiv Detail & Related papers (2024-06-14T03:39:01Z)
- Quantifying Contamination in Evaluating Code Generation Capabilities of Language Models [27.24738197172374]
Large language models have achieved remarkable performance on various code generation benchmarks.
There have been growing concerns regarding potential contamination of these benchmarks as they may be leaked into pretraining and finetuning data.
We show that there is substantial overlap between popular code generation benchmarks and open training corpora, and that models perform significantly better on the subset of the benchmarks where similar solutions are seen during training (a minimal n-gram overlap sketch appears after this list).
arXiv Detail & Related papers (2024-03-06T21:45:35Z)
- CogBench: a large language model walks into a psychology lab [12.981407327149679]
This paper introduces CogBench, a benchmark that includes ten behavioral metrics derived from seven cognitive psychology experiments.
We apply CogBench to 35 large language models (LLMs) and analyze this data using statistical multilevel modeling techniques.
We find that open-source models are less risk-prone than proprietary models and that fine-tuning on code does not necessarily enhance LLMs' behavior.
arXiv Detail & Related papers (2024-02-28T10:43:54Z)
- AQA-Bench: An Interactive Benchmark for Evaluating LLMs' Sequential Reasoning Ability [29.1826948551409]
AQA-Bench is a novel benchmark to assess the sequential reasoning capabilities of large language models.
We build AQA-Bench with three different algorithms, namely binary search, depth-first search, and breadth-first search (a toy binary-search interaction loop is sketched after this list).
Our investigations reveal several interesting findings.
arXiv Detail & Related papers (2024-02-14T18:59:33Z)
- A General Framework for Learning from Weak Supervision [93.89870459388185]
This paper introduces a general framework for learning from weak supervision (GLWS) with a novel algorithm.
Central to GLWS is an Expectation-Maximization (EM) formulation, adeptly accommodating various weak supervision sources.
We also present an advanced algorithm that significantly simplifies the EM computational demands.
arXiv Detail & Related papers (2024-02-02T21:48:50Z)
- Amortizing intractable inference in large language models [56.92471123778389]
We use amortized Bayesian inference to sample from intractable posterior distributions.
We empirically demonstrate that this distribution-matching paradigm of LLM fine-tuning can serve as an effective alternative to maximum-likelihood training.
As an important application, we interpret chain-of-thought reasoning as a latent variable modeling problem.
arXiv Detail & Related papers (2023-10-06T16:36:08Z)
- The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute [66.84421705029624]
We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours.
We pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length.
This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput.
arXiv Detail & Related papers (2023-09-20T10:31:17Z)
- Certified Reinforcement Learning with Logic Guidance [78.2286146954051]
We propose a model-free RL algorithm that enables the use of Linear Temporal Logic (LTL) to formulate a goal for unknown continuous-state/action Markov Decision Processes (MDPs).
The algorithm is guaranteed to synthesise a control policy whose traces satisfy the specification with maximal probability.
arXiv Detail & Related papers (2019-02-02T20:09:32Z)
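For the "Quantifying Contamination in Evaluating Code Generation Capabilities of Language Models" entry above, one way to picture a surface-level contamination check is the fraction of a benchmark solution's token n-grams that also occur in an open training corpus. The sketch below is a generic illustration under an assumed whitespace tokenisation; it is not the exact metric used in that paper.

```python
# Generic surface-level contamination check (illustrative, not the paper's
# exact metric): share of a benchmark solution's token n-grams that also
# appear somewhere in a training corpus.
from typing import Iterable, List, Set, Tuple


def ngrams(tokens: List[str], n: int) -> Set[Tuple[str, ...]]:
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(solution: str, corpus_docs: Iterable[str], n: int = 13) -> float:
    """Fraction of the solution's n-grams that occur anywhere in the corpus."""
    sol_grams = ngrams(solution.split(), n)
    if not sol_grams:
        return 0.0
    corpus_grams: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc.split(), n)
    return sum(1 for g in sol_grams if g in corpus_grams) / len(sol_grams)


# Toy usage: a ratio near 1.0 flags the benchmark item as likely seen in training.
docs = ["def add(a, b):\n    return a + b", "print('hello world')"]
print(overlap_ratio("def add(a, b):\n    return a + b", docs, n=3))
```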
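For the AQA-Bench entry above, the binary-search task can be pictured as an interactive guessing loop: the environment holds a hidden number and answers only "higher", "lower", or "correct". In the toy below the agent is a scripted exact binary search; in the benchmark the guesses would come from an LLM, and all names here are illustrative rather than taken from the AQA-Bench code.

```python
# Toy interactive loop in the spirit of AQA-Bench's binary-search task.
# The agent here is a scripted binary search; in the benchmark the guesses
# would be produced by an LLM given the interaction history.
import math


def make_feedback(target: int):
    """Environment: answers 'correct', 'higher', or 'lower' for a fixed hidden target."""
    def feedback(guess: int) -> str:
        if guess == target:
            return "correct"
        return "higher" if target > guess else "lower"
    return feedback


def run_episode(target: int, lo: int = 0, hi: int = 1023) -> int:
    """Play one episode; return the number of guesses the agent needed."""
    feedback = make_feedback(target)
    guesses = 0
    while lo <= hi:
        guess = (lo + hi) // 2  # the agent's move (an LLM reply in the benchmark)
        guesses += 1
        signal = feedback(guess)
        if signal == "correct":
            return guesses
        if signal == "higher":
            lo = guess + 1
        else:
            hi = guess - 1
    return guesses


if __name__ == "__main__":
    print(run_episode(target=777), "guesses; the optimum for a range of 1024 is",
          math.ceil(math.log2(1024)))
```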
This list is automatically generated from the titles and abstracts of the papers on this site.