Precision or Peril: Evaluating Code Quality from Quantized Large Language Models
- URL: http://arxiv.org/abs/2411.10656v1
- Date: Sat, 16 Nov 2024 01:31:29 GMT
- Title: Precision or Peril: Evaluating Code Quality from Quantized Large Language Models
- Authors: Eric L. Melin, Adam J. Torek, Nasir U. Eisty, Casey Kennington
- Abstract summary: Quantization has emerged as a way to mitigate the memory overhead of Large Language Models.
This study aims to evaluate the current code generation capabilities of smaller LLMs using various metrics.
- Score: 0.5249805590164902
- Abstract: When scaled to hundreds of billions of parameters, Large Language Models (LLMs) such as GPT-4 and LLaMA-405b have demonstrated remarkable capabilities in tasks such as code generation, code completion, and writing test cases. However, scaling up model sizes results in exponentially higher computational cost and energy consumption, leaving a large carbon footprint and making these models difficult to use by academic researchers and small businesses. Quantization has emerged as a way to mitigate the memory overhead of LLMs, allowing them to run on smaller hardware for lower prices. Quantization, however, may have detrimental effects on a model's output, and its effects on LLM-generated code quality remain understudied and require constant evaluation as LLMs are improved. This study aims to evaluate the current code generation capabilities of smaller LLMs using various metrics, exploring the impact of quantization on code quality, and identifying prevalent quality issues in the generated code. Method: We conducted a comprehensive evaluation of four smaller open-source LLMs across two benchmarks and code similarity scores. The impact of 8-bit and 4-bit quantization was analyzed, and a static analysis tool was utilized to scrutinize the generated code's quality. Our findings reveal that while the tested LLMs exhibit potential, these smaller LLMs produce code with subpar performance on established benchmarks. The effects of quantization on code quality are inconsistent, and the generated code frequently exhibits recurring quality and maintainability issues. This study underscores the necessity for careful scrutiny and validation of LLM-generated code before its adoption in software projects. While smaller LLMs can generate code, their output requires careful monitoring and validation by practitioners before integration into software projects.
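The abstract refers to 8-bit and 4-bit quantization without naming a scheme. As a rough illustration only, the sketch below uses symmetric absmax rounding, which is an assumption and not necessarily the method used in the paper. The helper names (`quantize_absmax`, `dequantize`, `max_abs_error`, `weight_memory_gib`) and the toy weights are hypothetical. It shows why fewer bits save memory and why they can cost precision.

```python
from typing import List, Tuple


def quantize_absmax(weights: List[float], bits: int) -> Tuple[List[int], float]:
    """Map floats to signed integers of the given width (symmetric absmax).

    Illustrative scheme only; the paper does not state which scheme it used.
    """
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit
    absmax = max(abs(w) for w in weights)
    scale = absmax / qmax if absmax > 0 else 1.0
    # Round to the nearest integer level and clamp to the representable range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the integer levels."""
    return [v * scale for v in q]


def max_abs_error(weights: List[float], bits: int) -> float:
    """Largest round-trip error introduced by quantizing to `bits` bits."""
    q, scale = quantize_absmax(weights, bits)
    restored = dequantize(q, scale)
    return max(abs(a - b) for a, b in zip(weights, restored))


def weight_memory_gib(num_params: int, bits: int) -> float:
    """Memory needed for the weights alone, in GiB (ignores activations)."""
    return num_params * bits / 8 / 2 ** 30


if __name__ == "__main__":
    toy_weights = [0.82, -0.31, 0.07, -1.00, 0.45, 0.003]  # made-up values
    for bits in (8, 4):
        err = max_abs_error(toy_weights, bits)
        mem = weight_memory_gib(7_000_000_000, bits)  # hypothetical 7B model
        print(f"{bits}-bit: max round-trip error {err:.4f}, weights {mem:.2f} GiB")
```

With this scheme the round-trip error is bounded by half the step size, so halving the bit width halves the memory while coarsening the grid of representable weights. Whether that coarsening changes the quality of generated code is the empirical question the study evaluates.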
Related papers
- CodeLutra: Boosting LLM Code Generation via Preference-Guided Refinement [32.46078765471136]
We introduce CodeLutra, a novel framework that enhances low-performing large language models.
Unlike conventional fine-tuning, CodeLutra employs an iterative preference learning mechanism to compare correct and incorrect solutions.
On a challenging data analysis task, using just 500 samples improved Llama-3-8B's accuracy from 28.2% to 48.6%, approaching GPT-4's performance.
arXiv Detail & Related papers (2024-11-07T21:51:07Z)
- A Performance Study of LLM-Generated Code on Leetcode [1.747820331822631]
This study evaluates the efficiency of code generation by Large Language Models (LLMs).
We compare 18 LLMs, considering factors such as model temperature and success rate, and their impact on code performance.
We find that LLMs are capable of generating code that is, on average, more efficient than the code written by humans.
arXiv Detail & Related papers (2024-07-31T13:10:03Z)
- Source Code Summarization in the Era of Large Language Models [23.715005053430957]
Large language models (LLMs) have led to a great boost in the performance of code-related tasks.
In this paper, we undertake a systematic and comprehensive study on code summarization in the era of LLMs.
arXiv Detail & Related papers (2024-07-09T05:48:42Z)
- What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complicated as compared to canonical solutions.
We develop a taxonomy of bugs for incorrect codes that includes three categories and 12 sub-categories, and analyze the root cause for common bug types.
We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback.
arXiv Detail & Related papers (2024-07-08T17:27:17Z)
- Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning [53.6472920229013]
Large Language Models (LLMs) have demonstrated impressive capability in many natural language tasks.
LLMs are prone to produce errors, hallucinations and inconsistent statements when performing multi-step reasoning.
We introduce Q*, a framework for guiding LLMs' decoding process with deliberative planning.
arXiv Detail & Related papers (2024-06-20T13:08:09Z)
- InfiBench: Evaluating the Question-Answering Capabilities of Code Large Language Models [56.723509505549536]
InfiBench is the first large-scale freeform question-answering (QA) benchmark for code to our knowledge.
It comprises 234 carefully selected high-quality Stack Overflow questions that span across 15 programming languages.
We conduct a systematic evaluation for over 100 latest code LLMs on InfiBench, leading to a series of novel and insightful findings.
arXiv Detail & Related papers (2024-03-11T02:06:30Z)
- StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback [58.20547418182074]
We introduce StepCoder, a novel framework for code generation, consisting of two main components.
CCCS addresses the exploration challenge by breaking the long-sequence code generation task into a Curriculum of Code Completion Subtasks.
FGO optimizes the model only by masking the unexecuted code segments to provide Fine-Grained Optimization.
Our method improves the ability to explore the output space and outperforms state-of-the-art approaches in corresponding benchmarks.
arXiv Detail & Related papers (2024-02-02T13:14:31Z)
- If LLM Is the Wizard, Then Code Is the Wand: A Survey on How Code Empowers Large Language Models to Serve as Intelligent Agents [81.60906807941188]
Large language models (LLMs) are trained on a combination of natural language and formal language (code).
Code translates high-level goals into executable steps, featuring standard syntax, logical consistency, abstraction, and modularity.
arXiv Detail & Related papers (2024-01-01T16:51:20Z)
- LLM-Assisted Code Cleaning For Training Accurate Code Generators [53.087019724256606]
We investigate data quality for code and find that making the code more structured and readable leads to improved code generation performance of the system.
We build a novel data-cleaning pipeline that uses these principles to transform existing programs.
We evaluate our approach on two challenging algorithmic code generation benchmarks and find that fine-tuning CodeLLaMa-7B improves the performance by up to 30% compared to fine-tuning on the original dataset.
arXiv Detail & Related papers (2023-11-25T02:45:50Z)
- Large Language Models for Code Analysis: Do LLMs Really Do Their Job? [13.48555476110316]
Large language models (LLMs) have demonstrated significant potential in the realm of natural language understanding and programming code processing tasks.
This paper offers a comprehensive evaluation of LLMs' capabilities in performing code analysis tasks.
arXiv Detail & Related papers (2023-10-18T22:02:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.