LLM-Assisted Code Cleaning For Training Accurate Code Generators
- URL: http://arxiv.org/abs/2311.14904v1
- Date: Sat, 25 Nov 2023 02:45:50 GMT
- Title: LLM-Assisted Code Cleaning For Training Accurate Code Generators
- Authors: Naman Jain, Tianjun Zhang, Wei-Lin Chiang, Joseph E. Gonzalez, Koushik
Sen, Ion Stoica
- Abstract summary: We investigate data quality for code and find that making the code more structured and readable leads to improved code generation performance of the system.
We build a novel data-cleaning pipeline that uses these principles to transform existing programs.
We evaluate our approach on two challenging algorithmic code generation benchmarks and find that fine-tuning CodeLLaMa-7B on the transformed programs improves performance by up to 30% compared to fine-tuning on the original dataset.
- Score: 53.087019724256606
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Natural language to code generation is an important application area of LLMs
and has received wide attention from the community. The majority of relevant
studies have exclusively concentrated on increasing the quantity and functional
correctness of training sets while disregarding other stylistic elements of
programs. More recently, data quality has garnered a lot of interest and
multiple works have showcased its importance for improving performance. In this
work, we investigate data quality for code and find that making the code more
structured and readable leads to improved code generation performance of the
system. We build a novel data-cleaning pipeline that uses these principles to
transform existing programs by (1) renaming variables, (2) modularizing and
decomposing complex code into smaller helper sub-functions, and (3) inserting
natural-language plans via LLM-based transformations. We evaluate our
approach on two challenging algorithmic code generation benchmarks and find
that fine-tuning CodeLLaMa-7B on our transformed modularized programs improves
the performance by up to 30% compared to fine-tuning on the original dataset.
Additionally, we demonstrate improved performance from using a smaller amount
of higher-quality data, finding that a model fine-tuned on the entire original
dataset is outperformed by a model trained on 15% of our cleaned dataset. Even
in comparison to closed-source models, our models outperform the much larger
AlphaCoder models.
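The cleaning steps above are applied with an LLM and are only useful if they preserve program behavior. The sketch below is a rough illustration of that idea, not the authors' pipeline: each step is a prompted rewrite that is kept only when the candidate still passes the original test cases. The `call_llm` callable, the prompt templates, and the stdin/stdout test format are assumptions.

```python
# Illustrative sketch (not the authors' code): apply the three cleaning steps
# as successive LLM rewrites, keeping a rewrite only if the transformed program
# still passes the original test cases.
import subprocess
import tempfile
from typing import Callable, List, Tuple

CLEANING_PROMPTS = [  # assumed prompt templates
    "Rename the variables in this program to be descriptive. Keep behavior identical.\n\n{code}",
    "Refactor this program into small helper functions with clear responsibilities.\n\n{code}",
    "Add a short natural-language plan as comments describing the algorithm.\n\n{code}",
]

def passes_tests(code: str, tests: List[Tuple[str, str]]) -> bool:
    """Run a candidate program on stdin/stdout test cases (competitive-programming style)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    for stdin, expected in tests:
        result = subprocess.run(["python", path], input=stdin, capture_output=True,
                                text=True, timeout=10)
        if result.returncode != 0 or result.stdout.strip() != expected.strip():
            return False
    return True

def clean_program(code: str, tests: List[Tuple[str, str]],
                  call_llm: Callable[[str], str]) -> str:
    """Apply each cleaning step; reject any rewrite that breaks functional correctness."""
    current = code
    for template in CLEANING_PROMPTS:
        candidate = call_llm(template.format(code=current))
        if passes_tests(candidate, tests):
            current = candidate
    return current
```

Rejecting only the failed rewrite, rather than the whole sample, keeps the most recent still-correct version of each program in the training set.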
Related papers
- OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models [70.72097493954067]
Large language models (LLMs) for code have become indispensable in various domains, including code generation, reasoning tasks, and agent systems.
We introduce OpenCoder, a top-tier code LLM that not only achieves performance comparable to leading models but also serves as an "open cookbook" for the research community.
arXiv Detail & Related papers (2024-11-07T17:47:25Z) - What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complicated than canonical solutions.
We develop a taxonomy of bugs for incorrect code that includes three categories and 12 sub-categories, and analyze the root causes of common bug types.
We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback.
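As a rough illustration of such a training-free repair loop (with placeholder prompts and helper functions, not the paper's exact procedure), one round could alternate a critique step and a rewrite step conditioned on compiler feedback:

```python
# Minimal sketch of a training-free self-critique loop driven by compiler
# feedback. The prompt wording, `call_llm`, and `compile_and_run` are placeholders.
from typing import Callable, Tuple

def self_critique_repair(task: str, code: str, call_llm: Callable[[str], str],
                         compile_and_run: Callable[[str], Tuple[bool, str]],
                         max_rounds: int = 3) -> str:
    for _ in range(max_rounds):
        ok, feedback = compile_and_run(code)  # e.g. compiler errors or failing test output
        if ok:
            break
        critique = call_llm(
            f"Task:\n{task}\n\nCode:\n{code}\n\nFeedback:\n{feedback}\n\n"
            "Identify the bug type and explain what is wrong."
        )
        code = call_llm(
            f"Task:\n{task}\n\nCode:\n{code}\n\nCritique:\n{critique}\n\n"
            "Rewrite the code to fix the identified bug. Return only code."
        )
    return code
```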
arXiv Detail & Related papers (2024-07-08T17:27:17Z) - Code Less, Align More: Efficient LLM Fine-tuning for Code Generation with Data Pruning [4.975728472540823]
We present techniques that integrate various clustering and pruning metrics to selectively reduce training data without compromising the accuracy and functionality of the generated code.
Our experiments show that these pruning strategies not only reduce the computational resources needed but also enhance the overall quality of the generated code.
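One way such clustering-based pruning can be instantiated (an assumption-laden sketch, not the paper's exact metrics) is to embed every training example, cluster the embeddings, and keep only the most central members of each cluster:

```python
# Sketch: keep the examples closest to each cluster centroid and drop the rest.
# The embedding step, keep ratio, and cluster count are assumptions.
import numpy as np
from sklearn.cluster import KMeans

def prune_by_clustering(embeddings: np.ndarray, keep_ratio: float = 0.5,
                        n_clusters: int = 100, seed: int = 0) -> np.ndarray:
    """Return indices of the retained training examples."""
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit(embeddings)
    keep = []
    for c in range(n_clusters):
        idx = np.where(km.labels_ == c)[0]
        # distance of each member to its own centroid
        d = np.linalg.norm(embeddings[idx] - km.cluster_centers_[c], axis=1)
        n_keep = max(1, int(len(idx) * keep_ratio))
        keep.extend(idx[np.argsort(d)[:n_keep]])  # keep the most central examples
    return np.array(sorted(keep))
```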
arXiv Detail & Related papers (2024-07-06T10:30:43Z) - UICoder: Finetuning Large Language Models to Generate User Interface Code through Automated Feedback [21.858896845159208]
Large language models (LLMs) struggle to consistently generate UI code that compiles and produces visually relevant designs.
Existing approaches to improve generation rely on expensive human feedback or distilling a proprietary model.
Our method starts with an existing LLM and iteratively produces improved models by self-generating a large synthetic dataset.
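A schematic version of that self-improvement loop might look as follows; every function here (`compiles`, `relevance_score`, `finetune`) is a placeholder standing in for the paper's automated feedback components:

```python
# Sketch: the current model generates UI code, automated checkers filter it,
# and the surviving (prompt, code) pairs become the next fine-tuning set.
def self_improve(model, prompts, compiles, relevance_score, finetune,
                 rounds: int = 3, threshold: float = 0.7):
    for _ in range(rounds):
        dataset = []
        for prompt in prompts:
            code = model.generate(prompt)
            # keep only samples that compile and look relevant to the prompt
            if compiles(code) and relevance_score(prompt, code) >= threshold:
                dataset.append((prompt, code))
        model = finetune(model, dataset)  # train the next iteration on the filtered data
    return model
```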
arXiv Detail & Related papers (2024-06-11T21:53:46Z) - AlchemistCoder: Harmonizing and Eliciting Code Capability by Hindsight Tuning on Multi-source Data [64.69872638349922]
We present AlchemistCoder, a series of Code LLMs with enhanced code generation and generalization capabilities fine-tuned on multi-source data.
We propose incorporating the data construction process into the fine-tuning data as code comprehension tasks, including instruction evolution, data filtering, and code review.
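Purely as an illustration of folding a data-construction step back into the fine-tuning data, a code-review or filtering decision could be rewritten as a comprehension-style training example; the field names and prompt wording below are assumptions:

```python
# Hypothetical sketch: turn a review/filtering decision from the data pipeline
# into an extra instruction-tuning example.
def review_to_training_example(instruction: str, code: str,
                               verdict: str, rationale: str) -> dict:
    return {
        "prompt": (
            "Review the following solution and decide whether it should be kept "
            f"in a training set.\n\nTask:\n{instruction}\n\nCode:\n{code}"
        ),
        "response": f"Verdict: {verdict}\nRationale: {rationale}",
    }
```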
arXiv Detail & Related papers (2024-05-29T16:57:33Z) - Performance-Aligned LLMs for Generating Fast Code [2.180216161965907]
We introduce a reinforcement learning based methodology to align the outputs of code LLMs with performance.
We demonstrate that our fine-tuned model improves the expected speedup of generated code over base models for a set of benchmark tasks.
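A correctness-gated speedup reward is one plausible shape for such a performance signal; the sketch below assumes callable candidate/reference implementations and simple wall-clock timing, which is not necessarily how the paper measures performance:

```python
# Sketch of a performance-based reward: zero for incorrect programs, otherwise
# the speedup of the generated code over a reference implementation.
import time
from typing import Callable, List, Tuple

def timed(fn: Callable, args: tuple) -> float:
    start = time.perf_counter()
    fn(*args)
    return time.perf_counter() - start

def speedup_reward(candidate: Callable, reference: Callable,
                   tests: List[Tuple[tuple, object]]) -> float:
    # incorrect programs get zero reward
    if any(candidate(*args) != expected for args, expected in tests):
        return 0.0
    t_ref = sum(timed(reference, args) for args, _ in tests)
    t_cand = sum(timed(candidate, args) for args, _ in tests)
    return t_ref / max(t_cand, 1e-9)  # >1 means faster than the reference
```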
arXiv Detail & Related papers (2024-04-29T16:52:38Z) - Code Needs Comments: Enhancing Code LLMs with Comment Augmentation [91.52444946362547]
We introduce a novel data augmentation method that generates comments for existing code, coupled with a data filtering strategy that filters out code data poorly correlated with natural language.
We conducted experiments on three code-focused Large Language Models and observed consistent improvements in performance on two widely-used programming skill benchmarks.
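A minimal sketch of that recipe, with a placeholder LLM call and a placeholder NL-code correlation scorer, could look like:

```python
# Sketch: an LLM adds comments to existing code, while a scoring model drops
# samples whose code correlates poorly with natural language. Both helpers are
# assumptions, not the paper's components.
from typing import Callable, Iterable, List

def augment_with_comments(snippets: Iterable[str], call_llm: Callable[[str], str],
                          nl_code_score: Callable[[str], float],
                          min_score: float = 0.5) -> List[str]:
    kept = []
    for code in snippets:
        if nl_code_score(code) < min_score:  # filter: weak NL/code correlation
            continue
        commented = call_llm(
            "Add concise comments explaining this code. Do not change its behavior.\n\n" + code
        )
        kept.append(commented)
    return kept
```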
arXiv Detail & Related papers (2024-02-20T13:56:38Z) - StepCoder: Improve Code Generation with Reinforcement Learning from
Compiler Feedback [58.20547418182074]
We introduce StepCoder, a novel framework for code generation, consisting of two main components.
CCCS addresses the exploration challenge by breaking the long-sequence code generation task into a Curriculum of Code Completion Subtasks.
FGO optimizes the model only on the executed code segments, masking out unexecuted ones to provide Fine-Grained Optimization.
Our method improves the ability to explore the output space and outperforms state-of-the-art approaches in corresponding benchmarks.
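The FGO idea of optimizing only on executed code can be sketched as a loss mask over token positions; the per-token loss, token-to-line mapping, and executed-line trace below are assumed inputs rather than StepCoder's actual implementation:

```python
# Sketch: zero out loss contributions from tokens on lines that never executed,
# so only executed code receives gradient.
import torch

def fgo_masked_loss(per_token_loss: torch.Tensor, token_line_ids: torch.Tensor,
                    executed_lines: set) -> torch.Tensor:
    """per_token_loss: (seq_len,); token_line_ids: (seq_len,) line number of each token."""
    mask = torch.tensor([float(int(l) in executed_lines) for l in token_line_ids],
                        dtype=per_token_loss.dtype, device=per_token_loss.device)
    denom = mask.sum().clamp(min=1.0)
    return (per_token_loss * mask).sum() / denom  # average loss over executed tokens only
```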
arXiv Detail & Related papers (2024-02-02T13:14:31Z)