Code Needs Comments: Enhancing Code LLMs with Comment Augmentation
- URL: http://arxiv.org/abs/2402.13013v1
- Date: Tue, 20 Feb 2024 13:56:38 GMT
- Title: Code Needs Comments: Enhancing Code LLMs with Comment Augmentation
- Authors: Demin Song, Honglin Guo, Yunhua Zhou, Shuhao Xing, Yudong Wang, Zifan
Song, Wenwei Zhang, Qipeng Guo, Hang Yan, Xipeng Qiu, Dahua Lin
- Abstract summary: We introduce a novel data augmentation method that generates comments for existing code, coupled with a data filtering strategy that filters out code data poorly correlated with natural language.
We conducted experiments on three code-focused Large Language Models and observed consistent improvements in performance on two widely-used programming skill benchmarks.
- Score: 91.52444946362547
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Programming skill is a crucial ability for Large Language Models
(LLMs), requiring a deep understanding of programming languages (PLs) and of
their correlation with natural languages (NLs). We examine the impact of
pre-training data on code-focused LLMs' performance by assessing the comment
density as a measure of PL-NL alignment. Given the scarcity of code-comment
aligned data in pre-training corpora, we introduce a novel data augmentation
method that generates comments for existing code, coupled with a data filtering
strategy that filters out code data poorly correlated with natural language. We
conducted experiments on three code-focused LLMs and observed consistent
improvements in performance on two widely-used programming skill benchmarks.
Notably, the model trained on the augmented data outperformed both the model
used for generating comments and the model further trained on the data without
augmentation.
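As a concrete illustration of the two ingredients the abstract describes, the following is a minimal Python sketch that scores code samples by comment density, filters out low-density samples, and sets up a prompt for LLM-based comment generation. The tokenizer-based density estimate, the 0.05 threshold, the function names, and the prompt wording are assumptions made for this sketch, not the authors' implementation.

```python
# Minimal sketch (an illustration, not the paper's released code):
# score code samples by comment density and keep only those above a
# threshold; the cutoff and prompt wording are assumptions.
import io
import tokenize

def comment_density(source: str) -> float:
    """Return the fraction of meaningful tokens that are comments."""
    comment_tokens, total_tokens = 0, 0
    skip = {tokenize.NL, tokenize.NEWLINE, tokenize.INDENT,
            tokenize.DEDENT, tokenize.ENDMARKER}
    try:
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            if tok.type in skip:
                continue
            total_tokens += 1
            if tok.type == tokenize.COMMENT:
                comment_tokens += 1
    except (tokenize.TokenError, SyntaxError):
        return 0.0  # treat unparseable samples as poorly NL-aligned
    return comment_tokens / total_tokens if total_tokens else 0.0

def filter_by_density(samples: list[str], min_density: float = 0.05) -> list[str]:
    """Drop code samples whose comment density falls below min_density."""
    return [s for s in samples if comment_density(s) >= min_density]

# Illustrative prompt for the augmentation step: ask an existing code LLM
# to add comments to retained samples (the exact prompt is hypothetical).
AUGMENT_PROMPT = (
    "Add concise, accurate natural-language comments to the following code "
    "without changing its behavior:\n\n{code}"
)
```

A full pipeline along these lines would run the prompt through a code LLM and mix the commented outputs back into the pre-training corpus.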
Related papers
- Prompting and Fine-tuning Large Language Models for Automated Code Review Comment Generation [5.6001617185032595]
Large language models pretrained on both programming and natural language data tend to perform well in code-oriented tasks.
We fine-tune open-source large language models (LLMs) in a parameter-efficient, quantized low-rank fashion on consumer-grade hardware to improve review comment generation.
arXiv Detail & Related papers (2024-11-15T12:01:38Z) - OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models [70.72097493954067]
Large language models (LLMs) for code have become indispensable in various domains, including code generation, reasoning tasks and agent systems.
While open-access code LLMs are increasingly approaching the performance levels of proprietary models, high-quality code LLMs remain limited.
We introduce OpenCoder, a top-tier code LLM that not only achieves performance comparable to leading models but also serves as an "open cookbook" for the research community.
arXiv Detail & Related papers (2024-11-07T17:47:25Z) - Crystal: Illuminating LLM Abilities on Language and Code [58.5467653736537]
We propose a pretraining strategy to enhance the integration of natural language and coding capabilities.
The resulting model, Crystal, demonstrates remarkable capabilities in both domains.
arXiv Detail & Related papers (2024-11-06T10:28:46Z) - Code Less, Align More: Efficient LLM Fine-tuning for Code Generation with Data Pruning [4.975728472540823]
We present techniques that integrate various clustering and pruning metrics to selectively reduce training data without compromising the accuracy and functionality of the generated code.
Our experiments show that these pruning strategies not only reduce the computational resources needed but also enhance the overall quality of the generated code.
arXiv Detail & Related papers (2024-07-06T10:30:43Z) - CodecLM: Aligning Language Models with Tailored Synthetic Data [51.59223474427153]
We introduce CodecLM, a framework for adaptively generating high-quality synthetic data for instruction-following abilities.
We first encode seed instructions into metadata, which are concise keywords generated on-the-fly to capture the target instruction distribution.
We also introduce Self-Rubrics and Contrastive Filtering during decoding to tailor data-efficient samples.
arXiv Detail & Related papers (2024-04-08T21:15:36Z) - LLM-Assisted Code Cleaning For Training Accurate Code Generators [53.087019724256606]
We investigate data quality for code and find that making the code more structured and readable leads to improved code generation performance of the system.
We build a novel data-cleaning pipeline that uses these principles to transform existing programs.
We evaluate our approach on two challenging algorithmic code generation benchmarks and find that fine-tuning CodeLLaMa-7B on the cleaned programs improves performance by up to 30% compared to fine-tuning on the original dataset.
arXiv Detail & Related papers (2023-11-25T02:45:50Z) - Exploring Large Language Models for Code Explanation [3.2570216147409514]
Large Language Models (LLMs) have made remarkable strides in Natural Language Processing.
This study specifically delves into the task of generating natural-language summaries for code snippets, using various LLMs.
arXiv Detail & Related papers (2023-10-25T14:38:40Z) - CoAnnotating: Uncertainty-Guided Work Allocation between Human and Large Language Models for Data Annotation [94.59630161324013]
We propose CoAnnotating, a novel paradigm for Human-LLM co-annotation of unstructured texts at scale.
Our empirical study on different datasets shows CoAnnotating to be an effective means of allocating work, achieving up to a 21% performance improvement over a random baseline.
arXiv Detail & Related papers (2023-10-24T08:56:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.