Related papers: Better Language Models of Code through Self-Improvement

URL: http://arxiv.org/abs/2304.01228v2
Date: Wed, 10 May 2023 02:36:40 GMT
Title: Better Language Models of Code through Self-Improvement
Authors: Hung Quoc To, Nghi D. Q. Bui, Jin Guo, Tien N. Nguyen
Abstract summary: We propose a simple data augmentation framework for pre-trained language models for code (PLMCs). Our framework utilizes knowledge gained during the pre-training and fine-tuning stage to generate pseudo data, which is then used as training data for the next step. The results show that our framework significantly improves PLMCs' performance in code-related sequence generation tasks.
Score: 18.75015225501755
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Pre-trained language models for code (PLMCs) have gained attention in recent research. These models are pre-trained on large-scale datasets using multi-modal objectives. However, fine-tuning them requires extensive supervision and is limited by the size of the dataset provided. We aim to improve this issue by proposing a simple data augmentation framework. Our framework utilizes knowledge gained during the pre-training and fine-tuning stage to generate pseudo data, which is then used as training data for the next step. We incorporate this framework into the state-of-the-art language models, such as CodeT5, CodeBERT, and UnixCoder. The results show that our framework significantly improves PLMCs' performance in code-related sequence generation tasks, such as code summarization and code generation in the CodeXGLUE benchmark.
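
The pseudo-data step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the helper names (`build_pseudo_data`, `token_overlap`, `toy_generate`) are hypothetical, the rule of keeping the candidate closest to the reference is an assumption, and a simple token-overlap score stands in for whatever similarity metric an actual pipeline would use.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (input sequence, target sequence)


def token_overlap(candidate: str, reference: str) -> float:
    """Jaccard overlap of whitespace tokens (assumed stand-in for a real metric)."""
    cand, ref = set(candidate.split()), set(reference.split())
    if not cand and not ref:
        return 1.0
    return len(cand & ref) / len(cand | ref)


def build_pseudo_data(
    generate: Callable[[str], List[str]],
    train_pairs: List[Pair],
    similarity: Callable[[str, str], float] = token_overlap,
) -> List[Pair]:
    """Pair each training input with the model output closest to its reference.

    `generate` is a hypothetical wrapper around an already fine-tuned model
    that returns several candidate outputs for one input.
    """
    pseudo: List[Pair] = []
    for source, target in train_pairs:
        candidates = generate(source)
        if not candidates:
            continue  # the model produced nothing usable for this input
        best = max(candidates, key=lambda c: similarity(c, target))
        pseudo.append((source, best))
    return pseudo


def toy_generate(source: str) -> List[str]:
    """Toy generator used only to make the sketch runnable."""
    return [f"returns {source}", f"computes the {source}", source.upper()]


train = [("sum of a list", "computes the sum of a list")]
pseudo = build_pseudo_data(toy_generate, train)
print(pseudo)
```

In a real setup, `generate` would wrap the fine-tuned model's decoding routine, and the resulting pairs would serve as training data for a further round of fine-tuning.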

Related papers

RETROcode: Leveraging a Code Database for Improved Natural Language to Code Generation [10.19019476978683]
We present RETROcode, a novel adaptation of the RETRO architecture for sequence-to-sequence models. Our findings indicate that RETROcode not only outperforms similar-sized traditional architectures on test sets but also approaches the effectiveness of the much larger Codex model.
arXiv Detail & Related papers (2025-04-08T07:41:13Z)
UnitCoder: Scalable Iterative Code Synthesis with Unit Test Guidance [65.01483640267885]
Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks, yet code generation remains a major challenge. We introduce UnitCoder, a systematic pipeline leveraging model-generated unit tests to guide and validate the code generation process. Our work presents a scalable approach that leverages model-generated unit tests to guide the synthesis of high-quality code data from pre-training corpora.
arXiv Detail & Related papers (2025-02-17T05:37:02Z)
An Empirical Study of Retrieval-Augmented Code Generation: Challenges and Opportunities [19.455889970335967]
Code generation aims to automatically generate code snippets in a specific programming language from natural language descriptions. One main challenge of pre-trained models for code generation is the semantic gap between natural language requirements and source code. A retrieval-augmented framework can be leveraged to help understand the requirements and provide guidance for the generation process.
arXiv Detail & Related papers (2025-01-23T15:17:51Z)
Large Language Model for Verilog Generation with Code-Structure-Guided Reinforcement Learning [29.135207235743795]
This paper introduces VeriSeek, an LLM enhanced by reinforcement learning to achieve high Verilog code generation performance. Our reinforcement learning approach employs code structure information as feedback signals to refine the pre-trained model. Experiments show that VeriSeek outperforms state-of-the-art methods across multiple benchmarks.
arXiv Detail & Related papers (2024-07-21T11:25:21Z)
Collaborative decoding of critical tokens for boosting factuality of large language models [57.504894664689]
Finetuned and aligned models show improved abilities of instruction following and safe generation. The common practice of using sampling during generation also increases chances of hallucination. We introduce a collaborative decoding framework to harness the high factuality within pretrained models through the concept of critical tokens.
arXiv Detail & Related papers (2024-02-28T01:53:37Z)
Code Representation Learning At Scale [75.04686476303436]
We fuel code representation learning with a vast amount of code data via a two-stage pretraining scheme. We first train the encoders via a mix that leverages both randomness in masking language modeling and the structure aspect of programming language. We then enhance the representations via contrastive learning with hard negative and hard positive constructed in an unsupervised manner.
arXiv Detail & Related papers (2024-02-02T22:19:15Z)
LLM-Assisted Code Cleaning For Training Accurate Code Generators [53.087019724256606]
We investigate data quality for code and find that making the code more structured and readable leads to improved code generation performance of the system. We build a novel data-cleaning pipeline that uses these principles to transform existing programs. We evaluate our approach on two challenging algorithmic code generation benchmarks and find that fine-tuning CodeLLaMa-7B improves the performance by up to 30% compared to fine-tuning on the original dataset.
arXiv Detail & Related papers (2023-11-25T02:45:50Z)
PanGu-Coder2: Boosting Large Language Models for Code with Ranking Feedback [5.459517921633247]
We propose a novel RRTF (Rank Responses to align Test&Teacher Feedback) framework, which can effectively and efficiently boost pre-trained large language models for code generation. Under this framework, we present PanGu-Coder2, which achieves 62.20% pass@1 on the OpenAI HumanEval benchmark.
arXiv Detail & Related papers (2023-07-27T15:28:29Z)
CodeT5+: Open Code Large Language Models for Code Understanding and Generation [72.1638273937025]
Large language models (LLMs) pretrained on vast source code have achieved prominent progress in code intelligence. CodeT5+ is a family of encoder-decoder LLMs for code in which component modules can be flexibly combined to suit a wide range of downstream code tasks. We extensively evaluate CodeT5+ on over 20 code-related benchmarks in different settings, including zero-shot, finetuning, and instruction-tuning.
arXiv Detail & Related papers (2023-05-13T14:23:07Z)
Stochastic Code Generation [1.7205106391379026]
Large language models pre-trained for code generation can generate high-quality short code but often struggle with generating coherent long code. This issue is also observed in language modeling for long text generation, where one proposed solution is the use of a latent stochastic process. In this study, we investigate whether this technique can be applied to code generation to improve coherence.
arXiv Detail & Related papers (2023-04-14T00:01:05Z)
CodeExp: Explanatory Code Document Generation [94.43677536210465]
Existing code-to-text generation models produce only high-level summaries of code. We conduct a human study to identify the criteria for high-quality explanatory docstring for code. We present a multi-stage fine-tuning strategy and baseline models for the task.
arXiv Detail & Related papers (2022-11-25T18:05:44Z)
Incorporating Domain Knowledge through Task Augmentation for Front-End JavaScript Code Generation [10.75138604869187]
In some domain-specific scenarios, building such a large paired corpus for code generation is difficult because there is no directly available pairing data. We propose a task augmentation method that incorporates domain knowledge into code generation models through auxiliary tasks and a Subtoken-TranX model. Our experimental results demonstrate that the subtoken-level TranX model outperforms the original TranX model and the Transformer model on our dataset.
arXiv Detail & Related papers (2022-08-22T06:57:51Z)
GraphCodeBERT: Pre-training Code Representations with Data Flow [97.00641522327699]
We present GraphCodeBERT, a pre-trained model for programming language that considers the inherent structure of code. We use data flow in the pre-training stage, which is a semantic-level structure of code that encodes the relation of "where-the-value-comes-from" between variables. We evaluate our model on four tasks, including code search, clone detection, code translation, and code refinement.
arXiv Detail & Related papers (2020-09-17T15:25:56Z)

This list is automatically generated from the titles and abstracts of the papers on this site.