Better Language Models of Code through Self-Improvement
- URL: http://arxiv.org/abs/2304.01228v2
- Date: Wed, 10 May 2023 02:36:40 GMT
- Title: Better Language Models of Code through Self-Improvement
- Authors: Hung Quoc To, Nghi D. Q. Bui, Jin Guo, Tien N. Nguyen
- Abstract summary: We propose a simple data augmentation framework for pre-trained language models for code (PLMCs)
Our framework utilizes knowledge gained during the pre-training and fine-tuning stage to generate pseudo data, which is then used as training data for the next step.
The results show that our framework significantly improves PLMCs' performance in code-related sequence generation tasks.
- Score: 18.75015225501755
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained language models for code (PLMCs) have gained attention in recent
research. These models are pre-trained on large-scale datasets using
multi-modal objectives. However, fine-tuning them requires extensive
supervision and is limited by the size of the dataset provided. We aim to
improve this issue by proposing a simple data augmentation framework. Our
framework utilizes knowledge gained during the pre-training and fine-tuning
stage to generate pseudo data, which is then used as training data for the next
step. We incorporate this framework into the state-of-the-art language models,
such as CodeT5, CodeBERT, and UnixCoder. The results show that our framework
significantly improves PLMCs' performance in code-related sequence generation
tasks, such as code summarization and code generation in the CodeXGLUE
benchmark.
Related papers
- Collaborative decoding of critical tokens for boosting factuality of
large language models [57.504894664689]
Finetuned and aligned models show improved abilities of instruction following and safe generation.
The common practice of using sampling during generation also increases chances of hallucination.
We introduce a collaborative decoding framework to harness the high factuality within pretrained models through the concept of critical tokens.
arXiv Detail & Related papers (2024-02-28T01:53:37Z) - Code Representation Learning At Scale [75.04686476303436]
We fuel code representation learning with a vast amount of code data via a two-stage pretraining scheme.
We first train the encoders via a mix that leverages both randomness in masking language modeling and the structure aspect of programming language.
We then enhance the representations via contrastive learning with hard negative and hard positive constructed in an unsupervised manner.
arXiv Detail & Related papers (2024-02-02T22:19:15Z) - LLM-Assisted Code Cleaning For Training Accurate Code Generators [53.087019724256606]
We investigate data quality for code and find that making the code more structured and readable leads to improved code generation performance of the system.
We build a novel data-cleaning pipeline that uses these principles to transform existing programs.
We evaluate our approach on two challenging algorithmic code generation benchmarks and find that fine-tuning CodeLLaMa-7B improves the performance by up to 30% compared to fine-tuning on the original dataset.
arXiv Detail & Related papers (2023-11-25T02:45:50Z) - PanGu-Coder2: Boosting Large Language Models for Code with Ranking
Feedback [5.459517921633247]
We propose a novel RRTF (Rank Responses to align Test&Teacher Feedback) framework, which can effectively and efficiently boost pre-trained large language models for code generation.
Under this framework, we present PanGu-Coder2, which achieves 62.20% pass@1 on the OpenAI HumanEval benchmark.
arXiv Detail & Related papers (2023-07-27T15:28:29Z) - CodeT5+: Open Code Large Language Models for Code Understanding and
Generation [72.1638273937025]
Large language models (LLMs) pretrained on vast source code have achieved prominent progress in code intelligence.
CodeT5+ is a family of encoder-decoder LLMs for code in which component modules can be flexibly combined to suit a wide range of downstream code tasks.
We extensively evaluate CodeT5+ on over 20 code-related benchmarks in different settings, including zero-shot, finetuning, and instruction-tuning.
arXiv Detail & Related papers (2023-05-13T14:23:07Z) - Stochastic Code Generation [1.7205106391379026]
Large language models pre-trained for code generation can generate high-quality short code but often struggle with generating coherent long code.
This issue is also observed in language modeling for long text generation.
In this study, we investigate whether this technique can be applied to code generation to improve coherence.
arXiv Detail & Related papers (2023-04-14T00:01:05Z) - CodeExp: Explanatory Code Document Generation [94.43677536210465]
Existing code-to-text generation models produce only high-level summaries of code.
We conduct a human study to identify the criteria for high-quality explanatory docstring for code.
We present a multi-stage fine-tuning strategy and baseline models for the task.
arXiv Detail & Related papers (2022-11-25T18:05:44Z) - Incorporating Domain Knowledge through Task Augmentation for Front-End
JavaScript Code Generation [10.75138604869187]
In some domain-specific scenarios, building such a large paired corpus for code generation is difficult because there is no directly available pairing data.
We propose a task augmentation method that incorporates domain knowledge into code generation models through auxiliary tasks and a Subtoken-TranX model.
Our experimental results demonstrate that the subtoken-level TranX model outperforms the original TranX model and the Transformer model on our dataset.
arXiv Detail & Related papers (2022-08-22T06:57:51Z) - GraphCodeBERT: Pre-training Code Representations with Data Flow [97.00641522327699]
We present GraphCodeBERT, a pre-trained model for programming language that considers the inherent structure of code.
We use data flow in the pre-training stage, which is a semantic-level structure of code that encodes the relation of "where-the-value-comes-from" between variables.
We evaluate our model on four tasks, including code search, clone detection, code translation, and code refinement.
arXiv Detail & Related papers (2020-09-17T15:25:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.