Importance Guided Data Augmentation for Neural-Based Code Understanding
- URL: http://arxiv.org/abs/2402.15769v1
- Date: Sat, 24 Feb 2024 08:57:12 GMT
- Authors: Zeming Dong, Qiang Hu, Xiaofei Xie, Maxime Cordy, Mike Papadakis,
Jianjun Zhao
- Abstract summary: We introduce a general data augmentation framework, GenCode, to enhance the training of code understanding models.
Compared to the state-of-the-art (SOTA) code augmentation method, MixCode, GenCode produces code models with 2.92% higher accuracy and 4.90% higher robustness on average.
- Score: 29.69495788091569
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained code models lead the era of code intelligence. Many models have
been designed with impressive performance recently. However, one important
problem, data augmentation for code data that automatically helps developers
prepare training data, lacks study in the field of code learning. In this paper,
we introduce a general data augmentation framework, GenCode, to enhance the
training of code understanding models. GenCode follows a
generation-and-selection paradigm to prepare useful training codes.
Specifically, it uses code transformation techniques to generate new code
candidates first and then selects important ones as the training data by
importance metrics. To evaluate the effectiveness of GenCode with a general
importance metric -- loss value, we conduct experiments on four code
understanding tasks (e.g., code clone detection) and three pre-trained code
models (e.g., CodeT5). Compared to the state-of-the-art (SOTA) code
augmentation method, MixCode, GenCode produces code models with 2.92% higher
accuracy and 4.90% higher robustness on average.
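The generation-and-selection paradigm described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the two transformations, the `toy_loss` function, and the rule "keep the candidates with the highest loss" are all assumptions made for illustration. The abstract states only that candidates are generated by code transformations and selected by an importance metric such as loss value.

```python
import math
import re
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[str, int]  # (source code, label)


def rename_variable(code: str) -> str:
    """Hypothetical transformation: rename the identifier `x` to `tmp_x`."""
    return re.sub(r"\bx\b", "tmp_x", code)


def insert_dead_code(code: str) -> str:
    """Hypothetical transformation: append a branch that never executes."""
    return code + "\nif False:\n    pass"


def toy_loss(sample: Sample) -> float:
    """Stand-in for a model's cross-entropy loss; NOT a real code model.

    The "model" predicts label 1 with a probability that grows with the
    number of whitespace-separated tokens in the code.
    """
    code, label = sample
    p_one = 1.0 / (1.0 + math.exp(-(len(code.split()) - 8) / 4.0))
    p_true = p_one if label == 1 else 1.0 - p_one
    return -math.log(p_true)


def generate_candidates(
    sample: Sample, transforms: Sequence[Callable[[str], str]]
) -> List[Sample]:
    """Generation step: apply each transformation, keeping the label."""
    code, label = sample
    return [(t(code), label) for t in transforms]


def select_by_loss(
    candidates: Sequence[Sample],
    loss_fn: Callable[[Sample], float],
    keep: int = 1,
) -> List[Sample]:
    """Selection step (assumed rule): keep the `keep` highest-loss candidates."""
    ranked = sorted(candidates, key=loss_fn, reverse=True)
    return ranked[:keep]


def augment(
    dataset: Sequence[Sample],
    transforms: Sequence[Callable[[str], str]],
    loss_fn: Callable[[Sample], float],
    keep: int = 1,
) -> List[Sample]:
    """Run generation and selection for every sample in the dataset."""
    augmented: List[Sample] = []
    for sample in dataset:
        candidates = generate_candidates(sample, transforms)
        augmented.extend(select_by_loss(candidates, loss_fn, keep))
    return augmented


if __name__ == "__main__":
    data = [("x = 1\ny = x + 2", 0)]
    for code, label in augment(data, [rename_variable, insert_dead_code], toy_loss):
        print(label, repr(code))
```

In a real setting, `loss_fn` would presumably be the loss of the code model being trained, evaluated on each candidate, and the selected samples would be used as training data for the next round.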
Related papers
- Does Your Neural Code Completion Model Use My Code? A Membership Inference Approach [69.38352966504401]
We investigate the legal and ethical issues of current neural code completion models.
We tailor a membership inference approach (termed CodeMI) that was originally crafted for classification tasks.
We evaluate the effectiveness of this adapted approach across a diverse array of neural code completion models.
arXiv Detail & Related papers (2024-04-22T15:54:53Z)
- LLM-Assisted Code Cleaning For Training Accurate Code Generators [53.087019724256606]
We investigate data quality for code and find that making the code more structured and readable leads to improved code generation performance of the system.
We build a novel data-cleaning pipeline that uses these principles to transform existing programs.
We evaluate our approach on two challenging algorithmic code generation benchmarks and find that fine-tuning CodeLLaMa-7B improves the performance by up to 30% compared to fine-tuning on the original dataset.
arXiv Detail & Related papers (2023-11-25T02:45:50Z)
- CCT5: A Code-Change-Oriented Pre-Trained Model [14.225942520238936]
We propose to pre-train a model specially designed for code changes to better support developers in software maintenance.
We first collect a large-scale dataset containing 1.5M+ pairwise data of code changes and commit messages.
We fine-tune the pre-trained model, CCT5, on three widely-studied tasks incurred by code changes and two tasks specific to the code review process.
arXiv Detail & Related papers (2023-05-18T07:55:37Z)
- Code Execution with Pre-trained Language Models [88.04688617516827]
Most pre-trained models for code intelligence ignore the execution trace and only rely on source code and syntactic structures.
We develop a mutation-based data augmentation technique to create a large-scale and realistic Python dataset and task for code execution.
We then present CodeExecutor, a Transformer model that leverages code execution pre-training and curriculum learning to enhance its semantic comprehension.
arXiv Detail & Related papers (2023-05-08T10:00:05Z)
- Enriching Source Code with Contextual Data for Code Completion Models: An Empirical Study [4.438873396405334]
We aim to answer whether making code easier to understand through using contextual data improves the performance of pre-trained code language models for the task of code completion.
For comments, we find that the models perform better in the presence of multi-line comments.
arXiv Detail & Related papers (2023-04-24T17:09:14Z)
- CodeExp: Explanatory Code Document Generation [94.43677536210465]
Existing code-to-text generation models produce only high-level summaries of code.
We conduct a human study to identify the criteria for high-quality explanatory docstring for code.
We present a multi-stage fine-tuning strategy and baseline models for the task.
arXiv Detail & Related papers (2022-11-25T18:05:44Z)
- Soft-Labeled Contrastive Pre-training for Function-level Code Representation [127.71430696347174]
We present SCodeR, a Soft-labeled contrastive pre-training framework with two positive sample construction methods.
Considering the relevance between codes in a large-scale code corpus, the soft-labeled contrastive pre-training can obtain fine-grained soft-labels.
SCodeR achieves new state-of-the-art performance on four code-related tasks over seven datasets.
arXiv Detail & Related papers (2022-10-18T05:17:37Z)
- Enhancing Code Classification by Mixup-Based Data Augmentation [16.49710700412084]
We propose a Mixup-based data augmentation approach, MixCode, to enhance the source code classification task.
We evaluate MixCode on two programming languages (Java and Python), two code tasks (problem classification and bug detection), four datasets (JAVA250, Python800, CodRep1, and Refactory) and five model architectures.
arXiv Detail & Related papers (2022-10-06T15:47:54Z)
- NatGen: Generative pre-training by "Naturalizing" source code [18.410818213965918]
We propose a new pre-training objective, "Naturalizing" of source code.
Unlike natural language, code's bimodal, dual-channel nature allows us to generate semantically equivalent code at scale.
We fine-tune our model in three generative Software Engineering tasks to achieve state-of-the-art performance rivaling CodeT5.
arXiv Detail & Related papers (2022-06-15T15:08:29Z)
- Enhancing Semantic Code Search with Multimodal Contrastive Learning and Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search.
We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
arXiv Detail & Related papers (2022-04-07T08:49:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.