Enhancing Code Classification by Mixup-Based Data Augmentation
- URL: http://arxiv.org/abs/2210.03003v1
- Date: Thu, 6 Oct 2022 15:47:54 GMT
- Title: Enhancing Code Classification by Mixup-Based Data Augmentation
- Authors: Zeming Dong, Qiang Hu, Yuejun Guo, Maxime Cordy, Mike Papadakis, Yves Le Traon, and Jianjun Zhao
- Abstract summary: We propose a Mixup-based data augmentation approach, MixCode, to enhance the source code classification task.
We evaluate MixCode on two programming languages (JAVA and Python), two code tasks (problem classification and bug detection), four datasets (JAVA250, Python800, CodRep1, and Refactory) and five model architectures.
- Score: 16.49710700412084
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, deep neural networks (DNNs) have been widely applied in programming
language understanding. Generally, training a DNN model with competitive
performance requires massive and high-quality labeled training data. However,
collecting and labeling such data is time-consuming and labor-intensive. To
tackle this issue, data augmentation, which enlarges the training set with
carefully generated samples (e.g., adversarial examples), has been a popular
solution. However, few works focus on employing it for programming
language-related tasks. In this paper, we propose a Mixup-based data
augmentation approach, MixCode, to enhance the source code classification task.
First, we utilize multiple code refactoring methods to generate
label-consistent code data. Second, the Mixup technique is employed to mix the
original code and transformed code to form the new training data to train the
model. We evaluate MixCode on two programming languages (JAVA and Python), two
code tasks (problem classification and bug detection), four datasets (JAVA250,
Python800, CodRep1, and Refactory), and 5 model architectures. Experimental
results demonstrate that MixCode outperforms the standard data augmentation
baseline by up to 6.24% accuracy improvement and 26.06% robustness
improvement.
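The Mixup step mentioned in the abstract can be illustrated with a minimal sketch of generic Mixup, assuming each code snippet has already been encoded as a fixed-length vector and each label as a one-hot vector. The function names (`one_hot`, `interpolate`, `mixup_pair`), the `alpha` value, and the toy vectors are illustrative assumptions and are not taken from the MixCode paper; how MixCode pairs original and transformed samples is not specified in this summary.

```python
import random
from typing import List, Optional, Sequence, Tuple

Vector = List[float]


def one_hot(label: int, num_classes: int) -> Vector:
    """Encode a class index as a one-hot vector."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def interpolate(a: Sequence[float], b: Sequence[float], lam: float) -> Vector:
    """Return the element-wise mix lam * a + (1 - lam) * b."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]


def mixup_pair(
    x_orig: Sequence[float],
    y_orig: Sequence[float],
    x_trans: Sequence[float],
    y_trans: Sequence[float],
    alpha: float = 0.2,  # assumed value; the Beta(alpha, alpha) shape is a tunable hyperparameter
    rng: Optional[random.Random] = None,
) -> Tuple[Vector, Vector, float]:
    """Mix one (embedding, label) pair with another using a Beta-distributed weight."""
    rng = rng or random.Random()
    lam = rng.betavariate(alpha, alpha)  # mixing weight in [0, 1]
    x_mix = interpolate(x_orig, x_trans, lam)
    y_mix = interpolate(y_orig, y_trans, lam)
    return x_mix, y_mix, lam


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical embeddings of an original program and a refactored program.
    x_a, y_a = [0.2, 0.8, 0.1], one_hot(1, 3)
    x_b, y_b = [0.6, 0.1, 0.4], one_hot(2, 3)
    x_mix, y_mix, lam = mixup_pair(x_a, y_a, x_b, y_b, rng=rng)
    print("lambda:", lam)
    print("mixed embedding:", x_mix)
    print("mixed label:", y_mix)
```

When both inputs carry the same label, as with a label-consistent refactoring of a single program, the mixed label stays (up to floating-point rounding) equal to that label and only the embedding is interpolated.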
Related papers
- UnitCoder: Scalable Iterative Code Synthesis with Unit Test Guidance [65.01483640267885]
Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks, yet code generation remains a major challenge.
We introduce UnitCoder, a systematic pipeline leveraging model-generated unit tests to guide and validate the code generation process.
Our work presents a scalable approach that leverages model-generated unit tests to guide the synthesis of high-quality code data from pre-training corpora.
arXiv Detail & Related papers (2025-02-17T05:37:02Z)
- CoRNStack: High-Quality Contrastive Data for Better Code Ranking [45.18877655831977]
We introduce CoRNStack, a large-scale, high-quality contrastive training dataset for code that spans multiple programming languages.
This dataset is curated using consistency filtering to eliminate noisy positives and is further enriched with mined hard negatives.
We demonstrate that contrastive training of embedding models using CoRNStack leads to state-of-the-art performance across a variety of code retrieval tasks.
arXiv Detail & Related papers (2024-12-01T23:54:12Z)
- GenCode: A Generic Data Augmentation Framework for Boosting Deep Learning-Based Code Understanding [28.02426812004216]
We introduce a generic data augmentation framework, GenCode, to enhance the training of code understanding models.
To evaluate the effectiveness of GenCode, we conduct experiments on four code understanding tasks and three pre-trained code models.
Compared to the state-of-the-art (SOTA) code augmentation method, MixCode, GenCode produces code models with 2.92% higher accuracy and 4.90% robustness on average.
arXiv Detail & Related papers (2024-02-24T08:57:12Z)
- LLM-Assisted Code Cleaning For Training Accurate Code Generators [53.087019724256606]
We investigate data quality for code and find that making the code more structured and readable leads to improved code generation performance of the system.
We build a novel data-cleaning pipeline that uses these principles to transform existing programs.
We evaluate our approach on two challenging algorithmic code generation benchmarks and find that fine-tuning CodeLLaMa-7B improves the performance by up to 30% compared to fine-tuning on the original dataset.
arXiv Detail & Related papers (2023-11-25T02:45:50Z)
- CCT5: A Code-Change-Oriented Pre-Trained Model [14.225942520238936]
We propose to pre-train a model specially designed for code changes to better support developers in software maintenance.
We first collect a large-scale dataset containing 1.5M+ pairwise data of code changes and commit messages.
We fine-tune the pre-trained model, CCT5, on three widely-labelled tasks incurred by code changes and two tasks specific to the code review process.
arXiv Detail & Related papers (2023-05-18T07:55:37Z)
- Enhancing Semantic Code Search with Multimodal Contrastive Learning and Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search.
We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
arXiv Detail & Related papers (2022-04-07T08:49:27Z)
- Improving Calibration for Long-Tailed Recognition [68.32848696795519]
We propose two methods to improve calibration and performance in such scenarios.
For dataset bias due to different samplers, we propose shifted batch normalization.
Our proposed methods set new records on multiple popular long-tailed recognition benchmark datasets.
arXiv Detail & Related papers (2021-04-01T13:55:21Z)
- XMixup: Efficient Transfer Learning with Auxiliary Samples by Cross-domain Mixup [60.07531696857743]
Cross-domain Mixup (XMixup) improves the multitask paradigm for deep transfer learning.
XMixup selects the auxiliary samples from the source dataset and augments training samples via the simple mixup strategy.
Experiment results show that XMixup improves the accuracy by 1.9% on average.
arXiv Detail & Related papers (2020-07-20T16:42:29Z)
- Auto-Encoding Twin-Bottleneck Hashing [141.5378966676885]
This paper proposes an efficient and adaptive code-driven graph.
It is updated by decoding in the context of an auto-encoder.
Experiments on benchmarked datasets clearly show the superiority of our framework over the state-of-the-art hashing methods.
arXiv Detail & Related papers (2020-02-27T05:58:12Z)