Boosting Source Code Learning with Data Augmentation: An Empirical Study
- URL: http://arxiv.org/abs/2303.06808v1
- Date: Mon, 13 Mar 2023 01:47:05 GMT
- Title: Boosting Source Code Learning with Data Augmentation: An Empirical Study
- Authors: Zeming Dong, Qiang Hu, Yuejun Guo, Zhenya Zhang, Maxime Cordy, Mike
Papadakis, Yves Le Traon, Jianjun Zhao
- Abstract summary: We study whether data augmentation methods originally used for text and graphs are effective in improving the training quality of source code learning.
Our results identify the data augmentation methods that can produce more accurate and robust models for source code learning.
- Score: 16.49710700412084
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The next era of program understanding is being propelled by the use of
machine learning to solve software problems. Recent studies have shown
surprising results of source code learning, which applies deep neural networks
(DNNs) to various critical software tasks, e.g., bug detection and clone
detection. This success can be largely attributed to the use of massive,
high-quality training data, and in practice, data augmentation, a technique
for producing additional training data, has been widely adopted in domains
such as computer vision. In source code learning, however, data augmentation
has not been extensively studied, and existing practice is limited to simple
syntax-preserving methods such as code refactoring. Essentially, source code
used as training data is typically represented in two ways: sequentially, as
text data, and structurally, as graph data. Inspired by these analogies, we
take an early step toward investigating whether data augmentation methods
originally designed for text and graphs are effective in improving the
training quality of source code learning. To that end, we first collect and
categorize the data augmentation methods in the literature. Second, we conduct
a comprehensive empirical study on four critical tasks and 11 DNN
architectures to explore the effectiveness of 12 data augmentation methods
(code refactoring and 11 methods for text and graph data). Our results
identify the data augmentation methods that produce more accurate and robust
models for source code learning, including those based on mixup (e.g.,
SenMixup for text and Manifold-Mixup for graphs) and those that slightly break
the syntax of source code (e.g., random swap and random deletion for text).
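For illustration, below is a minimal sketch of the two families of methods the abstract highlights: slight syntax-breaking token edits (random swap, random deletion) and mixup-style interpolation. This is plain Python with illustrative names and parameters, not code from the paper; note that SenMixup and Manifold-Mixup interpolate hidden representations inside the network, which the last function only approximates on plain feature vectors.

```python
import random

def random_swap(tokens, n_swaps=1, seed=None):
    # Swap randomly chosen token pairs: a slight syntax-breaking
    # augmentation for code treated as a token sequence.
    rng = random.Random(seed)
    tokens = list(tokens)
    for _ in range(n_swaps):
        if len(tokens) < 2:
            break
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_deletion(tokens, p=0.1, seed=None):
    # Drop each token independently with probability p, keeping at
    # least one token so the example never becomes empty.
    rng = random.Random(seed)
    kept = [t for t in tokens if rng.random() > p]
    return kept if kept else [rng.choice(list(tokens))]

def mixup(x1, x2, y1, y2, alpha=0.2, seed=None):
    # Mixup-style augmentation (illustrative): interpolate two
    # feature vectors and their one-hot labels with a Beta-sampled
    # coefficient. Manifold-Mixup applies the same interpolation to
    # hidden layers; SenMixup does so for sentence representations.
    rng = random.Random(seed)
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam

tokens = "def add ( a , b ) : return a + b".split()
print(random_swap(tokens, n_swaps=2, seed=0))
print(random_deletion(tokens, p=0.15, seed=0))
```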
Related papers
- SIaM: Self-Improving Code-Assisted Mathematical Reasoning of Large Language Models [54.78329741186446]
We propose a novel paradigm that uses a code-based critic model to guide steps including question-code data construction, quality control, and complementary evaluation.
Experiments across both in-domain and out-of-domain benchmarks in English and Chinese demonstrate the effectiveness of the proposed paradigm.
arXiv Detail & Related papers (2024-08-28T06:33:03Z)
- CodeGRAG: Bridging the Gap between Natural Language and Programming Language via Graphical Retrieval Augmented Generation [58.84212778960507]
We propose CodeGRAG, a Graphical Retrieval Augmented Code Generation framework to enhance the performance of LLMs.
CodeGRAG builds a graphical view of code blocks from their control flow and data flow to bridge the gap between programming languages and natural language.
Experiments and ablations on four datasets covering both C++ and Python validate the hard meta-graph prompt, the soft prompting technique, and the effectiveness of the objectives for the pretrained GNN expert.
arXiv Detail & Related papers (2024-05-03T02:48:55Z)
- Enhancing Source Code Representations for Deep Learning with Static Analysis [10.222207222039048]
This paper explores the integration of static analysis and additional context such as bug reports and design patterns into source code representations for deep learning models.
We use the Abstract Syntax Tree-based Neural Network (ASTNN) method and augment it with additional context information obtained from bug reports and design patterns.
Our approach improves the representation and processing of source code, thereby improving task performance.
arXiv Detail & Related papers (2024-02-14T20:17:04Z)
- Source Code Data Augmentation for Deep Learning: A Survey [32.035973285175075]
We conduct a comprehensive survey of data augmentation for source code.
We highlight the general strategies and techniques to optimize the DA quality.
We outline the prevailing challenges and potential opportunities for future research.
arXiv Detail & Related papers (2023-05-31T14:47:44Z)
- Exploring Representation-Level Augmentation for Code Search [50.94201167562845]
We explore augmentation methods that augment data (both code and query) at the representation level, which requires no additional data processing or training.
We experimentally evaluate the proposed representation-level augmentation methods with state-of-the-art code search models on a large-scale public dataset.
arXiv Detail & Related papers (2022-10-21T22:47:37Z)
- Adding Context to Source Code Representations for Deep Learning [13.676416860721877]
We argue that it is beneficial for deep learning models to have access to additional contextual information about the code being analysed.
We present preliminary evidence that encoding context from the call hierarchy along with information from the code itself can improve the performance of a state-of-the-art deep learning model.
arXiv Detail & Related papers (2022-07-30T12:47:32Z)
- Enhancing Semantic Code Search with Multimodal Contrastive Learning and Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search.
We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
arXiv Detail & Related papers (2022-04-07T08:49:27Z)
- Reinforced Iterative Knowledge Distillation for Cross-Lingual Named Entity Recognition [54.92161571089808]
Cross-lingual NER transfers knowledge from high-resource languages to low-resource languages.
Existing cross-lingual NER methods do not make good use of rich unlabeled data in target languages.
We develop a novel approach based on the ideas of semi-supervised learning and reinforcement learning.
arXiv Detail & Related papers (2021-06-01T05:46:22Z)
- Improved Code Summarization via a Graph Neural Network [96.03715569092523]
In general, source code summarization techniques take source code as input and output a natural language description.
We present an approach that uses a graph-based neural architecture that better matches the default structure of the AST to generate these summaries.
arXiv Detail & Related papers (2020-04-06T17:36:42Z)