Source Code Data Augmentation for Deep Learning: A Survey
- URL: http://arxiv.org/abs/2305.19915v4
- Date: Mon, 13 Nov 2023 17:34:53 GMT
- Title: Source Code Data Augmentation for Deep Learning: A Survey
- Authors: Terry Yue Zhuo, Zhou Yang, Zhensu Sun, Yufei Wang, Li Li, Xiaoning Du,
Zhenchang Xing, David Lo
- Abstract summary: We conduct a comprehensive survey of data augmentation for source code.
We highlight the general strategies and techniques to optimize the DA quality.
We outline the prevailing challenges and potential opportunities for future research.
- Score: 32.035973285175075
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The increasingly popular adoption of deep learning models in many critical
source code tasks motivates the development of data augmentation (DA)
techniques to enhance training data and improve various capabilities (e.g.,
robustness and generalizability) of these models. Although a series of DA
methods have been proposed and tailored for source code models, there is no
comprehensive survey and examination to understand their effectiveness and
implications. This paper fills this gap by conducting a comprehensive and
integrative survey of data augmentation for source code, wherein we
systematically compile and encapsulate existing literature to provide a
comprehensive overview of the field. We start with an introduction of data
augmentation in source code and then provide a discussion on major
representative approaches. Next, we highlight the general strategies and
techniques to optimize the DA quality. Subsequently, we underscore techniques
useful in real-world source code scenarios and downstream tasks. Finally, we
outline the prevailing challenges and potential opportunities for future
research. In essence, we aim to demystify the corpus of existing literature on
source code DA for deep learning, and foster further exploration in this
sphere. Complementing this, we present a continually updated GitHub repository
that hosts a list of up-to-date papers on DA for source code modeling,
accessible at https://github.com/terryyz/DataAug4Code.
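To make the idea of source code DA concrete, here is a minimal sketch of one widely used kind of transformation: semantics-preserving identifier renaming. The abstract does not name specific methods, so this example is an illustrative assumption rather than a technique taken from the survey. The helper names (`rename_identifiers`, `_Collector`, `_Renamer`) are hypothetical, and the sketch deliberately ignores scoping rules and keyword-argument calls. It needs Python 3.9+ for `ast.unparse`.

```python
import ast


class _Collector(ast.NodeVisitor):
    """Pass 1: record parameters and assigned names in traversal order."""

    def __init__(self):
        self.mapping = {}

    def _add(self, name):
        # Assign the next placeholder only the first time a name is seen.
        self.mapping.setdefault(name, f"var_{len(self.mapping)}")

    def visit_arg(self, node):
        self._add(node.arg)

    def visit_Name(self, node):
        if isinstance(node.ctx, ast.Store):
            self._add(node.id)


class _Renamer(ast.NodeTransformer):
    """Pass 2: rewrite every collected name to its placeholder."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_arg(self, node):
        node.arg = self.mapping.get(node.arg, node.arg)
        return node

    def visit_Name(self, node):
        # Names never assigned in the snippet (builtins, imports) are untouched.
        node.id = self.mapping.get(node.id, node.id)
        return node


def rename_identifiers(source: str) -> str:
    """Return an augmented copy of `source` with variables renamed.

    Hypothetical helper: function names, attributes, and keyword-argument
    call sites are left unchanged, so calls such as f(x=1) are not handled.
    """
    tree = ast.parse(source)
    collector = _Collector()
    collector.visit(tree)
    tree = _Renamer(collector.mapping).visit(tree)
    return ast.unparse(tree)


original = "def add(a, b):\n    total = a + b\n    return total\n"
print(rename_identifiers(original))
```

The augmented snippet behaves like the original but exposes the model to different surface tokens, which is the usual motivation for this kind of transformation when robustness is the goal.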
Related papers
- Developing Retrieval Augmented Generation (RAG) based LLM Systems from PDFs: An Experience Report [3.4632900249241874]
This paper presents an experience report on the development of Retrieval Augmented Generation (RAG) systems using PDF documents as the primary data source.
The RAG architecture combines generative capabilities of Large Language Models (LLMs) with the precision of information retrieval.
The practical implications of this research lie in enhancing the reliability of generative AI systems in various sectors.
(2024-10-21T12:21:49Z)
- SIaM: Self-Improving Code-Assisted Mathematical Reasoning of Large Language Models [54.78329741186446]
We propose a novel paradigm that uses a code-based critic model to guide steps including question-code data construction, quality control, and complementary evaluation.
Experiments across both in-domain and out-of-domain benchmarks in English and Chinese demonstrate the effectiveness of the proposed paradigm.
(2024-08-28T06:33:03Z)
- Enhancing Source Code Representations for Deep Learning with Static Analysis [10.222207222039048]
This paper explores the integration of static analysis and additional context such as bug reports and design patterns into source code representations for deep learning models.
We use the Abstract Syntax Tree-based Neural Network (ASTNN) method and augment it with additional context information obtained from bug reports and design patterns.
Our approach improves the representation and processing of source code, thereby improving task performance.
(2024-02-14T20:17:04Z)
- Data Optimization in Deep Learning: A Survey [3.1274367448459253]
This study aims to organize a wide range of existing data optimization methodologies for deep learning.
The constructed taxonomy considers the diversity of split dimensions, and deep sub-taxonomies are constructed for each dimension.
The constructed taxonomy and the revealed connections should support a better understanding of existing methods and the design of novel data optimization techniques.
(2023-10-25T09:33:57Z)
- SoTaNa: The Open-Source Software Development Assistant [81.86136560157266]
SoTaNa is an open-source software development assistant.
It generates high-quality instruction-based data for the domain of software engineering.
It employs a parameter-efficient fine-tuning approach to enhance the open-source foundation model, LLaMA.
(2023-08-25T14:56:21Z)
- Boosting Source Code Learning with Data Augmentation: An Empirical Study [16.49710700412084]
We study whether data augmentation methods originally used for text and graphs are effective in improving the training quality of source code learning.
Our results identify the data augmentation methods that can produce more accurate and robust models for source code learning.
(2023-03-13T01:47:05Z)
- Dataset Distillation: A Comprehensive Review [76.26276286545284]
Dataset distillation (DD) aims to derive a much smaller dataset of synthetic samples such that models trained on it perform comparably to models trained on the original dataset.
This paper gives a comprehensive review and summary of recent advances in DD and its application.
(2023-01-17T17:03:28Z)
- Deep learning for table detection and structure recognition: A survey [49.09628624903334]
The goal of this survey is to provide a profound comprehension of the major developments in the field of Table Detection.
We provide an analysis of both classic and new applications in the field.
The datasets and source code of the existing models are organized to provide the reader with a compass on this vast literature.
(2022-11-15T19:42:27Z)
- Exploring Representation-Level Augmentation for Code Search [50.94201167562845]
We explore methods that augment data (both code and query) at the representation level, which requires no additional data processing or training.
We experimentally evaluate the proposed representation-level augmentation methods with state-of-the-art code search models on a large-scale public dataset.
(2022-10-21T22:47:37Z)
- Adding Context to Source Code Representations for Deep Learning [13.676416860721877]
We argue that it is beneficial for deep learning models to have access to additional contextual information about the code being analysed.
We present preliminary evidence that encoding context from the call hierarchy along with information from the code itself can improve the performance of a state-of-the-art deep learning model.
(2022-07-30T12:47:32Z)
- Improved Code Summarization via a Graph Neural Network [96.03715569092523]
In general, source code summarization techniques take source code as input and output a natural language description.
We present an approach that uses a graph-based neural architecture that better matches the default structure of the AST to generate these summaries.
(2020-04-06T17:36:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.