Enhancing Source Code Representations for Deep Learning with Static
Analysis
- URL: http://arxiv.org/abs/2402.09557v1
- Date: Wed, 14 Feb 2024 20:17:04 GMT
- Title: Enhancing Source Code Representations for Deep Learning with Static
Analysis
- Authors: Xueting Guan, Christoph Treude
- Abstract summary: This paper explores the integration of static analysis and additional context such as bug reports and design patterns into source code representations for deep learning models.
We use the Abstract Syntax Tree-based Neural Network (ASTNN) method and augment it with additional context information obtained from bug reports and design patterns.
Our approach improves the representation and processing of source code, thereby improving task performance.
- Score: 10.222207222039048
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep learning techniques applied to program analysis tasks such as code
classification, summarization, and bug detection have seen widespread interest.
Traditional approaches, however, treat programming source code as natural
language text, which may neglect significant structural or semantic details.
Additionally, most current methods of representing source code focus solely on
the code, without considering beneficial additional context. This paper
explores the integration of static analysis and additional context such as bug
reports and design patterns into source code representations for deep learning
models. We use the Abstract Syntax Tree-based Neural Network (ASTNN) method and
augment it with additional context information obtained from bug reports and
design patterns, creating an enriched source code representation that
significantly enhances the performance of common software engineering tasks
such as code classification and code clone detection. Utilizing existing
open-source code data, our approach improves the representation and processing
of source code, thereby improving task performance.
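The core idea (augmenting an AST-based code embedding with vectors derived from bug reports and design patterns, then feeding the concatenated representation to downstream tasks) can be illustrated with a minimal, hypothetical sketch. The hashing-trick encoder and all function names below are illustrative assumptions, not the authors' actual ASTNN pipeline:

```python
# Hypothetical sketch of context enrichment by concatenation.
# `bow_embedding` and `enrich_representation` are illustrative names;
# the real approach uses ASTNN statement-tree encodings, not this toy encoder.

def bow_embedding(text: str, dim: int = 8) -> list[float]:
    """Toy bag-of-words embedding of context text via the hashing trick."""
    vec = [0.0] * dim
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    return vec

def enrich_representation(code_vec: list[float],
                          bug_report: str,
                          design_pattern: str,
                          dim: int = 8) -> list[float]:
    """Concatenate the code vector with embeddings of the extra context."""
    return code_vec + bow_embedding(bug_report, dim) + bow_embedding(design_pattern, dim)

# Stand-in for an ASTNN encoding of a code fragment.
code_vec = [0.1, 0.2, 0.3]
enriched = enrich_representation(code_vec, "NullPointerException in parser", "Visitor")
print(len(enriched))  # original dimensions plus one context block per source
```

In practice the concatenated vector would be passed to the classifier or clone detector in place of the code-only representation.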
Related papers
- Encoding Version History Context for Better Code Representation [13.045078976464307]
This paper presents preliminary evidence of the potential benefit of encoding contextual information from the version history to predict code clones and perform code classification.
To ensure the technique performs consistently, we need to conduct a holistic investigation on a larger code base using different combinations of contexts, aggregation, and models.
arXiv Detail & Related papers (2024-02-06T07:35:36Z)
- Boosting Source Code Learning with Data Augmentation: An Empirical Study [16.49710700412084]
We study whether data augmentation methods originally used for text and graphs are effective in improving the training quality of source code learning.
Our results identify the data augmentation methods that can produce more accurate and robust models for source code learning.
arXiv Detail & Related papers (2023-03-13T01:47:05Z)
- Exploring Representation-Level Augmentation for Code Search [50.94201167562845]
We explore augmentation methods that augment data (both code and query) at the representation level, which requires no additional data processing or training.
We experimentally evaluate the proposed representation-level augmentation methods with state-of-the-art code search models on a large-scale public dataset.
arXiv Detail & Related papers (2022-10-21T22:47:37Z)
- Adding Context to Source Code Representations for Deep Learning [13.676416860721877]
We argue that it is beneficial for deep learning models to have access to additional contextual information about the code being analysed.
We present preliminary evidence that encoding context from the call hierarchy along with information from the code itself can improve the performance of a state-of-the-art deep learning model.
arXiv Detail & Related papers (2022-07-30T12:47:32Z)
- Enhancing Semantic Code Search with Multimodal Contrastive Learning and Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search.
We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
arXiv Detail & Related papers (2022-04-07T08:49:27Z)
- ReACC: A Retrieval-Augmented Code Completion Framework [53.49707123661763]
We propose a retrieval-augmented code completion framework that leverages both lexical copying and retrieval of semantically similar code.
We evaluate our approach on the code completion task for the Python and Java programming languages, achieving state-of-the-art performance on the CodeXGLUE benchmark.
arXiv Detail & Related papers (2022-03-15T08:25:08Z)
- Software Vulnerability Detection via Deep Learning over Disaggregated Code Graph Representation [57.92972327649165]
This work explores a deep learning approach to automatically learn the insecure patterns from code corpora.
Because parsed code naturally admits a graph structure, we develop a novel graph neural network (GNN) to exploit both the semantic context and structural regularity of a program.
arXiv Detail & Related papers (2021-09-07T21:24:36Z)
- Code2Image: Intelligent Code Analysis by Computer Vision Techniques and Application to Vulnerability Prediction [0.6091702876917281]
We present a novel method to represent source code as an image while preserving semantic and syntactic properties.
The method makes it possible to feed the resulting image representation of source code directly into deep learning (DL) algorithms as input.
We demonstrate feasibility and effectiveness of our method by realizing a vulnerability prediction use case over a public dataset.
arXiv Detail & Related papers (2021-05-07T09:10:20Z)
- A Transformer-based Approach for Source Code Summarization [86.08359401867577]
We learn code representation for summarization by modeling the pairwise relationship between code tokens.
We show that although the approach is simple, it outperforms state-of-the-art techniques by a significant margin.
arXiv Detail & Related papers (2020-05-01T23:29:36Z)
- Improved Code Summarization via a Graph Neural Network [96.03715569092523]
In general, source code summarization techniques take source code as input and output a natural language description.
We present an approach that uses a graph-based neural architecture that better matches the default structure of the AST to generate these summaries.
arXiv Detail & Related papers (2020-04-06T17:36:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.