Adding Context to Source Code Representations for Deep Learning
- URL: http://arxiv.org/abs/2208.00203v1
- Date: Sat, 30 Jul 2022 12:47:32 GMT
- Title: Adding Context to Source Code Representations for Deep Learning
- Authors: Fuwei Tian and Christoph Treude
- Abstract summary: We argue that it is beneficial for deep learning models to have access to additional contextual information about the code being analysed.
We present preliminary evidence that encoding context from the call hierarchy along with information from the code itself can improve the performance of a state-of-the-art deep learning model.
- Score: 13.676416860721877
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep learning models have been successfully applied to a variety of software
engineering tasks, such as code classification, summarisation, and bug and
vulnerability detection. In order to apply deep learning to these tasks, source
code needs to be represented in a format that is suitable for input into the
deep learning model. Most approaches to representing source code, such as
tokens, abstract syntax trees (ASTs), data flow graphs (DFGs), and control flow
graphs (CFGs) only focus on the code itself and do not take into account
additional context that could be useful for deep learning models. In this
paper, we argue that it is beneficial for deep learning models to have access
to additional contextual information about the code being analysed. We present
preliminary evidence that encoding context from the call hierarchy along with
information from the code itself can improve the performance of a
state-of-the-art deep learning model for two software engineering tasks. We
outline our research agenda for adding further contextual information to source
code representations for deep learning.
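To make the idea concrete, the following is a minimal sketch of one way to serialise call-hierarchy context alongside the code itself; the helper name and the comment-based format are illustrative assumptions, not the encoding used in the paper.

```python
# Minimal sketch (an assumption, not the paper's exact encoding): prepend
# call-hierarchy context -- signatures of callers and callees -- to the
# method body so a sequence model sees the code plus its surrounding context.

def build_model_input(method_body: str,
                      callers: list[str],
                      callees: list[str]) -> str:
    """Concatenate call-hierarchy context with the code itself."""
    context_lines = (
        [f"# caller: {sig}" for sig in callers]
        + [f"# callee: {sig}" for sig in callees]
    )
    return "\n".join(context_lines + [method_body])


body = "def area(r):\n    return 3.14159 * r * r"
print(build_model_input(body, callers=["report_stats(shapes)"], callees=[]))
```

Any sequence-based model that consumes raw code text could take such an enriched string as input in place of the method body alone.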
Related papers

Enhancing Source Code Representations for Deep Learning with Static Analysis [10.222207222039048]
This paper explores the integration of static analysis and additional context such as bug reports and design patterns into source code representations for deep learning models.
We use the Abstract Syntax Tree-based Neural Network (ASTNN) method and augment it with additional context information obtained from bug reports and design patterns.
Our approach enriches the representation and processing of source code, thereby improving task performance.
arXiv Detail & Related papers (2024-02-14T20:17:04Z)
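As a rough illustration of how the context described in the entry above can be fused with an ASTNN-style code representation, here is a minimal PyTorch sketch that concatenates a context embedding (e.g. of a bug report) with a code vector before a classification head; all dimensions and names are assumptions, not the paper's architecture.

```python
# Minimal sketch (an assumed architecture, not the paper's exact model):
# fuse an ASTNN-style code vector with an embedding of extra context
# (e.g. a bug report) by concatenation before a classification head.
import torch
import torch.nn as nn

class ContextAugmentedClassifier(nn.Module):
    def __init__(self, code_dim: int, ctx_dim: int, num_classes: int):
        super().__init__()
        self.head = nn.Linear(code_dim + ctx_dim, num_classes)

    def forward(self, code_vec: torch.Tensor, ctx_vec: torch.Tensor):
        fused = torch.cat([code_vec, ctx_vec], dim=-1)  # simple fusion
        return self.head(fused)

# Dummy embeddings stand in for the ASTNN and context encoders.
model = ContextAugmentedClassifier(code_dim=200, ctx_dim=128, num_classes=10)
logits = model(torch.randn(4, 200), torch.randn(4, 128))
print(logits.shape)  # torch.Size([4, 10])
```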
Encoding Version History Context for Better Code Representation [13.045078976464307]
This paper presents preliminary evidence of the potential benefit of encoding contextual information from the version history to predict code clones and perform code classification.
To ensure the technique performs consistently, we need to conduct a holistic investigation on a larger code base using different combinations of contexts, aggregation, and models.
arXiv Detail & Related papers (2024-02-06T07:35:36Z)

Source Code Data Augmentation for Deep Learning: A Survey [32.035973285175075]
We conduct a comprehensive survey of data augmentation for source code.
We highlight the general strategies and techniques for optimizing data augmentation (DA) quality.
We outline the prevailing challenges and potential opportunities for future research.
arXiv Detail & Related papers (2023-05-31T14:47:44Z)

Improving Deep Representation Learning via Auxiliary Learnable Target Coding [69.79343510578877]
This paper introduces a novel learnable target coding scheme as an auxiliary regularization for deep representation learning.
Specifically, a margin-based triplet loss and a correlation consistency loss on the proposed target codes are designed to encourage more discriminative representations.
arXiv Detail & Related papers (2023-05-30T01:38:54Z)
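As a reference point for the margin-based triplet loss mentioned in the entry above, here is a minimal PyTorch sketch; the margin value and embedding dimension are illustrative assumptions, and the paper's correlation consistency loss is not reproduced.

```python
# Minimal sketch of a margin-based triplet loss (embedding dimension and
# margin are illustrative assumptions, not the paper's settings).
import torch
import torch.nn as nn

triplet_loss = nn.TripletMarginLoss(margin=1.0)

anchor = torch.randn(8, 64, requires_grad=True)  # learned representations
positive = torch.randn(8, 64)                    # same-class target codes
negative = torch.randn(8, 64)                    # different-class target codes

# Pulls each anchor towards its positive target code and pushes it away
# from the negative one by at least the margin.
loss = triplet_loss(anchor, positive, negative)
loss.backward()
print(loss.item())
```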
Survey of Code Search Based on Deep Learning [11.94599964179766]
This survey focuses on code search, that is, retrieving code that matches a given query.
Deep learning, with its ability to extract complex semantic information, has achieved great success in this field.
We propose a new taxonomy of state-of-the-art deep learning-based code search.
arXiv Detail & Related papers (2023-05-10T08:07:04Z)

A Survey of Deep Learning Models for Structural Code Understanding [21.66270320648155]
We present a comprehensive overview of the structures formed from code data.
We categorize the models for understanding code in recent years into two groups: sequence-based and graph-based models.
We also introduce the relevant metrics, datasets, and downstream tasks.
arXiv Detail & Related papers (2022-05-03T03:56:17Z)

Enhancing Semantic Code Search with Multimodal Contrastive Learning and Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search.
We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
arXiv Detail & Related papers (2022-04-07T08:49:27Z)
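For readers unfamiliar with contrastive learning for code search, as used in the entry above, the following is a minimal sketch of an InfoNCE-style objective over paired query and code embeddings; the dimensions, temperature, and stand-in embeddings are assumptions, not the paper's architecture.

```python
# Minimal InfoNCE-style contrastive loss for code search (a generic
# sketch; dimensions and temperature are assumptions, not the paper's).
import torch
import torch.nn.functional as F

def contrastive_loss(query_emb, code_emb, temperature=0.07):
    """Matched (query, code) pairs share an index; all others are negatives."""
    query_emb = F.normalize(query_emb, dim=-1)
    code_emb = F.normalize(code_emb, dim=-1)
    logits = query_emb @ code_emb.t() / temperature  # pairwise similarities
    targets = torch.arange(logits.size(0))           # diagonal = positives
    return F.cross_entropy(logits, targets)

loss = contrastive_loss(torch.randn(16, 128), torch.randn(16, 128))
print(loss.item())
```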
COSEA: Convolutional Code Search with Layer-wise Attention [90.35777733464354]
We propose a new deep learning architecture, COSEA, which leverages convolutional neural networks with layer-wise attention to capture the code's intrinsic structural logic.
COSEA can achieve significant improvements over state-of-the-art methods on code search tasks.
arXiv Detail & Related papers (2020-10-19T13:53:38Z)

GraphCodeBERT: Pre-training Code Representations with Data Flow [97.00641522327699]
We present GraphCodeBERT, a pre-trained model for programming languages that considers the inherent structure of code.
We use data flow in the pre-training stage, which is a semantic-level structure of code that encodes the relation of "where-the-value-comes-from" between variables.
We evaluate our model on four tasks, including code search, clone detection, code translation, and code refinement.
arXiv Detail & Related papers (2020-09-17T15:25:56Z)
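To make the "where-the-value-comes-from" relation from the entry above concrete, here is a toy sketch that extracts such edges from simple assignments using Python's ast module; GraphCodeBERT derives data flow differently, so this only illustrates the relation itself.

```python
# Toy extraction of "where-the-value-comes-from" edges using Python's
# ast module (GraphCodeBERT builds data flow differently; this is only
# an illustrative sketch of the relation itself).
import ast

def value_flow_edges(source: str):
    """Return (target, source_var) pairs for simple assignments."""
    edges = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Assign):
            targets = [t.id for t in node.targets if isinstance(t, ast.Name)]
            sources = [n.id for n in ast.walk(node.value)
                       if isinstance(n, ast.Name)]
            edges += [(t, s) for t in targets for s in sources]
    return edges

code = "a = x\nb = a + x\nc = a + b"
print(value_flow_edges(code))
# [('a', 'x'), ('b', 'a'), ('b', 'x'), ('c', 'a'), ('c', 'b')]
```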
Learning to map source code to software vulnerability using code-as-a-graph [67.62847721118142]
We explore the applicability of Graph Neural Networks in learning the nuances of source code from a security perspective.
We show that a code-as-graph encoding is more meaningful for vulnerability detection than existing code-as-photo and linear sequence encoding approaches.
arXiv Detail & Related papers (2020-06-15T16:05:27Z)

A Transformer-based Approach for Source Code Summarization [86.08359401867577]
We learn code representation for summarization by modeling the pairwise relationship between code tokens.
We show that, despite its simplicity, the approach outperforms state-of-the-art techniques by a significant margin.
arXiv Detail & Related papers (2020-05-01T23:29:36Z)
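The pairwise token relationships in the entry above are what self-attention computes; below is a minimal scaled dot-product attention sketch over code-token embeddings, with all dimensions chosen arbitrarily for illustration rather than taken from the paper.

```python
# Minimal scaled dot-product self-attention over code-token embeddings
# (dimensions are arbitrary; this is the generic mechanism, not the
# paper's full summarisation model).
import math
import torch
import torch.nn.functional as F

tokens = torch.randn(1, 6, 32)     # a batch of 6 code tokens, dim 32
w_q = torch.randn(32, 32)
w_k = torch.randn(32, 32)
w_v = torch.randn(32, 32)

q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
scores = q @ k.transpose(-2, -1) / math.sqrt(32)  # pairwise relationships
attn = F.softmax(scores, dim=-1)                  # one weight per token pair
out = attn @ v                                    # context-mixed embeddings
print(out.shape)  # torch.Size([1, 6, 32])
```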
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of the information presented and accepts no responsibility for any consequences of its use.