Deep Learning for Code Intelligence: Survey, Benchmark and Toolkit
- URL: http://arxiv.org/abs/2401.00288v1
- Date: Sat, 30 Dec 2023 17:48:37 GMT
- Title: Deep Learning for Code Intelligence: Survey, Benchmark and Toolkit
- Authors: Yao Wan, Yang He, Zhangqian Bi, Jianguo Zhang, Hongyu Zhang, Yulei
Sui, Guandong Xu, Hai Jin, Philip S. Yu
- Abstract summary: Code intelligence leverages machine learning techniques to extract knowledge from extensive code corpora.
There is already a thriving research community focusing on code intelligence.
- Score: 63.82016263181941
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Code intelligence leverages machine learning techniques to extract knowledge
from extensive code corpora, with the aim of developing intelligent tools that
improve the quality and productivity of computer programming. A thriving research
community has already formed around code intelligence, with efforts spanning
software engineering, machine learning, data mining, natural language processing,
and programming languages. In this paper, we
conduct a comprehensive literature review on deep learning for code
intelligence, from the aspects of code representation learning, deep learning
techniques, and application tasks. We also benchmark several state-of-the-art
neural models for code intelligence, and provide an open-source toolkit
tailored for the rapid prototyping of deep-learning-based code intelligence
models. In particular, we examine existing code intelligence models through
the lens of code representation learning, and provide a comprehensive overview
of the present state of code intelligence.
Furthermore, we publicly release the source code and data resources to provide
the community with a ready-to-use benchmark, which can facilitate the
evaluation and comparison of existing and future code intelligence models
(https://xcodemind.github.io). Finally, we point out several challenging
and promising directions for future research.
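As a concrete illustration of the rapid prototyping the toolkit targets, the sketch below encodes a snippet with a pretrained code model. It relies on the Hugging Face transformers API and the microsoft/codebert-base checkpoint as stand-ins; the paper's own toolkit (linked above) exposes its own interface.

```python
# A minimal sketch of rapid prototyping with a pretrained code encoder.
# Assumes the Hugging Face `transformers` and `torch` packages; the
# microsoft/codebert-base checkpoint stands in for the paper's toolkit.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")

snippet = "def add(a, b):\n    return a + b"
inputs = tokenizer(snippet, return_tensors="pt", truncation=True)

with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool token embeddings into one vector usable for code search,
# clone detection, or a downstream classification head.
code_vector = outputs.last_hidden_state.mean(dim=1)
print(code_vector.shape)  # torch.Size([1, 768])
```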
Related papers
- CodeGRAG: Bridging the Gap between Natural Language and Programming Language via Graphical Retrieval Augmented Generation [58.84212778960507]
We propose CodeGRAG, a Graphical Retrieval Augmented Code Generation framework to enhance the performance of LLMs.
CodeGRAG builds a graphical view of code blocks from their control flow and data flow to bridge the gap between programming languages and natural language.
Experiments and ablations on four datasets covering both C++ and Python validate the hard meta-graph prompt, the soft prompting technique, and the effectiveness of the objectives for the pretrained GNN expert.
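The core idea can be sketched in a few lines: derive a graph over statements from control flow and data flow. The toy version below uses Python's ast module (Python 3.9+ for ast.unparse), tracks only sequential control flow and def-use chains, and is not the authors' implementation.

```python
# Toy graph view of a function: control-flow and data-flow edges over
# statements, in the spirit of (but far simpler than) CodeGRAG.
import ast

def build_graph(source: str):
    """Return (nodes, edges) for the first function in `source`."""
    fn = ast.parse(source).body[0]
    nodes = [ast.unparse(stmt) for stmt in fn.body]  # one node per statement
    edges, last_def = [], {}
    for i, stmt in enumerate(fn.body):
        if i > 0:
            edges.append((i - 1, i, "control"))  # sequential control flow only
        for node in ast.walk(stmt):
            if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load):
                if node.id in last_def:
                    edges.append((last_def[node.id], i, "data"))  # def-use edge
            elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
                last_def[node.id] = i
    return nodes, edges

nodes, edges = build_graph("def f(x):\n    y = x + 1\n    z = y * 2\n    return z")
print(edges)  # control: (0,1),(1,2); data: y from 0 to 1, z from 1 to 2
```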
arXiv Detail & Related papers (2024-05-03T02:48:55Z)
- How Far Have We Gone in Binary Code Understanding Using Large Language Models [51.527805834378974]
We propose a benchmark to evaluate the effectiveness of Large Language Models (LLMs) in binary code understanding.
Our evaluations reveal that existing LLMs can understand binary code to a certain extent and can thereby improve the efficiency of binary code analysis.
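The typical pipeline such a benchmark probes looks like the sketch below: disassemble raw bytes, then hand the assembly to an LLM. The capstone calls are standard; the byte string (a compiled "square" function) and the prompt wording are illustrative, not drawn from the benchmark.

```python
# Sketch: disassemble a binary function with capstone and build an LLM prompt.
from capstone import Cs, CS_ARCH_X86, CS_MODE_64

# x86-64 bytes for: int square(int x) { return x * x; }
code = b"\x55\x48\x89\xe5\x89\x7d\xfc\x8b\x45\xfc\x0f\xaf\xc0\x5d\xc3"

md = Cs(CS_ARCH_X86, CS_MODE_64)
assembly = "\n".join(
    f"0x{i.address:x}:\t{i.mnemonic}\t{i.op_str}" for i in md.disasm(code, 0x1000)
)

prompt = "Summarize what the following x86-64 function computes:\n" + assembly
print(prompt)  # send `prompt` to an LLM of choice for function-level understanding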
arXiv Detail & Related papers (2024-04-15T14:44:08Z)
- Enhancing Source Code Representations for Deep Learning with Static Analysis [10.222207222039048]
This paper explores the integration of static analysis and additional context such as bug reports and design patterns into source code representations for deep learning models.
We use the Abstract Syntax Tree-based Neural Network (ASTNN) method and augment it with additional context information obtained from bug reports and design patterns.
Our approach enriches the representation of source code and thereby improves downstream task performance.
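A rough approximation of this preprocessing is sketched below, assuming Python code and hypothetical bug-report keywords: split a function into statement-level AST subtrees (ASTNN's core step) and append the extra context to the model input. The authors' actual pipeline and parser differ.

```python
# Sketch of ASTNN-style preprocessing plus extra context. ASTNN encodes a
# function as a sequence of statement-level AST subtrees; Python's `ast`
# module approximates that here, and hypothetical bug-report keywords are
# appended as additional context tokens.
import ast

def statement_subtrees(source: str):
    fn = ast.parse(source).body[0]
    return [ast.dump(stmt) for stmt in fn.body]  # one serialized subtree per statement

code = "def clamp(x, lo, hi):\n    x = max(x, lo)\n    return min(x, hi)"
bug_report_keywords = ["overflow", "boundary"]  # hypothetical mined context

model_input = statement_subtrees(code) + ["CTX:" + kw for kw in bug_report_keywords]
print(model_input)
```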
arXiv Detail & Related papers (2024-02-14T20:17:04Z)
- CodeTF: One-stop Transformer Library for State-of-the-art Code LLM [72.1638273937025]
We present CodeTF, an open-source Transformer-based library for state-of-the-art Code LLMs and code intelligence.
Our library supports a collection of pretrained Code LLM models and popular code benchmarks.
We hope CodeTF is able to bridge the gap between machine learning/generative AI and software engineering.
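CodeTF's own API is not reproduced here; the sketch below shows the same one-stop usage pattern via Hugging Face transformers with Salesforce's CodeT5 summarization checkpoint.

```python
# Stand-in for CodeTF's one-stop pattern: load a pretrained Code LLM and run
# code summarization, using Hugging Face `transformers` and CodeT5.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-base-multi-sum")
model = AutoModelForSeq2SeqLM.from_pretrained("Salesforce/codet5-base-multi-sum")

code = "def fib(n):\n    return n if n < 2 else fib(n - 1) + fib(n - 2)"
input_ids = tokenizer(code, return_tensors="pt").input_ids
summary_ids = model.generate(input_ids, max_length=32)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```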
arXiv Detail & Related papers (2023-05-31T05:24:48Z)
- Chatbots As Fluent Polyglots: Revisiting Breakthrough Code Snippets [0.0]
The research applies AI-driven code assistants to analyze a selection of influential computer code that has shaped modern technology.
The study's original contribution is an examination of half of the most significant code advances of the last 50 years.
arXiv Detail & Related papers (2023-01-05T23:17:17Z)
- Adding Context to Source Code Representations for Deep Learning [13.676416860721877]
We argue that it is beneficial for deep learning models to have access to additional contextual information about the code being analysed.
We present preliminary evidence that encoding context from the call hierarchy along with information from the code itself can improve the performance of a state-of-the-art deep learning model.
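The idea can be approximated with a toy caller lookup: find the functions that call a target within one module and feed their source in as extra context. The paper's tooling and encoder are considerably more elaborate.

```python
# Toy call-hierarchy context: collect the callers of a target function within
# a single module using Python's `ast` module.
import ast

source = """
def parse(line):
    return line.split(',')

def load(path):
    return [parse(l) for l in open(path)]
"""

def callers_of(src: str, target: str):
    tree = ast.parse(src)
    found = []
    for fn in [n for n in tree.body if isinstance(n, ast.FunctionDef)]:
        calls = [c for c in ast.walk(fn)
                 if isinstance(c, ast.Call)
                 and isinstance(c.func, ast.Name) and c.func.id == target]
        if calls:
            found.append(fn.name)
    return found

print(callers_of(source, "parse"))  # ['load'] -> feed load's source as context
```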
arXiv Detail & Related papers (2022-07-30T12:47:32Z)
- Ten Quick Tips for Deep Learning in Biology [116.78436313026478]
Machine learning is concerned with the development and applications of algorithms that can recognize patterns in data and use them for predictive modeling.
Deep learning has become its own subfield of machine learning.
In the context of biological research, deep learning has been increasingly used to derive novel insights from high-dimensional biological data.
arXiv Detail & Related papers (2021-05-29T21:02:44Z)
- Project CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks [11.10732802304274]
Project CodeNet consists of 14M code samples and about 500M lines of code in 55 different programming languages.
Project CodeNet is not only unique in its scale, but also in the diversity of coding tasks it can help benchmark.
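A hedged sketch for iterating the dataset follows. It assumes the released layout data/<problem_id>/<language>/<submission>.<ext>; consult the dataset's README, as the paths here are illustrative.

```python
# Assumed Project CodeNet layout: data/<problem_id>/<language>/<submission>.
# Verify against the dataset's README before relying on these paths.
from pathlib import Path

def iter_samples(root: str, language: str = "Python"):
    for problem_dir in sorted(Path(root, "data").iterdir()):
        lang_dir = problem_dir / language
        if problem_dir.is_dir() and lang_dir.is_dir():
            for submission in lang_dir.iterdir():
                yield problem_dir.name, submission.read_text(errors="ignore")

# for problem_id, code in iter_samples("Project_CodeNet"):
#     ...  # e.g., build a per-problem classification dataset
```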
arXiv Detail & Related papers (2021-05-25T00:13:29Z)
- Incorporating External Knowledge through Pre-training for Natural Language to Code Generation [97.97049697457425]
Open-domain code generation aims to generate code in a general-purpose programming language from natural language (NL) intents.
We explore the effectiveness of incorporating two varieties of external knowledge into NL-to-code generation: automatically mined NL-code pairs from the online programming QA forum StackOverflow and programming language API documentation.
Our evaluations show that combining the two sources with data augmentation and retrieval-based data re-sampling improves the current state-of-the-art by up to 2.2% absolute BLEU score on the code generation testbed CoNaLa.
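For reference, BLEU on generated code can be computed as sketched below with sacrebleu; CoNaLa's official scorer applies its own tokenization, so scores from this call are only indicative.

```python
# Indicative BLEU for generated code via sacrebleu (not CoNaLa's official scorer).
import sacrebleu

hyps = ["sorted(d.items(), key=lambda kv: kv[1])"]   # model outputs
refs = ["sorted(d.items(), key=lambda x: x[1])"]     # aligned references

score = sacrebleu.corpus_bleu(hyps, [refs])  # one reference stream
print(f"BLEU = {score.score:.1f}")
```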
arXiv Detail & Related papers (2020-04-20T01:45:27Z)
- Machine Learning in Python: Main developments and technology trends in data science, machine learning, and artificial intelligence [3.1314898234563295]
Python continues to be the most preferred language for scientific computing, data science, and machine learning.
This survey offers insight into the field of machine learning with Python, taking a tour through important topics to identify some of the core hardware and software paradigms that have enabled it.
arXiv Detail & Related papers (2020-02-12T05:20:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences arising from its use.