Project CodeNet: A Large-Scale AI for Code Dataset for Learning a
Diversity of Coding Tasks
- URL: http://arxiv.org/abs/2105.12655v1
- Date: Tue, 25 May 2021 00:13:29 GMT
- Title: Project CodeNet: A Large-Scale AI for Code Dataset for Learning a
Diversity of Coding Tasks
- Authors: Ruchir Puri, David S. Kung, Geert Janssen, Wei Zhang, Giacomo
Domeniconi, Vladimir Zolotov, Julian Dolby, Jie Chen, Mihir Choudhury, Lindsey
Decker, Veronika Thost, Luca Buratti, Saurabh Pujar, Ulrich Finkler
- Abstract summary: Project CodeNet consists of 14M code samples and about 500M lines of code in 55 different programming languages.
Project CodeNet is not only unique in its scale, but also in the diversity of coding tasks it can help benchmark.
- Score: 11.10732802304274
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Advancements in deep learning and machine learning algorithms have enabled
breakthrough progress in computer vision, speech recognition, natural language
processing and beyond. In addition, over the last several decades, software has
been built into the fabric of every aspect of our society. Together, these two
trends have generated new interest in the fast-emerging research area of AI for
Code. As software development becomes ubiquitous across all industries and code
infrastructure of enterprise legacy applications ages, it is more critical than
ever to increase software development productivity and modernize legacy
applications. Over the last decade, datasets like ImageNet, with its large
scale and diversity, have played a pivotal role in algorithmic advancements
from computer vision to language and speech understanding. In this paper, we
present Project CodeNet, a first-of-its-kind, very large scale, diverse, and
high-quality dataset to accelerate the algorithmic advancements in AI for Code.
It consists of 14M code samples and about 500M lines of code in 55 different
programming languages. Project CodeNet is not only unique in its scale, but
also in the diversity of coding tasks it can help benchmark: from code
similarity and classification for advances in code recommendation algorithms,
and code translation between a large variety of programming languages, to advances
in code performance (both runtime and memory) improvement techniques. CodeNet
also provides sample input and output test sets for over 7M code samples, which
can be critical for determining code equivalence in different languages. As a
usability feature, we provide several preprocessing tools in Project CodeNet to
transform source codes into representations that can be readily used as inputs
into machine learning models.
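The abstract's point about using sample inputs and outputs to judge code equivalence can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the function names, the two toy submissions, and the sample inputs are illustrative assumptions and are not taken from the CodeNet tooling. Agreement on sample inputs is only evidence of equivalence, not proof.

```python
from typing import Callable, Iterable


def behaves_the_same(
    candidate_a: Callable[[str], str],
    candidate_b: Callable[[str], str],
    sample_inputs: Iterable[str],
) -> bool:
    """Return True if both candidates give identical output on every sample input."""
    return all(
        candidate_a(text).strip() == candidate_b(text).strip()
        for text in sample_inputs
    )


# Two hypothetical submissions to the same problem:
# read whitespace-separated integers and print their sum.
def sum_with_builtin(stdin_text: str) -> str:
    return str(sum(int(tok) for tok in stdin_text.split()))


def sum_with_loop(stdin_text: str) -> str:
    total = 0
    for tok in stdin_text.split():
        total += int(tok)
    return str(total)


# Hypothetical sample inputs standing in for a problem's test set.
samples = ["1 2 3", "10 -4", "7"]
print(behaves_the_same(sum_with_builtin, sum_with_loop, samples))
```

In practice the two candidates would be programs in different languages run as subprocesses on the same input files; plain functions are used here only to keep the sketch self-contained.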
Related papers
- CodeGRAG: Extracting Composed Syntax Graphs for Retrieval Augmented Cross-Lingual Code Generation [60.799992690487336]
We propose Syntax Graph Retrieval Augmented Code Generation (CodeGRAG) to enhance the performance of LLMs in single-round code generation tasks.
CodeGRAG significantly improves the code generation ability of LLMs and can even offer performance gain for cross-lingual code generation.
arXiv Detail & Related papers (2024-05-03T02:48:55Z)
- CodeIP: A Grammar-Guided Multi-Bit Watermark for Large Language Models of Code [59.32609948217718]
We present CodeIP, a new watermarking technique for Large Language Model (LLM)-based code generation.
CodeIP enables the insertion of multi-bit information while preserving the semantics of the generated code.
arXiv Detail & Related papers (2024-04-24T04:25:04Z)
- CodePori: Large Scale Model for Autonomous Software Development by Using Multi-Agents [3.8066447473175304]
Large Language Models (LLMs) and Generative Pre-trained Transformers (GPTs) are reshaping the field of Software Engineering (SE).
This paper introduces CodePori, a novel model designed to automate code generation for extensive and complex software projects based on natural language prompts.
We show in the paper that CodePori is able to generate running code for large-scale projects, completing the entire software development process in minutes rather than hours, and at a cost of a few dollars.
arXiv Detail & Related papers (2024-02-02T13:42:50Z)
- Deep Learning for Code Intelligence: Survey, Benchmark and Toolkit [63.82016263181941]
Code intelligence leverages machine learning techniques to extract knowledge from extensive code corpora.
Currently, there is already a thriving research community focusing on code intelligence.
arXiv Detail & Related papers (2023-12-30T17:48:37Z)
- TransformCode: A Contrastive Learning Framework for Code Embedding via Subtree Transformation [9.477734501499274]
We present TransformCode, a novel framework that learns code embeddings in a contrastive learning manner.
Our framework is encoder-agnostic and language-agnostic, which means that it can leverage any encoder model and handle any programming language.
arXiv Detail & Related papers (2023-11-10T09:05:23Z)
- Leveraging Generative AI: Improving Software Metadata Classification with Generated Code-Comment Pairs [0.0]
In software development, code comments play a crucial role in enhancing code comprehension and collaboration.
This research paper addresses the challenge of objectively classifying code comments as "Useful" or "Not Useful".
We propose a novel solution that harnesses contextualized embeddings, particularly BERT, to automate this classification process.
arXiv Detail & Related papers (2023-10-14T12:09:43Z)
- A Comparative Study of Code Generation using ChatGPT 3.5 across 10 Programming Languages [0.0]
Large Language Models (LLMs) are advanced Artificial Intelligence (AI) systems that have undergone extensive training.
This research investigates the coding proficiency of ChatGPT 3.5, an LLM released by OpenAI in November 2022.
The skill of the model in creating code snippets is evaluated across 10 programming languages and 4 software domains.
arXiv Detail & Related papers (2023-08-08T15:02:32Z)
- Chatbots As Fluent Polyglots: Revisiting Breakthrough Code Snippets [0.0]
The research applies AI-driven code assistants to analyze a selection of influential computer code that has shaped modern technology.
The original contribution of this study was to examine half of the most significant code advances in the last 50 years.
arXiv Detail & Related papers (2023-01-05T23:17:17Z)
- Enhancing Semantic Code Search with Multimodal Contrastive Learning and Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search.
We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
arXiv Detail & Related papers (2022-04-07T08:49:27Z)
- ReACC: A Retrieval-Augmented Code Completion Framework [53.49707123661763]
We propose a retrieval-augmented code completion framework, leveraging both lexical copying and referring to code with similar semantics by retrieval.
We evaluate our approach on the code completion task in the Python and Java programming languages, achieving state-of-the-art performance on the CodeXGLUE benchmark.
arXiv Detail & Related papers (2022-03-15T08:25:08Z)
- Measuring Coding Challenge Competence With APPS [54.22600767666257]
We introduce APPS, a benchmark for code generation.
Our benchmark includes 10,000 problems, which range from having simple one-line solutions to being substantial algorithmic challenges.
Recent models such as GPT-Neo can pass approximately 15% of the test cases of introductory problems.
arXiv Detail & Related papers (2021-05-20T17:58:42Z)