Project CodeNet: A Large-Scale AI for Code Dataset for Learning a
Diversity of Coding Tasks
- URL: http://arxiv.org/abs/2105.12655v1
- Date: Tue, 25 May 2021 00:13:29 GMT
- Authors: Ruchir Puri, David S. Kung, Geert Janssen, Wei Zhang, Giacomo
Domeniconi, Vladimir Zolotov, Julian Dolby, Jie Chen, Mihir Choudhury, Lindsey
Decker, Veronika Thost, Luca Buratti, Saurabh Pujar, Ulrich Finkler
- Abstract summary: Project CodeNet consists of 14M code samples and about 500M lines of code in 55 different programming languages.
Project CodeNet is not only unique in its scale, but also in the diversity of coding tasks it can help benchmark.
- Score: 11.10732802304274
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Advancements in deep learning and machine learning algorithms have enabled
breakthrough progress in computer vision, speech recognition, natural language
processing and beyond. In addition, over the last several decades, software has
been built into the fabric of every aspect of our society. Together, these two
trends have generated new interest in the fast-emerging research area of AI for
Code. As software development becomes ubiquitous across all industries and code
infrastructure of enterprise legacy applications ages, it is more critical than
ever to increase software development productivity and modernize legacy
applications. Over the last decade, datasets like ImageNet, with its large
scale and diversity, have played a pivotal role in algorithmic advancements
from computer vision to language and speech understanding. In this paper, we
present Project CodeNet, a first-of-its-kind, very large scale, diverse, and
high-quality dataset to accelerate the algorithmic advancements in AI for Code.
It consists of 14M code samples and about 500M lines of code in 55 different
programming languages. Project CodeNet is not only unique in its scale, but
also in the diversity of coding tasks it can help benchmark: from code
similarity and classification for advances in code recommendation algorithms,
and code translation between a large variety of programming languages, to advances
in code performance (both runtime, and memory) improvement techniques. CodeNet
also provides sample input and output test sets for over 7M code samples, which
can be critical for determining code equivalence in different languages. As a
usability feature, we provide several preprocessing tools in Project CodeNet to
transform source code into representations that can be readily used as inputs
into machine learning models.
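The abstract notes that sample input/output test sets can help determine whether two code samples in different languages are functionally equivalent. A minimal sketch of that idea, assuming nothing about CodeNet's own tooling: run each candidate solution on the sample input and compare normalized outputs. The command lists and the whitespace-normalization policy here are illustrative assumptions, not part of the dataset.

```python
import subprocess

def run_on_sample(cmd, sample_input):
    """Run a candidate solution (given as an argv list) on a sample input
    and capture its stdout as text."""
    result = subprocess.run(cmd, input=sample_input, capture_output=True,
                            text=True, timeout=10)
    return result.stdout

def normalize(output):
    """Trailing whitespace and final newlines often differ across languages,
    so compare lists of right-stripped lines instead of raw strings."""
    return [line.rstrip() for line in output.strip().splitlines()]

def likely_equivalent(cmd_a, cmd_b, sample_input):
    """Two submissions that agree on the sample I/O pair are candidates for
    functional equivalence (agreement on one sample is evidence, not proof)."""
    out_a = normalize(run_on_sample(cmd_a, sample_input))
    out_b = normalize(run_on_sample(cmd_b, sample_input))
    return out_a == out_b
```

For example, two Python one-liners that both double an integer read from stdin would pass this check; in practice one would compare, say, a C++ binary against a Python script using the problem's published sample pairs.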
Related papers
- Leveraging Large Language Models for Code Translation and Software Development in Scientific Computing [0.9668407688201359]
Generative artificial intelligence (GenAI) is poised to transform productivity in scientific computing.
We developed a tool, CodeScribe, which combines prompt engineering with user supervision to establish an efficient process for code conversion.
We also address the challenges of AI-driven code translation and highlight its benefits for enhancing productivity in scientific computing.
arXiv Detail & Related papers (2024-10-31T16:48:41Z) - CodeGRAG: Bridging the Gap between Natural Language and Programming Language via Graphical Retrieval Augmented Generation [58.84212778960507]
We propose CodeGRAG, a Graphical Retrieval Augmented Code Generation framework to enhance the performance of LLMs.
CodeGRAG builds a graphical view of code blocks based on their control flow and data flow to bridge the gap between programming languages and natural language.
Experiments and ablations on four datasets, covering both C++ and Python, validate the hard meta-graph prompt, the soft prompting technique, and the effectiveness of the objectives for the pretrained GNN expert.
arXiv Detail & Related papers (2024-05-03T02:48:55Z) - CodeIP: A Grammar-Guided Multi-Bit Watermark for Large Language Models of Code [56.019447113206006]
Large Language Models (LLMs) have achieved remarkable progress in code generation.
CodeIP is a novel multi-bit watermarking technique that embeds additional information to preserve provenance details.
Experiments conducted on a real-world dataset across five programming languages demonstrate the effectiveness of CodeIP.
arXiv Detail & Related papers (2024-04-24T04:25:04Z) - Deep Learning for Code Intelligence: Survey, Benchmark and Toolkit [63.82016263181941]
Code intelligence leverages machine learning techniques to extract knowledge from extensive code corpora.
A thriving research community is already focused on code intelligence.
arXiv Detail & Related papers (2023-12-30T17:48:37Z) - TransformCode: A Contrastive Learning Framework for Code Embedding via Subtree Transformation [9.477734501499274]
We present TransformCode, a novel framework that learns code embeddings in a contrastive learning manner.
Our framework is encoder-agnostic and language-agnostic, which means that it can leverage any encoder model and handle any programming language.
arXiv Detail & Related papers (2023-11-10T09:05:23Z) - Leveraging Generative AI: Improving Software Metadata Classification
with Generated Code-Comment Pairs [0.0]
In software development, code comments play a crucial role in enhancing code comprehension and collaboration.
This research paper addresses the challenge of objectively classifying code comments as "Useful" or "Not Useful".
We propose a novel solution that harnesses contextualized embeddings, particularly BERT, to automate this classification process.
arXiv Detail & Related papers (2023-10-14T12:09:43Z) - A Comparative Study of Code Generation using ChatGPT 3.5 across 10
Programming Languages [0.0]
Large Language Models (LLMs) are advanced Artificial Intelligence (AI) systems that have undergone extensive training.
This research investigates the coding proficiency of ChatGPT 3.5, an LLM released by OpenAI in November 2022.
The model's skill in creating code snippets is evaluated across 10 programming languages and 4 software domains.
arXiv Detail & Related papers (2023-08-08T15:02:32Z) - Chatbots As Fluent Polyglots: Revisiting Breakthrough Code Snippets [0.0]
The research applies AI-driven code assistants to analyze a selection of influential computer code that has shaped modern technology.
The original contribution of this study was to examine half of the most significant code advances in the last 50 years.
arXiv Detail & Related papers (2023-01-05T23:17:17Z) - Enhancing Semantic Code Search with Multimodal Contrastive Learning and
Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search.
We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
arXiv Detail & Related papers (2022-04-07T08:49:27Z) - ReACC: A Retrieval-Augmented Code Completion Framework [53.49707123661763]
We propose a retrieval-augmented code completion framework, leveraging both lexical copying and referring to code with similar semantics by retrieval.
We evaluate our approach in the code completion task in Python and Java programming languages, achieving a state-of-the-art performance on CodeXGLUE benchmark.
arXiv Detail & Related papers (2022-03-15T08:25:08Z) - Measuring Coding Challenge Competence With APPS [54.22600767666257]
We introduce APPS, a benchmark for code generation.
Our benchmark includes 10,000 problems, which range from having simple one-line solutions to being substantial algorithmic challenges.
Recent models such as GPT-Neo can pass approximately 15% of the test cases of introductory problems.
arXiv Detail & Related papers (2021-05-20T17:58:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences.