COFO: COdeFOrces dataset for Program Classification, Recognition and Tagging
- URL: http://arxiv.org/abs/2503.18251v1
- Date: Mon, 24 Mar 2025 00:29:43 GMT
- Title: COFO: COdeFOrces dataset for Program Classification, Recognition and Tagging
- Authors: Kuldeep Gautam, S. VenkataKeerthy, Ramakrishna Upadrasta
- Abstract summary: COFO is a dataset consisting of 809 classes/problems with a total of 369K source codes written in C, C++, Java, and Python programming languages. We envision that this dataset can be useful for solving machine learning-based problems like program classification/recognition, tagging, predicting program properties, and code comprehension.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, a lot of technological advances in computer science have aided software programmers to create innovative and real-time user-friendly software. With the creation of the software and the urging interest of people to learn to write software, there is a large collection of source codes that can be found on the web, also known as Big Code, which can be used as a source of data for driving the machine learning applications tending to solve certain software engineering problems. In this paper, we present COFO, a dataset consisting of 809 classes/problems with a total of 369K source codes written in C, C++, Java, and Python programming languages, along with other metadata such as code tags, problem specification, and input-output specifications. COFO has been scraped from the openly available Codeforces website using a selenium-beautifulsoup-python based scraper. We envision that this dataset can be useful for solving machine learning-based problems like program classification/recognition, tagging, predicting program properties, and code comprehension.
Related papers
- Deep Learning for Code Intelligence: Survey, Benchmark and Toolkit [63.82016263181941]
Code intelligence leverages machine learning techniques to extract knowledge from extensive code corpora.
Currently, there is already a thriving research community focusing on code intelligence.
arXiv Detail & Related papers (2023-12-30T17:48:37Z)
- Leveraging Generative AI: Improving Software Metadata Classification with Generated Code-Comment Pairs [0.0]
In software development, code comments play a crucial role in enhancing code comprehension and collaboration.
This research paper addresses the challenge of objectively classifying code comments as "Useful" or "Not Useful".
We propose a novel solution that harnesses contextualized embeddings, particularly BERT, to automate this classification process.
arXiv Detail & Related papers (2023-10-14T12:09:43Z)
- Using Machine Learning To Identify Software Weaknesses From Software Requirement Specifications [49.1574468325115]
This research focuses on finding an efficient machine learning algorithm to identify software weaknesses from requirement specifications.
Keywords extracted using latent semantic analysis help map the CWE categories to PROMISE_exp. Naive Bayes, support vector machine (SVM), decision trees, neural network, and convolutional neural network (CNN) algorithms were tested.
arXiv Detail & Related papers (2023-08-10T13:19:10Z)
- Tag Prediction of Competitive Programming Problems using Deep Learning Techniques [0.0]
A well-liked method for developing programming abilities is competitive programming.
It can be tough for novices and even veteran programmers to traverse the wide collection of questions.
This can be done using automated tagging of the questions using Text Classification.
arXiv Detail & Related papers (2023-08-03T16:39:02Z)
- Code Smells for Machine Learning Applications [6.759291241573661]
There is a lack of guidelines for code quality in machine learning applications.
This paper proposes and identifies a list of 22 machine learning-specific code smells collected from various sources.
We pinpoint each smell with a description of its context, potential issues in the long run, and proposed solutions.
arXiv Detail & Related papers (2022-03-25T16:23:02Z)
- Project CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks [11.10732802304274]
Project CodeNet consists of 14M code samples and about 500M lines of code in 55 different programming languages.
Project CodeNet is not only unique in its scale, but also in the diversity of coding tasks it can help benchmark.
arXiv Detail & Related papers (2021-05-25T00:13:29Z)
- Measuring Coding Challenge Competence With APPS [54.22600767666257]
We introduce APPS, a benchmark for code generation.
Our benchmark includes 10,000 problems, which range from having simple one-line solutions to being substantial algorithmic challenges.
Recent models such as GPT-Neo can pass approximately 15% of the test cases of introductory problems.
arXiv Detail & Related papers (2021-05-20T17:58:42Z)
- CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation [72.90209988513995]
CodeXGLUE is a benchmark dataset to foster machine learning research for program understanding and generation.
CodeXGLUE includes a collection of 10 tasks across 14 datasets and a platform for model evaluation and comparison.
arXiv Detail & Related papers (2021-02-09T06:16:25Z)
- Machine Learning for Software Engineering: A Systematic Mapping [73.30245214374027]
The software development industry is rapidly adopting machine learning for transitioning modern day software systems towards highly intelligent and self-learning systems.
No comprehensive study exists that explores the current state-of-the-art on the adoption of machine learning across software engineering life cycle stages.
This study introduces a machine learning for software engineering (MLSE) taxonomy classifying the state-of-the-art machine learning techniques according to their applicability to various software engineering life cycle stages.
arXiv Detail & Related papers (2020-05-27T11:56:56Z)
- Incorporating External Knowledge through Pre-training for Natural Language to Code Generation [97.97049697457425]
Open-domain code generation aims to generate code in a general-purpose programming language from natural language (NL) intents.
We explore the effectiveness of incorporating two varieties of external knowledge into NL-to-code generation: automatically mined NL-code pairs from the online programming QA forum StackOverflow and programming language API documentation.
Our evaluations show that combining the two sources with data augmentation and retrieval-based data re-sampling improves the current state-of-the-art by up to 2.2% absolute BLEU score on the code generation testbed CoNaLa.
arXiv Detail & Related papers (2020-04-20T01:45:27Z)
- Machine Learning in Python: Main developments and technology trends in data science, machine learning, and artificial intelligence [3.1314898234563295]
Python continues to be the most preferred language for scientific computing, data science, and machine learning.
This survey offers insight into the field of machine learning with Python, taking a tour through important topics to identify some of the core hardware and software paradigms that have enabled it.
arXiv Detail & Related papers (2020-02-12T05:20:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.