PENTACET data -- 23 Million Contextual Code Comments and 250,000 SATD
comments
- URL: http://arxiv.org/abs/2303.14029v2
- Date: Fri, 11 Aug 2023 13:40:46 GMT
- Title: PENTACET data -- 23 Million Contextual Code Comments and 250,000 SATD
comments
- Authors: Murali Sridharan, Leevi Rantala, Mika M\"antyl\"a
- Abstract summary: Most Self-Admitted Technical Debt (SATD) research uses explicit SATD features such as 'TODO' and 'FIXME' for SATD detection.
This work addresses this gap through PENTACET (or 5C dataset) data.
The outcome is a dataset with 23 million code comments, preceding and succeeding source code context for each comment, and more than 250,000 comments labeled as SATD.
- Score: 3.6095388702618414
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Most Self-Admitted Technical Debt (SATD) research utilizes explicit SATD
features such as 'TODO' and 'FIXME' for SATD detection. A closer look reveals
several SATD research uses simple SATD ('Easy to Find') code comments without
the contextual data (preceding and succeeding source code context). This work
addresses this gap through PENTACET (or 5C dataset) data. PENTACET is a large
Curated Contextual Code Comments per Contributor and the most extensive SATD
data. We mine 9,096 Open Source Software Java projects with a total of 435
million LOC. The outcome is a dataset with 23 million code comments, preceding
and succeeding source code context for each comment, and more than 250,000
comments labeled as SATD, including both 'Easy to Find' and 'Hard to Find'
SATD. We believe PENTACET data will further SATD research using Artificial
Intelligence techniques.
Related papers
- Deep Learning and Data Augmentation for Detecting Self-Admitted Technical Debt [6.004718679054704]
Self-Admitted Technical Debt (SATD) refers to circumstances where developers use textual artifacts to explain why the existing implementation is not optimal.
We build on earlier research by utilizing BiLSTM architecture for the binary identification of SATD and BERT architecture for categorizing different types of SATD.
We introduce a two-step approach to identify and categorize SATD across various datasets derived from different artifacts.
arXiv Detail & Related papers (2024-10-21T09:22:16Z) - SATDAUG -- A Balanced and Augmented Dataset for Detecting Self-Admitted
Technical Debt [6.699060157800401]
Self-admitted technical debt (SATD) refers to a form of technical debt in which developers explicitly acknowledge and document the existence of technical shortcuts.
We share the textitSATDAUG dataset, an augmented version of existing SATD datasets, including source code comments, issue tracker, pull requests, and commit messages.
arXiv Detail & Related papers (2024-03-12T14:33:53Z) - Explaining SAT Solving Using Causal Reasoning [30.469229388827443]
We introduce CausalSAT, which employs causal reasoning to gain insights into the functioning of modern SAT solvers.
We use CausalSAT to quantitatively verify hypotheses previously regarded as "rules of thumb" or empirical findings.
arXiv Detail & Related papers (2023-06-09T22:53:16Z) - UNITE: A Unified Benchmark for Text-to-SQL Evaluation [72.72040379293718]
We introduce a UNIfied benchmark for Text-to-domain systems.
It is composed of publicly available text-to-domain datasets and 29K databases.
Compared to the widely used Spider benchmark, we introduce a threefold increase in SQL patterns.
arXiv Detail & Related papers (2023-05-25T17:19:52Z) - Can LLM Already Serve as A Database Interface? A BIg Bench for
Large-Scale Database Grounded Text-to-SQLs [89.68522473384522]
We present Bird, a big benchmark for large-scale database grounded in text-to-efficient tasks.
Our emphasis on database values highlights the new challenges of dirty database contents.
Even the most effective text-to-efficient models, i.e. ChatGPT, achieves only 40.08% in execution accuracy.
arXiv Detail & Related papers (2023-05-04T19:02:29Z) - DS-1000: A Natural and Reliable Benchmark for Data Science Code
Generation [70.96868419971756]
DS-1000 is a code generation benchmark with a thousand data science problems spanning seven Python libraries.
First, our problems reflect diverse, realistic, and practical use cases since we collected them from StackOverflow.
Second, our automatic evaluation is highly specific (reliable) -- across all Codex-predicted solutions that our evaluation accept, only 1.8% of them are incorrect.
arXiv Detail & Related papers (2022-11-18T17:20:27Z) - Estimating the hardness of SAT encodings for Logical Equivalence
Checking of Boolean circuits [58.83758257568434]
We show that the hardness of SAT encodings for LEC instances can be estimated textitw.r.t some SAT partitioning.
The paper proposes several methods for constructing partitionings, which, when used in practice, allow one to estimate the hardness of SAT encodings for LEC with good accuracy.
arXiv Detail & Related papers (2022-10-04T09:19:13Z) - Automatic Identification of Self-Admitted Technical Debt from Four
Different Sources [3.446864074238136]
Technical debt refers to taking shortcuts to achieve short-term goals while sacrificing the long-term maintainability and evolvability of software systems.
Previous work has focused on identifying SATD from source code comments and issue trackers.
We propose and evaluate an approach for automated SATD identification that integrates four sources: source code comments, commit messages, pull requests, and issue tracking systems.
arXiv Detail & Related papers (2022-02-04T20:59:25Z) - Identifying Self-Admitted Technical Debt in Issue Tracking Systems using
Machine Learning [3.446864074238136]
Technical debt is a metaphor for sub-optimal solutions implemented for short-term benefits.
Most work on identifying Self-Admitted Technical Debt focuses on source code comments.
We propose and optimize an approach for automatically identifying SATD in issue tracking systems using machine learning.
arXiv Detail & Related papers (2022-02-04T15:15:13Z) - CoSQA: 20,000+ Web Queries for Code Search and Question Answering [63.92224685262063]
CoSQA dataset includes 20,604 labels for pairs of natural language queries and codes.
We introduce a contrastive learning method dubbed CoCLR to enhance query-code matching.
We show that evaluated on CodeXGLUE with the same CodeBERT model, training on CoSQA improves the accuracy of code question answering by 5.1%.
arXiv Detail & Related papers (2021-05-27T15:37:21Z) - COSEA: Convolutional Code Search with Layer-wise Attention [90.35777733464354]
We propose a new deep learning architecture, COSEA, which leverages convolutional neural networks with layer-wise attention to capture the code's intrinsic structural logic.
COSEA can achieve significant improvements over state-of-the-art methods on code search tasks.
arXiv Detail & Related papers (2020-10-19T13:53:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.