Related papers: PENTACET data -- 23 Million Contextual Code Comments and 250,000 SATD comments

PENTACET data -- 23 Million Contextual Code Comments and 250,000 SATD comments

URL: http://arxiv.org/abs/2303.14029v2
Date: Fri, 11 Aug 2023 13:40:46 GMT
Title: PENTACET data -- 23 Million Contextual Code Comments and 250,000 SATD comments
Authors: Murali Sridharan, Leevi Rantala, Mika M\"antyl\"a
Abstract summary: Most Self-Admitted Technical Debt (SATD) research uses explicit SATD features such as 'TODO' and 'FIXME' for SATD detection. This work addresses this gap through PENTACET (or 5C dataset) data. The outcome is a dataset with 23 million code comments, preceding and succeeding source code context for each comment, and more than 250,000 comments labeled as SATD.
Score: 3.6095388702618414
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Most Self-Admitted Technical Debt (SATD) research utilizes explicit SATD features such as 'TODO' and 'FIXME' for SATD detection. A closer look reveals several SATD research uses simple SATD ('Easy to Find') code comments without the contextual data (preceding and succeeding source code context). This work addresses this gap through PENTACET (or 5C dataset) data. PENTACET is a large Curated Contextual Code Comments per Contributor and the most extensive SATD data. We mine 9,096 Open Source Software Java projects with a total of 435 million LOC. The outcome is a dataset with 23 million code comments, preceding and succeeding source code context for each comment, and more than 250,000 comments labeled as SATD, including both 'Easy to Find' and 'Hard to Find' SATD. We believe PENTACET data will further SATD research using Artificial Intelligence techniques.

Related papers

Descriptor: C++ Self-Admitted Technical Debt Dataset (CppSATD) [4.114847619719728]
Self-Admitted Technical Debt (SATD) is a sub-type of technical debt (TD)<n>Previous research on SATD has focused predominantly on the Java programming language.<n>We introduce CppSATD, a dedicated C++ SATD dataset, comprising over 531,000 annotated comments and their source code contexts.
arXiv Detail & Related papers (2025-05-02T09:25:41Z)
Deep Learning and Data Augmentation for Detecting Self-Admitted Technical Debt [6.004718679054704]
Self-Admitted Technical Debt (SATD) refers to circumstances where developers use textual artifacts to explain why the existing implementation is not optimal. We build on earlier research by utilizing BiLSTM architecture for the binary identification of SATD and BERT architecture for categorizing different types of SATD. We introduce a two-step approach to identify and categorize SATD across various datasets derived from different artifacts.
arXiv Detail & Related papers (2024-10-21T09:22:16Z)
SATDAUG -- A Balanced and Augmented Dataset for Detecting Self-Admitted Technical Debt [6.699060157800401]
Self-admitted technical debt (SATD) refers to a form of technical debt in which developers explicitly acknowledge and document the existence of technical shortcuts. We share the textitSATDAUG dataset, an augmented version of existing SATD datasets, including source code comments, issue tracker, pull requests, and commit messages.
arXiv Detail & Related papers (2024-03-12T14:33:53Z)
One Model to Rule them All: Towards Universal Segmentation for Medical Images with Text Prompts [62.55349777609194]
We construct the first multi-modal knowledge tree on human anatomy, including 6502 anatomical terminologies. We build up the largest and most comprehensive segmentation dataset for training, by collecting over 22K 3D medical image scans.
arXiv Detail & Related papers (2023-12-28T18:16:00Z)
Explaining SAT Solving Using Causal Reasoning [30.469229388827443]
We introduce CausalSAT, which employs causal reasoning to gain insights into the functioning of modern SAT solvers. We use CausalSAT to quantitatively verify hypotheses previously regarded as "rules of thumb" or empirical findings.
arXiv Detail & Related papers (2023-06-09T22:53:16Z)
UNITE: A Unified Benchmark for Text-to-SQL Evaluation [72.72040379293718]
We introduce a UNIfied benchmark for Text-to-domain systems. It is composed of publicly available text-to-domain datasets and 29K databases. Compared to the widely used Spider benchmark, we introduce a threefold increase in SQL patterns.
arXiv Detail & Related papers (2023-05-25T17:19:52Z)
Can LLM Already Serve as A Database Interface? A BIg Bench for Large-Scale Database Grounded Text-to-SQLs [89.68522473384522]
We present Bird, a big benchmark for large-scale database grounded in text-to-efficient tasks. Our emphasis on database values highlights the new challenges of dirty database contents. Even the most effective text-to-efficient models, i.e. ChatGPT, achieves only 40.08% in execution accuracy.
arXiv Detail & Related papers (2023-05-04T19:02:29Z)
DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation [70.96868419971756]
DS-1000 is a code generation benchmark with a thousand data science problems spanning seven Python libraries. First, our problems reflect diverse, realistic, and practical use cases since we collected them from StackOverflow. Second, our automatic evaluation is highly specific (reliable) -- across all Codex-predicted solutions that our evaluation accept, only 1.8% of them are incorrect.
arXiv Detail & Related papers (2022-11-18T17:20:27Z)
Estimating the hardness of SAT encodings for Logical Equivalence Checking of Boolean circuits [58.83758257568434]
We show that the hardness of SAT encodings for LEC instances can be estimated textitw.r.t some SAT partitioning. The paper proposes several methods for constructing partitionings, which, when used in practice, allow one to estimate the hardness of SAT encodings for LEC with good accuracy.
arXiv Detail & Related papers (2022-10-04T09:19:13Z)
Automatic Identification of Self-Admitted Technical Debt from Four Different Sources [3.446864074238136]
Technical debt refers to taking shortcuts to achieve short-term goals while sacrificing the long-term maintainability and evolvability of software systems. Previous work has focused on identifying SATD from source code comments and issue trackers. We propose and evaluate an approach for automated SATD identification that integrates four sources: source code comments, commit messages, pull requests, and issue tracking systems.
arXiv Detail & Related papers (2022-02-04T20:59:25Z)
Identifying Self-Admitted Technical Debt in Issue Tracking Systems using Machine Learning [3.446864074238136]
Technical debt is a metaphor for sub-optimal solutions implemented for short-term benefits. Most work on identifying Self-Admitted Technical Debt focuses on source code comments. We propose and optimize an approach for automatically identifying SATD in issue tracking systems using machine learning.
arXiv Detail & Related papers (2022-02-04T15:15:13Z)
CoSQA: 20,000+ Web Queries for Code Search and Question Answering [63.92224685262063]
CoSQA dataset includes 20,604 labels for pairs of natural language queries and codes. We introduce a contrastive learning method dubbed CoCLR to enhance query-code matching. We show that evaluated on CodeXGLUE with the same CodeBERT model, training on CoSQA improves the accuracy of code question answering by 5.1%.
arXiv Detail & Related papers (2021-05-27T15:37:21Z)
COSEA: Convolutional Code Search with Layer-wise Attention [90.35777733464354]
We propose a new deep learning architecture, COSEA, which leverages convolutional neural networks with layer-wise attention to capture the code's intrinsic structural logic. COSEA can achieve significant improvements over state-of-the-art methods on code search tasks.
arXiv Detail & Related papers (2020-10-19T13:53:38Z)

This list is automatically generated from the titles and abstracts of the papers in this site.