Descriptor: C++ Self-Admitted Technical Debt Dataset (CppSATD)
- URL: http://arxiv.org/abs/2505.01136v2
- Date: Sun, 01 Jun 2025 12:49:01 GMT
- Title: Descriptor: C++ Self-Admitted Technical Debt Dataset (CppSATD)
- Authors: Phuoc Pham, Murali Sridharan, Matteo Esposito, Valentina Lenarduzzi,
- Abstract summary: Self-Admitted Technical Debt (SATD) is a sub-type of technical debt (TD)<n>Previous research on SATD has focused predominantly on the Java programming language.<n>We introduce CppSATD, a dedicated C++ SATD dataset, comprising over 531,000 annotated comments and their source code contexts.
- Score: 4.114847619719728
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In software development, technical debt (TD) refers to suboptimal implementation choices made by the developers to meet urgent deadlines and limited resources, posing challenges for future maintenance. Self-Admitted Technical Debt (SATD) is a sub-type of TD, representing specific TD instances ``openly admitted'' by the developers and often expressed through source code comments. Previous research on SATD has focused predominantly on the Java programming language, revealing a significant gap in cross-language SATD. Such a narrow focus limits the generalizability of existing findings as well as SATD detection techniques across multiple programming languages. Our work addresses such limitation by introducing CppSATD, a dedicated C++ SATD dataset, comprising over 531,000 annotated comments and their source code contexts. Our dataset can serve as a foundation for future studies that aim to develop SATD detection methods in C++, generalize the existing findings to other languages, or contribute novel insights to cross-language SATD research.
Related papers
- Improving the detection of technical debt in Java source code with an enriched dataset [12.07607688189035]
Technical debt (TD) is the additional work and costs that emerge when developers opt for a quick and easy solution to a problem.
Recent research has focused on detecting Self-Admitted Technical Debts (SATDs) by analyzing comments embedded in source code.
We curated the first ever dataset of TD identified by code comments, coupled with its associated source code.
arXiv Detail & Related papers (2024-11-08T10:12:33Z) - Deep Learning and Data Augmentation for Detecting Self-Admitted Technical Debt [6.004718679054704]
Self-Admitted Technical Debt (SATD) refers to circumstances where developers use textual artifacts to explain why the existing implementation is not optimal.
We build on earlier research by utilizing BiLSTM architecture for the binary identification of SATD and BERT architecture for categorizing different types of SATD.
We introduce a two-step approach to identify and categorize SATD across various datasets derived from different artifacts.
arXiv Detail & Related papers (2024-10-21T09:22:16Z) - Navigating Text-to-Image Generative Bias across Indic Languages [53.92640848303192]
This research investigates biases in text-to-image (TTI) models for the Indic languages widely spoken across India.
It evaluates and compares the generative performance and cultural relevance of leading TTI models in these languages against their performance in English.
arXiv Detail & Related papers (2024-08-01T04:56:13Z) - An Exploratory Study of the Relationship between SATD and Other Software Development Activities [13.026170714454071]
Self-Admitted Technical Debt (SATD) is a specific type of Technical Debt that involves documenting code to remind developers of its debt.
Previous research has explored various aspects of SATD, including methods, distribution, and its impact on software quality.
This study investigates the relationship between removing and adding SATD and activities such as bug fixing, adding new features, and testing.
arXiv Detail & Related papers (2024-04-02T13:45:42Z) - Mind the Error! Detection and Localization of Instruction Errors in Vision-and-Language Navigation [65.25839671641218]
We propose a novel benchmark dataset that introduces various types of instruction errors considering potential human causes.<n>We observe a noticeable performance drop (up to -25%) in Success Rate when evaluating the state-of-the-art VLN-CE methods on our benchmark.<n>We also propose an effective method, based on a cross-modal transformer architecture, that achieves the best performance in error detection and localization.
arXiv Detail & Related papers (2024-03-15T21:36:15Z) - SATDAUG -- A Balanced and Augmented Dataset for Detecting Self-Admitted
Technical Debt [6.699060157800401]
Self-admitted technical debt (SATD) refers to a form of technical debt in which developers explicitly acknowledge and document the existence of technical shortcuts.
We share the textitSATDAUG dataset, an augmented version of existing SATD datasets, including source code comments, issue tracker, pull requests, and commit messages.
arXiv Detail & Related papers (2024-03-12T14:33:53Z) - AdaCCD: Adaptive Semantic Contrasts Discovery Based Cross Lingual
Adaptation for Code Clone Detection [69.79627042058048]
AdaCCD is a novel cross-lingual adaptation method that can detect cloned codes in a new language without annotations in that language.
We evaluate the cross-lingual adaptation results of AdaCCD by constructing a multilingual code clone detection benchmark consisting of 5 programming languages.
arXiv Detail & Related papers (2023-11-13T12:20:48Z) - XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented
Languages [105.54207724678767]
Data scarcity is a crucial issue for the development of highly multilingual NLP systems.
We propose XTREME-UP, a benchmark defined by its focus on the scarce-data scenario rather than zero-shot.
XTREME-UP evaluates the capabilities of language models across 88 under-represented languages over 9 key user-centric technologies.
arXiv Detail & Related papers (2023-05-19T18:00:03Z) - PENTACET data -- 23 Million Contextual Code Comments and 250,000 SATD
comments [3.6095388702618414]
Most Self-Admitted Technical Debt (SATD) research uses explicit SATD features such as 'TODO' and 'FIXME' for SATD detection.
This work addresses this gap through PENTACET (or 5C dataset) data.
The outcome is a dataset with 23 million code comments, preceding and succeeding source code context for each comment, and more than 250,000 comments labeled as SATD.
arXiv Detail & Related papers (2023-03-24T14:42:42Z) - MCoNaLa: A Benchmark for Code Generation from Multiple Natural Languages [76.93265104421559]
We benchmark code generation from natural language commands extending beyond English.
We annotated a total of 896 NL-code pairs in three languages: Spanish, Japanese, and Russian.
While the difficulties vary across these three languages, all systems lag significantly behind their English counterparts.
arXiv Detail & Related papers (2022-03-16T04:21:50Z) - Cross-Lingual Dialogue Dataset Creation via Outline-Based Generation [70.81596088969378]
Cross-lingual Outline-based Dialogue dataset (termed COD) enables natural language understanding.
COD enables dialogue state tracking, and end-to-end dialogue modelling and evaluation in 4 diverse languages.
arXiv Detail & Related papers (2022-01-31T18:11:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.