Related papers: SATDAUG -- A Balanced and Augmented Dataset for Detecting Self-Admitted Technical Debt

URL: http://arxiv.org/abs/2403.07690v1
Date: Tue, 12 Mar 2024 14:33:53 GMT
Title: SATDAUG -- A Balanced and Augmented Dataset for Detecting Self-Admitted Technical Debt
Authors: Edi Sutoyo, Andrea Capiluppi
Abstract summary: Self-admitted technical debt (SATD) refers to a form of technical debt in which developers explicitly acknowledge and document the existence of technical shortcuts. We share the SATDAUG dataset, an augmented version of existing SATD datasets, including source code comments, issue tracker, pull requests, and commit messages.
Score: 6.699060157800401
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Self-admitted technical debt (SATD) refers to a form of technical debt in which developers explicitly acknowledge and document the existence of technical shortcuts, workarounds, or temporary solutions within the codebase. Over recent years, researchers have manually labeled datasets derived from various software development artifacts: source code comments, messages from the issue tracker and pull request sections, and commit messages. These datasets are designed for training, evaluation, performance validation, and improvement of machine learning and deep learning models to accurately identify SATD instances. However, class imbalance poses a serious challenge across all the existing datasets, particularly when researchers are interested in categorizing the specific types of SATD. In order to address the scarcity of labeled data for SATD identification (i.e., whether an instance is SATD or not) and categorization (i.e., which type of SATD is being classified) in existing datasets, we share the SATDAUG dataset, an augmented version of existing SATD datasets, including source code comments, issue tracker, pull requests, and commit messages. These augmented datasets have been balanced in relation to the available artifacts and provide a much richer source of labeled data for training machine learning or deep learning models.

Related papers

Improving the detection of technical debt in Java source code with an enriched dataset [12.07607688189035]
Technical debt (TD) is the additional work and costs that emerge when developers opt for a quick and easy solution to a problem. Recent research has focused on detecting Self-Admitted Technical Debts (SATDs) by analyzing comments embedded in source code. We curated the first ever dataset of TD identified by code comments, coupled with its associated source code.
arXiv Detail & Related papers (2024-11-08T10:12:33Z)
Deep Learning and Data Augmentation for Detecting Self-Admitted Technical Debt [6.004718679054704]
Self-Admitted Technical Debt (SATD) refers to circumstances where developers use textual artifacts to explain why the existing implementation is not optimal. We build on earlier research by utilizing BiLSTM architecture for the binary identification of SATD and BERT architecture for categorizing different types of SATD. We introduce a two-step approach to identify and categorize SATD across various datasets derived from different artifacts.
arXiv Detail & Related papers (2024-10-21T09:22:16Z)
A Taxonomy of Self-Admitted Technical Debt in Deep Learning Systems [13.90991624629898]
This paper empirically analyzes the presence of Self-Admitted Technical Debt (SATD) in Deep Learning systems. We derived a taxonomy of DL-specific SATD through open coding, featuring seven categories and 41 leaves.
arXiv Detail & Related papers (2024-09-18T09:21:10Z)
Towards Automatically Addressing Self-Admitted Technical Debt: How Far Are We? [17.128428286986573]
This paper empirically investigates the extent to which technical debt can be automatically paid back by neural-based generative models. We start by extracting a dataset of 5,039 Self-Admitted Technical Debt (SATD) removals from 595 open-source projects. We use this dataset to experiment with seven different generative deep learning (DL) model configurations.
arXiv Detail & Related papers (2023-08-17T12:27:32Z)
DataFinder: Scientific Dataset Recommendation from Natural Language Descriptions [100.52917027038369]
We operationalize the task of recommending datasets given a short natural language description. To facilitate this task, we build the DataFinder dataset which consists of a larger automatically-constructed training set and a smaller expert-annotated evaluation set. This system, trained on the DataFinder dataset, finds more relevant search results than existing third-party dataset search engines.
arXiv Detail & Related papers (2023-05-26T05:22:36Z)
Towards Complex Document Understanding By Discrete Reasoning [77.91722463958743]
Document Visual Question Answering (VQA) aims to understand visually-rich documents to answer questions in natural language. We introduce a new Document VQA dataset, named TAT-DQA, which consists of 3,067 document pages and 16,558 question-answer pairs. We develop a novel model named MHST that takes into account the information in multi-modalities, including text, layout and visual image, to intelligently address different types of questions.
arXiv Detail & Related papers (2022-07-25T01:43:19Z)
CAFA: Class-Aware Feature Alignment for Test-Time Adaptation [50.26963784271912]
Test-time adaptation (TTA) aims to address this challenge by adapting a model to unlabeled data at test time. We propose a simple yet effective feature alignment loss, termed Class-Aware Feature Alignment (CAFA), which simultaneously encourages a model to learn target representations in a class-discriminative manner.
arXiv Detail & Related papers (2022-06-01T03:02:07Z)
Identifying Self-Admitted Technical Debt in Issue Tracking Systems using Machine Learning [3.446864074238136]
Technical debt is a metaphor for sub-optimal solutions implemented for short-term benefits. Most work on identifying Self-Admitted Technical Debt focuses on source code comments. We propose and optimize an approach for automatically identifying SATD in issue tracking systems using machine learning.
arXiv Detail & Related papers (2022-02-04T15:15:13Z)
The Problem of Zombie Datasets: A Framework For Deprecating Datasets [55.878249096379804]
We examine the public afterlives of several prominent datasets, including ImageNet, 80 Million Tiny Images, MS-Celeb-1M, Duke MTMC, Brainwash, and HRT Transgender. We propose a dataset deprecation framework that includes considerations of risk, mitigation of impact, appeal mechanisms, timeline, post-deprecation protocol, and publication checks.
arXiv Detail & Related papers (2021-10-18T20:13:51Z)
Partially-Aligned Data-to-Text Generation with Distant Supervision [69.15410325679635]
We propose a new generation task called Partially-Aligned Data-to-Text Generation (PADTG). It is more practical since it utilizes automatically annotated data for training and thus considerably expands the application domains. Our framework outperforms all baseline models and verifies the feasibility of utilizing partially-aligned data.
arXiv Detail & Related papers (2020-10-03T03:18:52Z)
Stance Detection Benchmark: How Robust Is Your Stance Detection? [65.91772010586605]
Stance Detection (StD) aims to detect an author's stance towards a certain topic or claim. We introduce a StD benchmark that learns from ten StD datasets of various domains in a multi-dataset learning setting. Within this benchmark setup, we are able to present new state-of-the-art results on five of the datasets.
arXiv Detail & Related papers (2020-01-06T13:37:51Z)

This list is automatically generated from the titles and abstracts of the papers on this site.