From Bugs to Benchmarks: A Comprehensive Survey of Software Defect Datasets
- URL: http://arxiv.org/abs/2504.17977v1
- Date: Thu, 24 Apr 2025 23:07:04 GMT
- Title: From Bugs to Benchmarks: A Comprehensive Survey of Software Defect Datasets
- Authors: Hao-Nan Zhu, Robert M. Furth, Michael Pradel, Cindy Rubio-González
- Abstract summary: Software defect datasets are collections of software bugs and their associated information. Over the years, numerous software defect datasets have been developed, providing rich resources for the community. This article provides a comprehensive survey of 132 software defect datasets.
- Score: 19.140541190998842
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Software defect datasets, which are collections of software bugs and their associated information, are essential resources for researchers and practitioners in software engineering and beyond. Such datasets facilitate empirical research and enable standardized benchmarking for a wide range of techniques, including fault detection, fault localization, test generation, test prioritization, automated program repair, and emerging areas like agentic AI-based software development. Over the years, numerous software defect datasets with diverse characteristics have been developed, providing rich resources for the community, yet making it increasingly difficult to navigate the landscape. To address this challenge, this article provides a comprehensive survey of 132 software defect datasets. The survey discusses the scope of existing datasets, e.g., regarding the application domain of the buggy software, the types of defects, and the programming languages used. We also examine the construction of these datasets, including the data sources and construction methods employed. Furthermore, we assess the availability and usability of the datasets, validating their availability and examining how defects are presented. To better understand the practical uses of these datasets, we analyze the publications that cite them, revealing that the primary use cases are evaluations of new techniques and empirical research. Based on our comprehensive review of the existing datasets, this paper suggests potential opportunities for future research, including addressing underrepresented kinds of defects, enhancing availability and usability through better dataset organization, and developing more efficient strategies for dataset construction and maintenance.
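To make the benchmarking use case above concrete, here is a minimal sketch of how a defect dataset entry might be consumed to evaluate an automated program repair tool. The record schema (`DefectRecord` and its fields) and the tool interfaces are hypothetical illustrations, not taken from the survey or from any particular dataset it covers.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class DefectRecord:
    """One entry of a hypothetical defect dataset (illustrative schema)."""
    dataset: str                 # name of the source dataset
    bug_id: str                  # unique identifier of the defect
    buggy_revision: str          # VCS revision that contains the bug
    fixed_revision: str          # developer-fixed revision (ground truth)
    failing_tests: List[str] = field(default_factory=list)  # tests exposing the bug

def repair_success_rate(
    records: List[DefectRecord],
    attempt_repair: Callable[[DefectRecord], Optional[str]],
    passes_tests: Callable[[DefectRecord, str], bool],
) -> float:
    """Count a bug as repaired if the tool produces a patch that makes
    the bug-exposing tests pass (a common, if simplified, criterion)."""
    repaired = 0
    for record in records:
        patch = attempt_repair(record)
        if patch is not None and passes_tests(record, patch):
            repaired += 1
    return repaired / len(records) if records else 0.0
```

Real datasets differ in how revisions, build systems, and tests are exposed, which is one reason the survey's findings on availability and usability matter in practice.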
Related papers
- Rethinking Software Misconfigurations in the Real World: An Empirical Study and Literature Analysis [9.88064494257381]
We conduct an empirical study on 823 real-world misconfiguration issues, based on which we propose a novel classification of the root causes of software misconfigurations. We find that the research targets have changed from fundamental software to advanced applications. Meanwhile, research on non-crash misconfigurations, such as performance degradation and security risks, has also grown significantly.
arXiv Detail & Related papers (2024-12-15T08:53:41Z) - Towards Understanding the Impact of Data Bugs on Deep Learning Models in Software Engineering [13.17302533571231]
Deep learning (DL) systems are prone to bugs from many sources, including training data.
Existing literature suggests that bugs in training data are highly prevalent.
We investigate three types of data prevalent in software engineering tasks: code-based, text-based, and metric-based.
arXiv Detail & Related papers (2024-11-19T00:28:20Z) - DiscoveryBench: Towards Data-Driven Discovery with Large Language Models [50.36636396660163]
We present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery.
Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering.
Our benchmark thus illustrates the challenges of autonomous data-driven discovery and serves as a valuable resource for the community to make progress.
arXiv Detail & Related papers (2024-07-01T18:58:22Z) - A PRISMA Driven Systematic Review of Publicly Available Datasets for Benchmark and Model Developments for Industrial Defect Detection [0.0]
A critical barrier to progress is the scarcity of comprehensive datasets featuring annotated defects.
This systematic review, spanning from 2015 to 2023, identifies 15 publicly available datasets.
The goal of this systematic review is to consolidate these datasets in a single location, providing researchers with a comprehensive reference.
arXiv Detail & Related papers (2024-06-11T20:14:59Z) - A Survey on Data Selection for Language Models [148.300726396877]
Data selection methods aim to determine which data points to include in a training dataset.
Deep learning is mostly driven by empirical evidence, and experimentation on large-scale data is expensive.
Few organizations have the resources for extensive data selection research.
arXiv Detail & Related papers (2024-02-26T18:54:35Z) - On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms [56.119374302685934]
There have been serious concerns over the trustworthiness of AI technologies.
Machine and deep learning algorithms depend heavily on the data used during their development.
We propose a framework to evaluate the datasets through a responsible rubric.
arXiv Detail & Related papers (2023-10-24T14:01:53Z) - A Survey of Dataset Refinement for Problems in Computer Vision Datasets [11.45536223418548]
Large-scale datasets have played a crucial role in the advancement of computer vision.
They often suffer from problems such as class imbalance, noisy labels, dataset bias, or high resource costs.
Various data-centric solutions have been proposed to address these dataset problems.
They improve the quality of datasets by re-organizing them, which we call dataset refinement.
arXiv Detail & Related papers (2022-10-21T03:58:43Z) - DataPerf: Benchmarks for Data-Centric AI Development [81.03754002516862]
DataPerf is a community-led benchmark suite for evaluating ML datasets and data-centric algorithms.
We provide an open, online platform with multiple rounds of challenges to support this iterative development.
The benchmarks, online evaluation platform, and baseline implementations are open source.
arXiv Detail & Related papers (2022-07-20T17:47:54Z) - Building Inspection Toolkit: Unified Evaluation and Strong Baselines for Damage Recognition [0.0]
We introduce the building inspection toolkit -- bikit -- which acts as a simple-to-use data hub containing relevant open-source datasets in the field of damage recognition.
The datasets are enriched with evaluation splits and predefined metrics suited to the specific task and its data distribution.
For the sake of compatibility and to motivate researchers in this domain, we also provide a leaderboard and the possibility to share model weights with the community.
arXiv Detail & Related papers (2022-02-14T20:05:59Z) - The Problem of Zombie Datasets: A Framework For Deprecating Datasets [55.878249096379804]
We examine the public afterlives of several prominent datasets, including ImageNet, 80 Million Tiny Images, MS-Celeb-1M, Duke MTMC, Brainwash, and HRT Transgender.
We propose a dataset deprecation framework that includes considerations of risk, mitigation of impact, appeal mechanisms, timeline, post-deprecation protocol, and publication checks.
arXiv Detail & Related papers (2021-10-18T20:13:51Z) - Competency Problems: On Finding and Removing Artifacts in Language Data [50.09608320112584]
We argue that for complex language understanding tasks, all simple feature correlations are spurious (a toy sketch of such an artifact follows this list).
We theoretically analyze the difficulty of creating data for competency problems when human bias is taken into account.
arXiv Detail & Related papers (2021-04-17T21:34:10Z)
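As a toy illustration of the "Competency Problems" argument (a sketch of the general phenomenon, not an example from that paper): in the synthetic dataset below, a single token happens to correlate perfectly with the label, so a classifier that checks only that token scores perfectly while understanding nothing.

```python
# Toy dataset with an artifact: every positive example happens to
# contain the token "movie", so that token correlates with the label
# by accident of construction, not by meaning.
examples = [
    ("a great movie overall", 1),
    ("loved this movie a lot", 1),
    ("the movie was fantastic", 1),
    ("terrible acting throughout", 0),
    ("boring and predictable plot", 0),
    ("a complete waste of time", 0),
]

def artifact_classifier(text: str) -> int:
    # Predicts from one surface feature; high accuracy here reflects
    # a dataset artifact, not language understanding.
    return 1 if "movie" in text.split() else 0

accuracy = sum(artifact_classifier(t) == y for t, y in examples) / len(examples)
print(f"accuracy from the single-token artifact: {accuracy:.2f}")  # prints 1.00
```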