Back to the Future! Studying Data Cleanness in Defects4J and its Impact on Fault Localization
- URL: http://arxiv.org/abs/2310.19139v3
- Date: Thu, 8 Aug 2024 00:41:11 GMT
- Title: Back to the Future! Studying Data Cleanness in Defects4J and its Impact on Fault Localization
- Authors: Md Nakhla Rafi, An Ran Chen, Tse-Hsun Chen, Shaohua Wang
- Abstract summary: We examine Defects4J's fault-triggering tests, emphasizing the implications of developer knowledge on SBFL techniques.
We found that 55% of the fault-triggering tests were newly added to replicate the bug or to test for regression.
We also found that 22% of the fault-triggering tests were modified after the bug reports were created, containing developer knowledge of the bug.
- Score: 3.8040257966829802
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: For software testing research, Defects4J stands out as the primary benchmark dataset, offering a controlled environment to study real bugs from prominent open-source systems. However, prior research indicates that Defects4J might include tests added post-bug report, embedding developer knowledge and affecting fault localization efficacy. In this paper, we examine Defects4J's fault-triggering tests, emphasizing the implications of developer knowledge on SBFL techniques. We study the timelines of changes made to these tests relative to bug report creation. Then, we study the effectiveness of SBFL techniques when developer knowledge is absent from the tests. We found that 1) 55% of the fault-triggering tests were newly added to replicate the bug or to test for regression; 2) 22% of the fault-triggering tests were modified after the bug reports were created, containing developer knowledge of the bug; 3) developers often modify the tests to include new assertions or change the test code to reflect the changes in the source code; and 4) the performance of SBFL techniques degrades significantly (up to 415% for Mean First Rank) when evaluated on the bugs without developer knowledge. We provide a dataset of bugs without developer insights, aiding future SBFL evaluations in Defects4J and informing considerations for future bug benchmarks.
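For context on the metrics above, the sketch below illustrates how a standard SBFL formula such as Ochiai scores program elements from coverage spectra, and how Mean First Rank (MFR) is aggregated across bugs. The code, names, and toy data are illustrative assumptions in Python, not code from the paper or its replication package.

```python
import math

def ochiai(ef, ep, nf, np_):
    """Ochiai suspiciousness for one program element.
    ef/ep: failing/passing tests that cover the element;
    nf/np_: failing/passing tests that do not cover it."""
    denom = math.sqrt((ef + nf) * (ef + ep))
    return ef / denom if denom else 0.0

def rank_elements(spectra):
    """spectra: {element: (ef, ep, nf, np)} -> elements sorted by suspiciousness (desc)."""
    scores = {e: ochiai(*counts) for e, counts in spectra.items()}
    return sorted(scores, key=scores.get, reverse=True)

def mean_first_rank(rankings, faulty):
    """MFR: average over bugs of the best (1-based) rank of any faulty element.
    Lower is better; a larger MFR means the fault sits further down the ranking."""
    firsts = []
    for bug_id, ranking in rankings.items():
        ranks = [i + 1 for i, e in enumerate(ranking) if e in faulty[bug_id]]
        firsts.append(min(ranks))  # assumes each bug's fault appears in its ranking
    return sum(firsts) / len(firsts)

# Toy spectra for one bug: (ef, ep, nf, np) per statement.
spectra = {"s1": (3, 0, 0, 5), "s3": (2, 4, 1, 1), "s7": (1, 5, 2, 0)}
print(rank_elements(spectra))  # s1 first: covered by all failing tests and no passing tests

# Toy MFR over two bugs, each with a ranked list of statements.
rankings = {
    "bug1": ["s3", "s1", "s7"],        # faulty s1 at rank 2
    "bug2": ["s5", "s2", "s9", "s4"],  # faulty s9 at rank 3
}
faulty = {"bug1": {"s1"}, "bug2": {"s9"}}
print(mean_first_rank(rankings, faulty))  # (2 + 3) / 2 = 2.5
```

A rising MFR is how the degradation reported above manifests: once tests carrying developer knowledge are excluded, the faulty elements tend to appear further down the suspiciousness ranking.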
Related papers
- Studying the Impact of Early Test Termination Due to Assertion Failure on Code Coverage and Spectrum-based Fault Localization [48.22524837906857]
This study is the first empirical study on early test termination due to assertion failure.
We investigated 207 versions of 6 open-source projects.
Our findings indicate that early test termination harms both code coverage and the effectiveness of spectrum-based fault localization.
arXiv Detail & Related papers (2025-04-06T17:14:09Z) - Design choices made by LLM-based test generators prevent them from finding bugs [0.850206009406913]
This paper critically examines whether recent LLM-based test generation tools, such as Codium CoverAgent and CoverUp, can effectively find bugs or unintentionally validate faulty code.
Using real human-written buggy code as input, we evaluate these tools, showing how LLM-generated tests can fail to detect bugs and, more alarmingly, how their design can worsen the situation by validating bugs in the generated test suite and rejecting bug-revealing tests.
arXiv Detail & Related papers (2024-12-18T18:33:26Z) - Understanding Code Understandability Improvements in Code Reviews [79.16476505761582]
We analyzed 2,401 code review comments from Java open-source projects on GitHub.
83.9% of suggestions for improvement were accepted and integrated, with fewer than 1% later reverted.
arXiv Detail & Related papers (2024-10-29T12:21:23Z) - Rethinking the Influence of Source Code on Test Case Generation [22.168699378889148]
Large language models (LLMs) have been widely applied to assist test generation with the source code under test provided as the context.
This paper aims to answer the question: If the source code under test is incorrect, will LLMs be misguided when generating tests?
Our evaluation results demonstrate that incorrect code can significantly mislead LLMs in generating correct, high-coverage, and bug-revealing tests.
arXiv Detail & Related papers (2024-09-14T15:17:34Z) - Leveraging Large Language Models for Efficient Failure Analysis in Game Development [47.618236610219554]
This paper proposes a new approach to automatically identify which change in the code caused a test to fail.
The method leverages Large Language Models (LLMs) to associate error messages with the corresponding code changes causing the failure.
Our approach reaches an accuracy of 71% in our newly created dataset, which comprises issues reported by developers at EA over a period of one year.
arXiv Detail & Related papers (2024-06-11T09:21:50Z) - Leveraging Stack Traces for Spectrum-based Fault Localization in the Absence of Failing Tests [44.13331329339185]
We introduce a new approach, SBEST, that integrates stack trace data with test coverage to enhance fault localization.
Our approach shows a significant improvement, increasing Mean Average Precision (MAP) by 32.22% and Mean Reciprocal Rank (MRR) by 17.43% over traditional stack trace ranking methods.
arXiv Detail & Related papers (2024-05-01T15:15:52Z) - DebugBench: Evaluating Debugging Capability of Large Language Models [80.73121177868357]
DebugBench is a benchmark for Large Language Models (LLMs).
It covers four major bug categories and 18 minor types in C++, Java, and Python.
We evaluate two commercial and four open-source models in a zero-shot scenario.
arXiv Detail & Related papers (2024-01-09T15:46:38Z) - An Analysis of Bugs In Persistent Memory Application [0.0]
We evaluate an open-source automatic bug-detection tool (i.e., AGAMOTTO) on an NVM-level hashing PM application.
Our validation tool was able to discover 65 new NVM-level hashing bugs in the PMDK library.
We propose a Deep Q-Learning search algorithm on top of the PM-Aware search algorithm to improve search efficiency.
arXiv Detail & Related papers (2023-07-19T23:12:01Z) - What Happens When We Fuzz? Investigating OSS-Fuzz Bug History [0.9772968596463595]
We analyzed 44,102 reported issues made public by OSS-Fuzz prior to March 12, 2022.
We identified the bug-contributing commits to estimate when the bug-containing code was introduced, and measured the timeline from introduction to detection to fix.
arXiv Detail & Related papers (2023-05-19T05:15:36Z) - Large Language Models are Few-shot Testers: Exploring LLM-based General Bug Reproduction [14.444294152595429]
The number of tests added in open source repositories due to issues was about 28% of the corresponding project test suite size.
We propose LIBRO, a framework that uses Large Language Models (LLMs), which have been shown to be capable of performing code-related tasks.
Our evaluation of LIBRO shows that, on the widely studied Defects4J benchmark, LIBRO can generate failure reproducing test cases for 33% of all studied cases.
arXiv Detail & Related papers (2022-09-23T10:50:47Z) - SUPERNOVA: Automating Test Selection and Defect Prevention in AAA Video Games Using Risk Based Testing and Machine Learning [62.997667081978825]
Testing video games is an increasingly difficult task as traditional methods fail to scale with growing software systems.
We present SUPERNOVA, a system responsible for test selection and defect prevention while also functioning as an automation hub.
The direct impact of this has been observed as a reduction of 55% or more in testing hours for an undisclosed sports game title.
arXiv Detail & Related papers (2022-03-10T00:47:46Z) - TACRED Revisited: A Thorough Evaluation of the TACRED Relation Extraction Task [80.38130122127882]
TACRED is one of the largest, most widely used crowdsourced datasets in Relation Extraction (RE).
In this paper, we investigate the questions: Have we reached a performance ceiling or is there still room for improvement?
We find that label errors account for 8% absolute F1 test error, and that more than 50% of the examples need to be relabeled.
arXiv Detail & Related papers (2020-04-30T15:07:37Z)