Mind the Gap: The Difference Between Coverage and Mutation Score Can
Guide Testing Efforts
- URL: http://arxiv.org/abs/2309.02395v1
- Date: Tue, 5 Sep 2023 17:05:52 GMT
- Title: Mind the Gap: The Difference Between Coverage and Mutation Score Can
Guide Testing Efforts
- Authors: Kush Jain, Goutamkumar Tulajappa Kalburgi, Claire Le Goues, Alex Groce
- Abstract summary: An "adequate" test suite should effectively find all inconsistencies between a system's requirements/specifications and its implementation.
Practitioners frequently use code coverage to approximate adequacy, while academics argue that mutation score may better approximate true (oracular) adequacy coverage.
We propose a new framework for reasoning about the extent, limits, and nature of a given testing effort based on an idea we call the oracle gap.
- Score: 8.128730027609471
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: An "adequate" test suite should effectively find all inconsistencies between
a system's requirements/specifications and its implementation. Practitioners
frequently use code coverage to approximate adequacy, while academics argue
that mutation score may better approximate true (oracular) adequacy coverage.
High code coverage is increasingly attainable even on large systems via
automatic test generation, including fuzzing. In light of all of these options
for measuring and improving testing effort, how should a QA engineer spend
their time? We propose a new framework for reasoning about the extent, limits,
and nature of a given testing effort based on an idea we call the oracle gap,
or the difference between source code coverage and mutation score for a given
software element. We conduct (1) a large-scale observational study of the
oracle gap across popular Maven projects, (2) a study that varies testing and
oracle quality across several of those projects and (3) a small-scale
observational study of highly critical, well-tested code across comparable
blockchain projects. We show that the oracle gap surfaces important information
about the extent and quality of a test effort beyond either adequacy metric
alone. In particular, it provides a way for practitioners to identify source
files where important code is likely tested by a weak oracle.
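The abstract defines the oracle gap as the difference between source code coverage and mutation score for a given software element. A minimal sketch of that arithmetic at file granularity follows. The names `FileMetrics` and `rank_by_oracle_gap`, the use of line coverage, and the `min_gap` threshold are illustrative assumptions, not the authors' tooling.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class FileMetrics:
    """Per-file measurements (hypothetical structure, for illustration only)."""
    path: str
    covered_lines: int
    total_lines: int
    killed_mutants: int
    total_mutants: int

    @property
    def coverage(self) -> float:
        # Fraction of lines executed by the test suite (assumes line coverage).
        return self.covered_lines / self.total_lines

    @property
    def mutation_score(self) -> float:
        # Fraction of generated mutants that the test suite detects (kills).
        return self.killed_mutants / self.total_mutants

    @property
    def oracle_gap(self) -> float:
        # Coverage minus mutation score, per the abstract's definition.
        return self.coverage - self.mutation_score


def rank_by_oracle_gap(files: Iterable[FileMetrics],
                       min_gap: float = 0.0) -> List[FileMetrics]:
    """Return files whose gap exceeds min_gap, largest gap first.

    Files with no lines or no mutants are skipped because neither ratio
    is defined for them (an assumption of this sketch).
    """
    measurable = [f for f in files if f.total_lines > 0 and f.total_mutants > 0]
    flagged = [f for f in measurable if f.oracle_gap > min_gap]
    return sorted(flagged, key=lambda f: f.oracle_gap, reverse=True)


if __name__ == "__main__":
    sample = [
        FileMetrics("A.java", covered_lines=90, total_lines=100,
                    killed_mutants=30, total_mutants=60),
        FileMetrics("B.java", covered_lines=50, total_lines=100,
                    killed_mutants=45, total_mutants=90),
    ]
    for f in rank_by_oracle_gap(sample, min_gap=0.1):
        print(f"{f.path}: gap={f.oracle_gap:.2f}")
```

A file with high coverage but a low mutation score has a large gap: its code is executed by tests, yet the tests' checks rarely notice when that code is changed, which is the weak-oracle situation the abstract describes.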
Related papers
- Codev-Bench: How Do LLMs Understand Developer-Centric Code Completion? [60.84912551069379]
We present the Code-Development Benchmark (Codev-Bench), a fine-grained, real-world, repository-level, and developer-centric evaluation framework.
Codev-Agent is an agent-based system that automates repository crawling, constructs execution environments, extracts dynamic calling chains from existing unit tests, and generates new test samples to avoid data leakage.
arXiv Detail & Related papers (2024-10-02T09:11:10Z)
- Which Combination of Test Metrics Can Predict Success of a Software Project? A Case Study in a Year-Long Project Course [1.553083901660282]
Testing plays an important role in securing the success of a software development project.
We investigate whether we can quantify the effects various types of testing have on functional suitability.
arXiv Detail & Related papers (2024-08-22T04:23:51Z)
- SWT-Bench: Testing and Validating Real-World Bug-Fixes with Code Agents [10.730852617039451]
We investigate the capability of LLM-based Code Agents to formalize user issues into test cases.
We propose a novel benchmark based on popular GitHub repositories, containing real-world issues, ground-truth bug-fixes, and golden tests.
We find that LLMs generally perform surprisingly well at generating relevant test cases, with Code Agents designed for code repair exceeding the performance of systems designed for test generation.
arXiv Detail & Related papers (2024-06-18T14:54:37Z)
- TESTEVAL: Benchmarking Large Language Models for Test Case Generation [15.343859279282848]
We propose TESTEVAL, a novel benchmark for test case generation with large language models (LLMs).
We collect 210 Python programs from an online programming platform, LeetCode, and design three different tasks: overall coverage, targeted line/branch coverage, and targeted path coverage.
We find that generating test cases to cover specific program lines/branches/paths is still challenging for current LLMs.
arXiv Detail & Related papers (2024-06-06T22:07:50Z)
- A Comprehensive Library for Benchmarking Multi-class Visual Anomaly Detection [52.228708947607636]
This paper introduces a comprehensive visual anomaly detection benchmark, ADer, which is a modular framework for new methods.
The benchmark includes multiple datasets from industrial and medical domains, implementing fifteen state-of-the-art methods and nine comprehensive metrics.
We objectively reveal the strengths and weaknesses of different methods and provide insights into the challenges and future directions of multi-class visual anomaly detection.
arXiv Detail & Related papers (2024-06-05T13:40:07Z)
- Test Oracle Automation in the era of LLMs [52.69509240442899]
Large Language Models (LLMs) have demonstrated remarkable proficiency in tackling diverse software testing tasks.
This paper aims to enable discussions on the potential of using LLMs for test oracle automation, along with the challenges that may emerge during the generation of various types of oracles.
arXiv Detail & Related papers (2024-05-21T13:19:10Z)
- Towards Reliable AI: Adequacy Metrics for Ensuring the Quality of System-level Testing of Autonomous Vehicles [5.634825161148484]
We introduce a set of black-box test adequacy metrics called "Test suite Instance Space Adequacy" (TISA) metrics.
The TISA metrics offer a way to assess both the diversity and coverage of the test suite and the range of bugs detected during testing.
We evaluate the efficacy of the TISA metrics by examining their correlation with the number of bugs detected in system-level simulation testing of AVs.
arXiv Detail & Related papers (2023-11-14T10:16:05Z)
- Test-Time Self-Adaptive Small Language Models for Question Answering [63.91013329169796]
We investigate the capabilities of smaller self-adaptive LMs using only unlabeled test data.
Our proposed self-adaption strategy demonstrates significant performance improvements on benchmark QA datasets.
arXiv Detail & Related papers (2023-10-20T06:49:32Z)
- Perfect is the enemy of test oracle [1.457696018869121]
Test oracles rely on a ground truth that can distinguish between correct and buggy behavior to determine whether a test fails (detects a bug) or passes.
This paper presents SEER, a learning-based approach that, in the absence of test assertions, can determine whether a unit test passes or fails on a given method under test (MUT).
Our experiments on applying SEER to more than 5K unit tests from a diverse set of open-source Java projects show that the produced oracle is effective in predicting the fail or pass labels.
arXiv Detail & Related papers (2023-02-03T01:49:33Z)
- Uncertainty Baselines: Benchmarks for Uncertainty & Robustness in Deep Learning [66.59455427102152]
We introduce Uncertainty Baselines: high-quality implementations of standard and state-of-the-art deep learning methods on a variety of tasks.
Each baseline is a self-contained experiment pipeline with easily reusable and extendable components.
We provide model checkpoints, experiment outputs as Python notebooks, and leaderboards for comparing results.
arXiv Detail & Related papers (2021-06-07T23:57:32Z)
- Measuring Coding Challenge Competence With APPS [54.22600767666257]
We introduce APPS, a benchmark for code generation.
Our benchmark includes 10,000 problems, which range from having simple one-line solutions to being substantial algorithmic challenges.
Recent models such as GPT-Neo can pass approximately 15% of the test cases of introductory problems.
arXiv Detail & Related papers (2021-05-20T17:58:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.