Tests4Py: A Benchmark for System Testing
- URL: http://arxiv.org/abs/2307.05147v2
- Date: Tue, 14 May 2024 12:34:26 GMT
- Title: Tests4Py: A Benchmark for System Testing
- Authors: Marius Smytzek, Martin Eberlein, Batuhan Serce, Lars Grunske, Andreas Zeller,
- Abstract summary: Tests4Py benchmark includes 73 bugs from seven real-world Python applications and six bugs from example programs.
Each subject in Tests4Py is equipped with an oracle for verifying functional correctness and supports both system and unit test generation.
- Score: 11.051969638361012
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Benchmarks are among the main drivers of progress in software engineering research. However, many current benchmarks are limited by inadequate system oracles and sparse unit tests. Our Tests4Py benchmark, derived from the BugsInPy benchmark, addresses these limitations. It includes 73 bugs from seven real-world Python applications and six bugs from example programs. Each subject in Tests4Py is equipped with an oracle for verifying functional correctness and supports both system and unit test generation. This allows for comprehensive qualitative studies and extensive evaluations, making Tests4Py a cutting-edge benchmark for research in test generation, debugging, and automatic program repair.
Related papers
- Harnessing the Power of LLMs: Automating Unit Test Generation for High-Performance Computing [7.3166218350585135]
Unit testing is crucial in software engineering for ensuring quality.
It's not widely used in parallel and high-performance computing software, particularly scientific applications.
We propose an automated method for generating unit tests for such software.
arXiv Detail & Related papers (2024-07-06T22:45:55Z) - Introducing v0.5 of the AI Safety Benchmark from MLCommons [101.98401637778638]
This paper introduces v0.5 of the AI Safety Benchmark, which has been created by the MLCommons AI Safety Working Group.
The benchmark has been designed to assess the safety risks of AI systems that use chat-tuned language models.
arXiv Detail & Related papers (2024-04-18T15:01:00Z) - Observation-based unit test generation at Meta [52.4716552057909]
TestGen automatically generates unit tests, carved from serialized observations of complex objects, observed during app execution.
TestGen has landed 518 tests into production, which have been executed 9,617,349 times in continuous integration, finding 5,702 faults.
Our evaluation reveals that, when carving its observations from 4,361 reliable end-to-end tests, TestGen was able to generate tests for at least 86% of the classes covered by end-to-end tests.
arXiv Detail & Related papers (2024-02-09T00:34:39Z) - Automatic Generation of Test Cases based on Bug Reports: a Feasibility
Study with Large Language Models [4.318319522015101]
Existing approaches produce test cases that either can be qualified as simple (e.g. unit tests) or that require precise specifications.
Most testing procedures still rely on test cases written by humans to form test suites.
We investigate the feasibility of performing this generation by leveraging large language models (LLMs) and using bug reports as inputs.
arXiv Detail & Related papers (2023-10-10T05:30:12Z) - Effective Test Generation Using Pre-trained Large Language Models and
Mutation Testing [13.743062498008555]
We introduce MuTAP for improving the effectiveness of test cases generated by Large Language Models (LLMs) in terms of revealing bugs.
MuTAP is capable of generating effective test cases in the absence of natural language descriptions of the Program Under Test (PUTs)
Our results show that our proposed method is able to detect up to 28% more faulty human-written code snippets.
arXiv Detail & Related papers (2023-08-31T08:48:31Z) - ARB: Advanced Reasoning Benchmark for Large Language Models [94.37521840642141]
We introduce ARB, a novel benchmark composed of advanced reasoning problems in multiple fields.
As a subset of ARB, we introduce a challenging set of math and physics problems which require advanced symbolic reasoning and domain knowledge.
We evaluate recent models such as GPT-4 and Claude on ARB and demonstrate that current models score well below 50% on more demanding tasks.
arXiv Detail & Related papers (2023-07-25T17:55:19Z) - Perception Test: A Diagnostic Benchmark for Multimodal Video Models [78.64546291816117]
We propose a novel multimodal video benchmark to evaluate the perception and reasoning skills of pre-trained multimodal models.
The Perception Test focuses on skills (Memory, Abstraction, Physics, Semantics) and types of reasoning (descriptive, explanatory, predictive, counterfactual) across video, audio, and text modalities.
The benchmark probes pre-trained models for their transfer capabilities, in a zero-shot / few-shot or limited finetuning regime.
arXiv Detail & Related papers (2023-05-23T07:54:37Z) - Overview of Test Coverage Criteria for Test Case Generation from Finite
State Machines Modelled as Directed Graphs [0.12891210250935145]
Test Coverage criteria are an essential concept for test engineers when generating the test cases from a System Under Test model.
Test Coverage criteria define the number of actions or combinations by which a system is tested.
This study summarized all commonly used test coverage criteria for Finite State Machines and discussed them regarding their subsumption, equivalence, or non-comparability.
arXiv Detail & Related papers (2022-03-17T20:30:14Z) - SUPERNOVA: Automating Test Selection and Defect Prevention in AAA Video
Games Using Risk Based Testing and Machine Learning [62.997667081978825]
Testing video games is an increasingly difficult task as traditional methods fail to scale with growing software systems.
We present SUPERNOVA, a system responsible for test selection and defect prevention while also functioning as an automation hub.
The direct impact of this has been observed to be a reduction in 55% or more testing hours for an undisclosed sports game title.
arXiv Detail & Related papers (2022-03-10T00:47:46Z) - Automated Support for Unit Test Generation: A Tutorial Book Chapter [21.716667622896193]
Unit testing is a stage of testing where the smallest segment of code that can be tested in isolation from the rest of the system is tested.
Unit tests are typically written as executable code, often in a format provided by a unit testing framework such as pytest for Python.
This chapter introduces the concept of search-based unit test generation.
arXiv Detail & Related papers (2021-10-26T11:13:40Z) - Beyond Accuracy: Behavioral Testing of NLP models with CheckList [66.42971817954806]
CheckList is a task-agnostic methodology for testing NLP models.
CheckList includes a matrix of general linguistic capabilities and test types that facilitate comprehensive test ideation.
In a user study, NLP practitioners with CheckList created twice as many tests, and found almost three times as many bugs as users without it.
arXiv Detail & Related papers (2020-05-08T15:48:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.