Tests4Py: A Benchmark for System Testing
- URL: http://arxiv.org/abs/2307.05147v2
- Date: Tue, 14 May 2024 12:34:26 GMT
- Title: Tests4Py: A Benchmark for System Testing
- Authors: Marius Smytzek, Martin Eberlein, Batuhan Serce, Lars Grunske, Andreas Zeller,
- Abstract summary: Tests4Py benchmark includes 73 bugs from seven real-world Python applications and six bugs from example programs.
Each subject in Tests4Py is equipped with an oracle for verifying functional correctness and supports both system and unit test generation.
- Score: 11.051969638361012
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Benchmarks are among the main drivers of progress in software engineering research. However, many current benchmarks are limited by inadequate system oracles and sparse unit tests. Our Tests4Py benchmark, derived from the BugsInPy benchmark, addresses these limitations. It includes 73 bugs from seven real-world Python applications and six bugs from example programs. Each subject in Tests4Py is equipped with an oracle for verifying functional correctness and supports both system and unit test generation. This allows for comprehensive qualitative studies and extensive evaluations, making Tests4Py a cutting-edge benchmark for research in test generation, debugging, and automatic program repair.
Related papers
- CLOVER: A Test Case Generation Benchmark with Coverage, Long-Context, and Verification [71.34070740261072]
This paper presents a benchmark, CLOVER, to evaluate models' capabilities in generating and completing test cases.
The benchmark is containerized for code execution across tasks, and we will release the code, data, and construction methodologies.
arXiv Detail & Related papers (2025-02-12T21:42:56Z) - ProjectTest: A Project-level LLM Unit Test Generation Benchmark and Impact of Error Fixing Mechanisms [48.43237545197775]
Unit test generation has become a promising and important use case of LLMs.
ProjectTest is a project-level benchmark for unit test generation covering Python, Java, and JavaScript.
arXiv Detail & Related papers (2025-02-10T15:24:30Z) - Do Large Language Model Benchmarks Test Reliability? [66.1783478365998]
We investigate how well current benchmarks quantify model reliability.
Motivated by this gap in the evaluation of reliability, we propose the concept of so-called platinum benchmarks.
We evaluate a wide range of models on these platinum benchmarks and find that, indeed, frontier LLMs still exhibit failures on simple tasks.
arXiv Detail & Related papers (2025-02-05T18:58:19Z) - ViUniT: Visual Unit Tests for More Robust Visual Programming [104.55763189099125]
When models answer correctly, they produce incorrect programs 33% of the time.
We propose Visual Unit Testing (ViUniT), a framework to improve the reliability of visual programs by automatically generating unit tests.
arXiv Detail & Related papers (2024-12-12T01:36:18Z) - Touchstone Benchmark: Are We on the Right Way for Evaluating AI Algorithms for Medical Segmentation? [90.30635552818875]
We present Touchstone, a large-scale collaborative segmentation benchmark of 9 types of abdominal organs.
This benchmark is based on 5,195 training CT scans from 76 hospitals around the world and 5,903 testing CT scans from 11 additional hospitals.
We invited 14 inventors of 19 AI algorithms to train their algorithms, while our team, as a third party, independently evaluated these algorithms on three test sets.
arXiv Detail & Related papers (2024-11-06T05:09:34Z) - TestGenEval: A Real World Unit Test Generation and Test Completion Benchmark [24.14654309612826]
TestGenEval comprises 68,647 tests from 1,210 code and test file pairs across 11 well-maintained Python repositories.
It covers initial tests authoring, test suite completion, and code coverage improvements.
We evaluate several popular models, with sizes ranging from 7B to 405B parameters.
arXiv Detail & Related papers (2024-10-01T14:47:05Z) - A System for Automated Unit Test Generation Using Large Language Models and Assessment of Generated Test Suites [1.4563527353943984]
Large Language Models (LLMs) have been applied to various aspects of software development.
We present AgoneTest: an automated system for generating test suites for Java projects.
arXiv Detail & Related papers (2024-08-14T23:02:16Z) - Harnessing the Power of LLMs: Automating Unit Test Generation for High-Performance Computing [7.3166218350585135]
Unit testing is crucial in software engineering for ensuring quality.
It's not widely used in parallel and high-performance computing software, particularly scientific applications.
We propose an automated method for generating unit tests for such software.
arXiv Detail & Related papers (2024-07-06T22:45:55Z) - Observation-based unit test generation at Meta [52.4716552057909]
TestGen automatically generates unit tests, carved from serialized observations of complex objects, observed during app execution.
TestGen has landed 518 tests into production, which have been executed 9,617,349 times in continuous integration, finding 5,702 faults.
Our evaluation reveals that, when carving its observations from 4,361 reliable end-to-end tests, TestGen was able to generate tests for at least 86% of the classes covered by end-to-end tests.
arXiv Detail & Related papers (2024-02-09T00:34:39Z) - Automatic Generation of Test Cases based on Bug Reports: a Feasibility
Study with Large Language Models [4.318319522015101]
Existing approaches produce test cases that either can be qualified as simple (e.g. unit tests) or that require precise specifications.
Most testing procedures still rely on test cases written by humans to form test suites.
We investigate the feasibility of performing this generation by leveraging large language models (LLMs) and using bug reports as inputs.
arXiv Detail & Related papers (2023-10-10T05:30:12Z) - Automated Support for Unit Test Generation: A Tutorial Book Chapter [21.716667622896193]
Unit testing is a stage of testing where the smallest segment of code that can be tested in isolation from the rest of the system is tested.
Unit tests are typically written as executable code, often in a format provided by a unit testing framework such as pytest for Python.
This chapter introduces the concept of search-based unit test generation.
arXiv Detail & Related papers (2021-10-26T11:13:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.