SBFT Tool Competition 2024 -- Python Test Case Generation Track
- URL: http://arxiv.org/abs/2401.15189v1
- Date: Fri, 26 Jan 2024 20:21:15 GMT
- Title: SBFT Tool Competition 2024 -- Python Test Case Generation Track
- Authors: Nicolas Erni and Al-Ameen Mohammed Ali Mohammed and Christian Birchler
and Pouria Derakhshanfar and Stephan Lukasczyk and Sebastiano Panichella
- Abstract summary: Test case generation (TCG) for Python poses distinctive challenges due to the language's dynamic nature and the absence of strict type information.
Previous research has successfully explored automated unit TCG for Python, with solutions outperforming random test generation methods.
This paper describes our methodology, the analysis of the results together with the competing tools, and the challenges faced while running the competition experiments.
- Score: 4.149356993529412
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Test case generation (TCG) for Python poses distinctive challenges due to the
language's dynamic nature and the absence of strict type information. Previous
research has successfully explored automated unit TCG for Python, with
solutions outperforming random test generation methods. Nevertheless,
fundamental issues persist, hindering the practical adoption of existing test
case generators. To address these challenges, we report on the organization,
challenges, and results of the first edition of the Python Testing Competition.
Four tools, namely UTBotPython, Klara, Hypothesis Ghostwriter, and Pynguin, were
executed on a benchmark set consisting of 35 Python source files sampled from 7
open-source Python projects for a time budget of 400 seconds. We considered one
configuration of each tool for each test subject and evaluated the tools'
effectiveness in terms of code and mutation coverage. This paper describes our
methodology, the analysis of the results together with the competing tools, and
the challenges faced while running the competition experiments.
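To make the evaluation setup more concrete, here is a minimal sketch, assuming one directory per benchmark subject, of how line and branch coverage of a tool's generated tests could be measured with coverage.py. It is not the official competition pipeline; the paths, the pytest invocation, and the omitted mutation-analysis step are placeholders.

```python
"""
Hypothetical sketch of a per-subject evaluation step: run the tests a tool
generated for one benchmark module under coverage.py and report coverage.
NOT the official competition pipeline; all paths, and the omitted
mutation-analysis step, are placeholders.
"""
import json
import subprocess
from pathlib import Path


def measure_coverage(subject_dir: Path, generated_tests: Path) -> dict:
    """Run pytest on the generated tests under coverage.py and return totals."""
    # Collect line and branch coverage while executing the generated tests.
    subprocess.run(
        ["coverage", "run", "--branch", "-m", "pytest", str(generated_tests)],
        cwd=subject_dir,
        check=False,  # some generated tests may fail; coverage is still recorded
    )
    # Export the measurements to coverage.json for programmatic inspection.
    subprocess.run(["coverage", "json"], cwd=subject_dir, check=False)
    totals = json.loads((subject_dir / "coverage.json").read_text())["totals"]
    return {
        "percent_covered": totals["percent_covered"],  # overall percentage reported by coverage.py
        "covered_branches": totals.get("covered_branches", 0),
        "num_branches": totals.get("num_branches", 0),
    }


if __name__ == "__main__":
    # Placeholder paths standing in for one of the 35 benchmark subjects.
    print(measure_coverage(Path("subjects/project_1"), Path("tests/test_module_1.py")))
```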
Related papers
- DyPyBench: A Benchmark of Executable Python Software [18.129031749321058]
We present DyPyBench, the first benchmark of Python projects that is large scale, diverse, ready to run and ready to analyze.
The benchmark encompasses 50 popular opensource projects from various application domains, with a total of 681k lines of Python code, and 30k test cases.
We envision DyPyBench to provide a basis for other dynamic analyses and for studying the runtime behavior of Python code.
arXiv Detail & Related papers (2024-03-01T13:53:15Z)
- Python is Not Always the Best Choice: Embracing Multilingual Program of Thoughts [51.49688654641581]
We propose a task and model agnostic approach called MultiPoT, which harnesses strength and diversity from various languages.
Experimental results reveal that it significantly outperforms Python Self-Consistency.
In particular, MultiPoT achieves an average improvement of more than 4.6% on ChatGPT (gpt-3.5-turbo-0701).
arXiv Detail & Related papers (2024-02-16T13:48:06Z)
- BugsInPy: A Database of Existing Bugs in Python Programs to Enable Controlled Testing and Debugging Studies [8.746971239693066]
For the first time, Python outperformed Java in the Stack Overflow developer survey.
This is in stark contrast with the abundance of testing and debug tools for Java.
In this project, we create a benchmark database and tool that contain 493 real bugs from 17 real-world Python programs.
arXiv Detail & Related papers (2024-01-27T19:07:34Z)
- Unit Test Generation using Generative AI: A Comparative Performance Analysis of Autogeneration Tools [2.0686733932673604]
This research aims to experimentally investigate the effectiveness of Large Language Models (LLMs) for generating unit test scripts for Python programs.
For experiments, we consider three types of code units: 1) Procedural scripts, 2) Function-based modular code, and 3) Class-based code.
Our results show that ChatGPT's coverage is comparable with Pynguin's, and in some cases superior to it.
arXiv Detail & Related papers (2023-12-17T06:38:11Z)
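As a purely illustrative aid for the comparative study above, the snippet below sketches roughly what its three code-unit categories look like; none of it is taken from the paper.

```python
# Illustrative only: rough shapes of the three code-unit types compared in the
# study above; this code does not come from the paper itself.

# 1) Procedural script: top-level statements executed in order.
values = [3, 1, 2]
values.sort()
print(values)


# 2) Function-based modular code: behaviour exposed through free functions.
def moving_average(xs: list[float], window: int) -> list[float]:
    """Return the simple moving average of xs for the given window size."""
    return [sum(xs[i:i + window]) / window for i in range(len(xs) - window + 1)]


# 3) Class-based code: state and behaviour bundled in a class.
class Counter:
    def __init__(self) -> None:
        self.count = 0

    def increment(self) -> int:
        self.count += 1
        return self.count
```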
- Tests4Py: A Benchmark for System Testing [11.051969638361012]
The Tests4Py benchmark includes 73 bugs from seven real-world Python applications and six bugs from example programs.
Each subject in Tests4Py is equipped with an oracle for verifying functional correctness and supports both system and unit test generation.
arXiv Detail & Related papers (2023-07-11T10:04:52Z)
- Perception Test: A Diagnostic Benchmark for Multimodal Video Models [78.64546291816117]
We propose a novel multimodal video benchmark to evaluate the perception and reasoning skills of pre-trained multimodal models.
The Perception Test focuses on skills (Memory, Abstraction, Physics, Semantics) and types of reasoning (descriptive, explanatory, predictive, counterfactual) across video, audio, and text modalities.
The benchmark probes pre-trained models for their transfer capabilities, in a zero-shot / few-shot or limited finetuning regime.
arXiv Detail & Related papers (2023-05-23T07:54:37Z)
- Teaching Large Language Models to Self-Debug [62.424077000154945]
Large language models (LLMs) have achieved impressive performance on code generation.
We propose Self-Debugging, which teaches a large language model to debug its predicted program via few-shot demonstrations.
arXiv Detail & Related papers (2023-04-11T10:43:43Z)
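A loose sketch of such a self-debugging loop, in the spirit of the paper above but not its actual prompts or implementation, where llm() is a hypothetical placeholder:

```python
"""
Hypothetical sketch of a self-debugging loop: generate a program, run its
tests, and feed the failure output back to the model for another attempt.
The llm() helper and the prompt wording are placeholders, not the paper's.
"""
import subprocess


def llm(prompt: str) -> str:
    """Placeholder for a call to a large language model; not a real API."""
    raise NotImplementedError


def run_tests(test_path: str) -> tuple[bool, str]:
    """Run pytest on the given test file and return (passed, combined output)."""
    result = subprocess.run(
        ["python", "-m", "pytest", test_path],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0, result.stdout + result.stderr


def self_debug(task: str, test_path: str, max_rounds: int = 3) -> str:
    """Generate a program, then iteratively refine it using test feedback."""
    code = llm(f"Write a Python program for this task:\n{task}")
    for _ in range(max_rounds):
        with open("candidate.py", "w") as handle:
            handle.write(code)
        passed, feedback = run_tests(test_path)
        if passed:
            break
        # Feed the failure output back so the model can explain and repair
        # its own program, as in few-shot self-debugging.
        code = llm(
            "The program below fails its tests.\n\n"
            f"Program:\n{code}\n\nTest output:\n{feedback}\n\n"
            "Explain the bug and return a corrected program."
        )
    return code
```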
- CodeT: Code Generation with Generated Tests [49.622590050797236]
We explore the use of pre-trained language models to automatically generate test cases.
CodeT executes the code solutions using the generated test cases, and then chooses the best solution.
We evaluate CodeT on five different pre-trained models with both HumanEval and MBPP benchmarks.
arXiv Detail & Related papers (2022-07-21T10:18:37Z)
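A loose reconstruction of the selection idea described above (not the paper's code): every candidate solution is run against the generated tests, solutions are grouped by the exact set of tests they pass, and the group with the strongest agreement wins.

```python
"""
Loose sketch of CodeT-style test-based selection (not the paper's code):
run every candidate against every generated test, group candidates by the
set of tests they pass, and rank groups by
(#solutions in group) * (#tests passed) as a form of execution agreement.
"""
from collections import defaultdict
from typing import Callable, Sequence


def select_solution(
    solutions: Sequence[Callable[..., object]],
    tests: Sequence[Callable[[Callable[..., object]], bool]],
) -> Callable[..., object]:
    groups: dict[frozenset[int], list[int]] = defaultdict(list)
    for s_idx, solution in enumerate(solutions):
        passed = set()
        for t_idx, test in enumerate(tests):
            try:
                if test(solution):
                    passed.add(t_idx)
            except Exception:
                pass  # a crashing test counts as a failure
        groups[frozenset(passed)].append(s_idx)

    # More solutions agreeing on a larger set of passed tests is taken as
    # stronger evidence of correctness.
    best_key = max(groups, key=lambda k: len(groups[k]) * len(k))
    return solutions[groups[best_key][0]]


if __name__ == "__main__":
    # Toy example: two candidate implementations of absolute value.
    candidates = [lambda x: abs(x), lambda x: x]
    generated_tests = [lambda f: f(-2) == 2, lambda f: f(3) == 3]
    print(select_solution(candidates, generated_tests)(-5))  # -> 5
```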
- pyWATTS: Python Workflow Automation Tool for Time Series [0.20315704654772418]
pyWATTS is a non-sequential workflow automation tool for the analysis of time series data.
pyWATTS includes modules with clearly defined interfaces to enable seamless integration of new or existing methods.
pyWATTS supports key Python machine learning libraries such as scikit-learn, PyTorch, and Keras.
arXiv Detail & Related papers (2021-06-18T14:50:11Z)
- Measuring Coding Challenge Competence With APPS [54.22600767666257]
We introduce APPS, a benchmark for code generation.
Our benchmark includes 10,000 problems, which range from having simple one-line solutions to being substantial algorithmic challenges.
Recent models such as GPT-Neo can pass approximately 15% of the test cases of introductory problems.
arXiv Detail & Related papers (2021-05-20T17:58:42Z)
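For intuition, the sketch below shows two standard ways such pass rates can be summarised; it is an approximation in the spirit of the metrics reported for APPS, not its evaluation code.

```python
"""
Illustrative sketch (not the APPS evaluation code) of two ways to summarise
"passing test cases": the average per-problem fraction of passed tests, and
the stricter rate of problems where every test passes.
"""
from typing import Sequence


def test_case_average(results: Sequence[Sequence[bool]]) -> float:
    """Mean over problems of the fraction of test cases passed."""
    return sum(sum(r) / len(r) for r in results) / len(results)


def strict_accuracy(results: Sequence[Sequence[bool]]) -> float:
    """Fraction of problems for which all test cases pass."""
    return sum(all(r) for r in results) / len(results)


if __name__ == "__main__":
    # results[i][j] is True if the solution to problem i passed test case j.
    results = [[True, True, False], [True, True, True], [False, False, False]]
    print(test_case_average(results))  # ~0.56
    print(strict_accuracy(results))    # ~0.33
```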
- OPFython: A Python-Inspired Optimum-Path Forest Classifier [68.8204255655161]
This paper proposes a Python-based Optimum-Path Forest framework, denoted as OPFython.
As OPFython is a Python-based library, it provides a more friendly environment and a faster prototyping workspace than the C language.
arXiv Detail & Related papers (2020-01-28T15:46:19Z)