SBFT Tool Competition 2024 -- Python Test Case Generation Track
- URL: http://arxiv.org/abs/2401.15189v1
- Date: Fri, 26 Jan 2024 20:21:15 GMT
- Title: SBFT Tool Competition 2024 -- Python Test Case Generation Track
- Authors: Nicolas Erni, Al-Ameen Mohammed Ali Mohammed, Christian Birchler, Pouria Derakhshanfar, Stephan Lukasczyk, and Sebastiano Panichella
- Abstract summary: Test case generation (TCG) for Python poses distinctive challenges due to the language's dynamic nature and the absence of strict type information.
Previous research has successfully explored automated unit TCG for Python, with solutions outperforming random test generation methods.
This paper describes our methodology, the analysis of the results together with the competing tools, and the challenges faced while running the competition experiments.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Test case generation (TCG) for Python poses distinctive challenges due to the
language's dynamic nature and the absence of strict type information. Previous
research has successfully explored automated unit TCG for Python, with
solutions outperforming random test generation methods. Nevertheless,
fundamental issues persist, hindering the practical adoption of existing test
case generators. To address these challenges, we report on the organization,
challenges, and results of the first edition of the Python Testing Competition.
Four tools, namely UTBotPython, Klara, Hypothesis Ghostwriter, and Pynguin, were
executed on a benchmark set consisting of 35 Python source files sampled from 7
open-source Python projects for a time budget of 400 seconds. We considered one
configuration of each tool for each test subject and evaluated the tools'
effectiveness in terms of code and mutation coverage. This paper describes our
methodology, the analysis of the results together with the competing tools, and
the challenges faced while running the competition experiments.
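As an illustration of the kind of pipeline the competition implies, the sketch below shows one plausible way to run a single step by hand: generate tests for one benchmark module with Pynguin (one of the four competing tools) and then measure line and branch coverage of the generated suite with pytest-cov. The project path, module name, and the choice of pytest-cov are illustrative assumptions rather than the organizers' infrastructure; only the 400-second budget comes from the abstract.

    # Minimal sketch, not the official competition pipeline. Paths and the
    # module name are hypothetical; verify flag names against the Pynguin
    # version in use.
    import os
    import subprocess
    from pathlib import Path

    PROJECT = Path("benchmark/project_a")   # hypothetical project checkout
    MODULE = "project_a.parser"             # hypothetical module under test
    TESTS_OUT = Path("generated_tests")
    TIME_BUDGET = 400                       # seconds, as in the competition

    # 1) Test generation with Pynguin. PYNGUIN_DANGER_AWARE must be set
    #    because Pynguin executes the code under test while searching.
    env = {**os.environ, "PYNGUIN_DANGER_AWARE": "1"}
    try:
        subprocess.run(
            ["pynguin",
             "--project-path", str(PROJECT),
             "--module-name", MODULE,
             "--output-path", str(TESTS_OUT)],
            env=env, timeout=TIME_BUDGET, check=False,
        )
    except subprocess.TimeoutExpired:
        pass  # keep whatever tests were written before the budget ran out

    # 2) Run the generated tests and report line/branch coverage with
    #    pytest-cov (assumes the project package is importable, e.g. via
    #    an editable install or PYTHONPATH).
    subprocess.run(
        ["pytest", str(TESTS_OUT),
         "--cov=project_a", "--cov-branch", "--cov-report=term"],
        check=False,
    )

Mutation coverage, the second metric used in the paper, would additionally require running a mutation testing tool (for example mutmut or a similar mutator) against the same generated suite; the concrete tooling and configuration used by the organizers are described in the paper itself.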
Related papers
- CLOVER: A Test Case Generation Benchmark with Coverage, Long-Context, and Verification (arXiv, 2025-02-12)
This paper presents a benchmark, CLOVER, to evaluate models' capabilities in generating and completing test cases.
The benchmark is containerized for code execution across tasks, and we will release the code, data, and construction methodologies.
- pyMethods2Test: A Dataset of Python Tests Mapped to Focal Methods (arXiv, 2025-02-07)
Python is one of the fastest-growing programming languages and currently ranks as the top language in many lists.
It is imperative to be able to effectively train LLMs to generate good unit test cases for Python code.
This motivates the need for a large dataset to provide training and testing data.
- PyPulse: A Python Library for Biosignal Imputation (arXiv, 2024-12-09)
We introduce PyPulse, a Python package for imputation of biosignals in both clinical and wearable sensor settings.
PyPulse provides a modular and extendable framework with high ease of use for a broad user base, including non-machine-learning bioresearchers.
We released PyPulse under the MIT License on GitHub and PyPI.
- Evaluating ChatGPT-3.5 Efficiency in Solving Coding Problems of Different Complexity Levels: An Empirical Analysis (arXiv, 2024-11-12)
We assess the performance of ChatGPT's GPT-3.5-turbo model on LeetCode.
We show that, first, ChatGPT solves fewer problems as difficulty rises.
Second, prompt engineering improves ChatGPT's performance.
Third, ChatGPT performs better in popular languages like Python, Java, and C++ than in less common ones like Elixir, Erlang, and Racket.
- DyPyBench: A Benchmark of Executable Python Software (arXiv, 2024-03-01)
We present DyPyBench, the first benchmark of Python projects that is large-scale, diverse, ready to run, and ready to analyze.
The benchmark encompasses 50 popular open-source projects from various application domains, with a total of 681k lines of Python code and 30k test cases.
We envision DyPyBench to provide a basis for other dynamic analyses and for studying the runtime behavior of Python code.
- Python is Not Always the Best Choice: Embracing Multilingual Program of Thoughts (arXiv, 2024-02-16)
We propose a task- and model-agnostic approach called MultiPoT, which harnesses strength and diversity from various languages.
Experimental results reveal that it significantly outperforms Python Self-Consistency.
In particular, MultiPoT achieves more than a 4.6% improvement on average on ChatGPT (gpt-3.5-turbo-0701).
- Unit Test Generation using Generative AI: A Comparative Performance Analysis of Autogeneration Tools (arXiv, 2023-12-17)
This research aims to experimentally investigate the effectiveness of Large Language Models (LLMs) for generating unit test scripts for Python programs.
For experiments, we consider three types of code units: 1) procedural scripts, 2) function-based modular code, and 3) class-based code.
Our results show that ChatGPT's performance is comparable with Pynguin's in terms of coverage, and in some cases superior to it.
- Tests4Py: A Benchmark for System Testing (arXiv, 2023-07-11)
The Tests4Py benchmark includes 73 bugs from seven real-world Python applications and six bugs from example programs.
Each subject in Tests4Py is equipped with an oracle for verifying functional correctness and supports both system and unit test generation.
- Perception Test: A Diagnostic Benchmark for Multimodal Video Models (arXiv, 2023-05-23)
We propose a novel multimodal video benchmark to evaluate the perception and reasoning skills of pre-trained multimodal models.
The Perception Test focuses on skills (Memory, Abstraction, Physics, Semantics) and types of reasoning (descriptive, explanatory, predictive, counterfactual) across video, audio, and text modalities.
The benchmark probes pre-trained models for their transfer capabilities, in a zero-shot, few-shot, or limited fine-tuning regime.
- Teaching Large Language Models to Self-Debug (arXiv, 2023-04-11)
Large language models (LLMs) have achieved impressive performance on code generation.
We propose Self-Debugging, which teaches a large language model to debug its predicted program via few-shot demonstrations.
- Measuring Coding Challenge Competence With APPS (arXiv, 2021-05-20)
We introduce APPS, a benchmark for code generation.
Our benchmark includes 10,000 problems, which range from having simple one-line solutions to being substantial algorithmic challenges.
Recent models such as GPT-Neo can pass approximately 15% of the test cases of introductory problems.
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.