SBFT Tool Competition 2024 -- Python Test Case Generation Track
- URL: http://arxiv.org/abs/2401.15189v1
- Date: Fri, 26 Jan 2024 20:21:15 GMT
- Title: SBFT Tool Competition 2024 -- Python Test Case Generation Track
- Authors: Nicolas Erni, Al-Ameen Mohammed Ali Mohammed, Christian Birchler, Pouria Derakhshanfar, Stephan Lukasczyk, and Sebastiano Panichella
- Abstract summary: Test case generation (TCG) for Python poses distinctive challenges due to the language's dynamic nature and the absence of strict type information.
Previous research has successfully explored automated unit TCG for Python, with solutions outperforming random test generation methods.
This paper describes our methodology, the analysis of the results together with the competing tools, and the challenges faced while running the competition experiments.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Test case generation (TCG) for Python poses distinctive challenges due to the
language's dynamic nature and the absence of strict type information. Previous
research has successfully explored automated unit TCG for Python, with
solutions outperforming random test generation methods. Nevertheless,
fundamental issues persist, hindering the practical adoption of existing test
case generators. To address these challenges, we report on the organization,
challenges, and results of the first edition of the Python Testing Competition.
Four tools, namely UTBotPython, Klara, Hypothesis Ghostwriter, and Pynguin, were
executed on a benchmark set consisting of 35 Python source files sampled from 7
open-source Python projects for a time budget of 400 seconds. We considered one
configuration of each tool for each test subject and evaluated the tools'
effectiveness in terms of code and mutation coverage. This paper describes our
methodology, the analysis of the results together with the competing tools, and
the challenges faced while running the competition experiments.
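As an illustration of the kind of pipeline the competition implies, the sketch below shows one plausible way to run a single step by hand: generate tests for one benchmark module with Pynguin (one of the four competing tools) and then measure line and branch coverage of the generated suite with pytest-cov. The project path, module name, and the choice of pytest-cov are illustrative assumptions rather than the organizers' infrastructure; only the 400-second budget comes from the abstract.

    # Minimal sketch, not the official competition pipeline. Paths and the
    # module name are hypothetical; verify flag names against the Pynguin
    # version in use.
    import os
    import subprocess
    from pathlib import Path

    PROJECT = Path("benchmark/project_a")   # hypothetical project checkout
    MODULE = "project_a.parser"             # hypothetical module under test
    TESTS_OUT = Path("generated_tests")
    TIME_BUDGET = 400                       # seconds, as in the competition

    # 1) Test generation with Pynguin. PYNGUIN_DANGER_AWARE must be set
    #    because Pynguin executes the code under test while searching.
    env = {**os.environ, "PYNGUIN_DANGER_AWARE": "1"}
    try:
        subprocess.run(
            ["pynguin",
             "--project-path", str(PROJECT),
             "--module-name", MODULE,
             "--output-path", str(TESTS_OUT)],
            env=env, timeout=TIME_BUDGET, check=False,
        )
    except subprocess.TimeoutExpired:
        pass  # keep whatever tests were written before the budget ran out

    # 2) Run the generated tests and report line/branch coverage with
    #    pytest-cov (assumes the project package is importable, e.g. via
    #    an editable install or PYTHONPATH).
    subprocess.run(
        ["pytest", str(TESTS_OUT),
         "--cov=project_a", "--cov-branch", "--cov-report=term"],
        check=False,
    )

Mutation coverage, the second metric used in the paper, would additionally require running a mutation testing tool (for example mutmut or a similar mutator) against the same generated suite; the concrete tooling and configuration used by the organizers are described in the paper itself.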
Related papers
- CLOVER: A Test Case Generation Benchmark with Coverage, Long-Context, and Verification (arXiv, 2025-02-12)
This paper presents a benchmark, CLOVER, to evaluate models' capabilities in generating and completing test cases.
The benchmark is containerized for code execution across tasks, and we will release the code, data, and construction methodologies.
- pyMethods2Test: A Dataset of Python Tests Mapped to Focal Methods (arXiv, 2025-02-07)
Python is one of the fastest-growing programming languages and currently ranks as the top language in many lists.
It is imperative to be able to effectively train LLMs to generate good unit test cases for Python code.
This motivates the need for a large dataset to provide training and testing data.
- PyPulse: A Python Library for Biosignal Imputation (arXiv, 2024-12-09)
We introduce PyPulse, a Python package for imputation of biosignals in both clinical and wearable sensor settings.
PyPulse provides a modular and extendable framework with high ease of use for a broad user base, including non-machine-learning bioresearchers.
We released PyPulse under the MIT License on GitHub and PyPI.
- Evaluating ChatGPT-3.5 Efficiency in Solving Coding Problems of Different Complexity Levels: An Empirical Analysis (arXiv, 2024-11-12)
We assess the performance of ChatGPT's GPT-3.5-turbo model on LeetCode.
We show that, first, ChatGPT solves fewer problems as difficulty rises.
Second, prompt engineering improves ChatGPT's performance.
Third, ChatGPT performs better in popular languages like Python, Java, and C++ than in less common ones like Elixir, Erlang, and Racket.
- DyPyBench: A Benchmark of Executable Python Software (arXiv, 2024-03-01)
We present DyPyBench, the first benchmark of Python projects that is large-scale, diverse, ready to run, and ready to analyze.
The benchmark encompasses 50 popular open-source projects from various application domains, with a total of 681k lines of Python code and 30k test cases.
We envision DyPyBench to provide a basis for other dynamic analyses and for studying the runtime behavior of Python code.
- Python is Not Always the Best Choice: Embracing Multilingual Program of Thoughts (arXiv, 2024-02-16)
We propose a task- and model-agnostic approach called MultiPoT, which harnesses strength and diversity from various languages.
Experimental results reveal that it significantly outperforms Python Self-Consistency.
In particular, MultiPoT achieves more than a 4.6% improvement on average on ChatGPT (gpt-3.5-turbo-0701).
- Unit Test Generation using Generative AI: A Comparative Performance Analysis of Autogeneration Tools (arXiv, 2023-12-17)
This research aims to experimentally investigate the effectiveness of Large Language Models (LLMs) for generating unit test scripts for Python programs.
For experiments, we consider three types of code units: 1) procedural scripts, 2) function-based modular code, and 3) class-based code.
Our results show that ChatGPT's performance is comparable with Pynguin's in terms of coverage, and in some cases superior to it.
- Tests4Py: A Benchmark for System Testing (arXiv, 2023-07-11)
The Tests4Py benchmark includes 73 bugs from seven real-world Python applications and six bugs from example programs.
Each subject in Tests4Py is equipped with an oracle for verifying functional correctness and supports both system and unit test generation.
- Perception Test: A Diagnostic Benchmark for Multimodal Video Models (arXiv, 2023-05-23)
We propose a novel multimodal video benchmark to evaluate the perception and reasoning skills of pre-trained multimodal models.
The Perception Test focuses on skills (Memory, Abstraction, Physics, Semantics) and types of reasoning (descriptive, explanatory, predictive, counterfactual) across video, audio, and text modalities.
The benchmark probes pre-trained models for their transfer capabilities, in a zero-shot, few-shot, or limited fine-tuning regime.
- Teaching Large Language Models to Self-Debug (arXiv, 2023-04-11)
Large language models (LLMs) have achieved impressive performance on code generation.
We propose Self-Debugging, which teaches a large language model to debug its predicted program via few-shot demonstrations.
- Measuring Coding Challenge Competence With APPS (arXiv, 2021-05-20)
We introduce APPS, a benchmark for code generation.
Our benchmark includes 10,000 problems, which range from having simple one-line solutions to being substantial algorithmic challenges.
Recent models such as GPT-Neo can pass approximately 15% of the test cases of introductory problems.
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.