Mokav: Execution-driven Differential Testing with LLMs
- URL: http://arxiv.org/abs/2406.10375v2
- Date: Thu, 31 Jul 2025 10:44:12 GMT
- Title: Mokav: Execution-driven Differential Testing with LLMs
- Authors: Khashayar Etemadi, Bardia Mohammadi, Zhendong Su, Martin Monperrus,
- Abstract summary: Mokav is an execution-driven tool that generates difference exposing tests.<n>Makove can generate DETs for 81.7% (1,255/1535) of the program pairs in our benchmark.
- Score: 13.476622148328367
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: It is essential to detect functional differences between programs in various software engineering tasks, such as automated program repair, mutation testing, and code refactoring. The problem of detecting functional differences between two programs can be reduced to searching for a difference exposing test (DET): a test input that results in different outputs on the subject programs. In this paper, we propose Mokav, a novel execution-driven tool that leverages LLMs to generate DETs. Mokav takes two versions of a program (P and Q) and an example test input. When successful, Mokav generates a valid DET, a test input that leads to provably different outputs on P and Q. Mokav iteratively prompts an LLM with a specialized prompt to generate new test inputs. At each iteration, Mokav provides execution-based feedback from previously generated tests until the LLM produces a DET. We evaluate Mokav on 1535 pairs of Python programs collected from the Codeforces competition platform and 32 pairs of programs from the QuixBugs dataset. Our experiments show that Mokav outperforms the state-of-the-art, Pynguin and Differential Prompting, by a large margin. Mokav can generate DETs for 81.7% (1,255/1535) of the program pairs in our benchmark (versus 4.9% for Pynguin and 37.3% for Differential Prompting). We demonstrate that the iterative and execution-driven feedback components of the system contribute to its high effectiveness.
Related papers
- Directed Grammar-Based Test Generation [2.0948216657769616]
This work proposes an automated test generation approach (called FdLoop)<n>FdLoop iteratively learns relevant input properties from existing inputs to drive the generation of goal-specific inputs.<n>We evaluate FdLoop using three (3) well-known input formats (JSON, CSS and JavaScript) and 20 open-source software.
arXiv Detail & Related papers (2025-08-02T19:43:15Z) - Sample, Don't Search: Rethinking Test-Time Alignment for Language Models [55.2480439325792]
We introduce QAlign, a new test-time alignment approach.
As we scale test-time compute, QAlign converges to sampling from the optimal aligned distribution for each individual prompt.
By adopting recent advances in Markov chain Monte Carlo for text generation, our method enables better-aligned outputs without modifying the underlying model or even requiring logit access.
arXiv Detail & Related papers (2025-04-04T00:41:40Z) - Mutation Testing via Iterative Large Language Model-Driven Scientific Debugging [10.334617290353192]
We evaluate whether Scientific computation can help Large Language Models (LLMs) to generate tests for mutants.
LLMs consistently outperform Pynguin in generating tests with better fault detection and coverage.
Importantly, we observe that the iterative refinement of test cases is important for achieving high-quality test suites.
arXiv Detail & Related papers (2025-03-11T08:47:13Z) - Learning to Generate Unit Tests for Automated Debugging [52.63217175637201]
Unit tests (UTs) play an instrumental role in assessing code correctness as well as providing feedback to large language models (LLMs)
We propose UTGen, which teaches LLMs to generate unit test inputs that reveal errors along with their correct expected outputs.
We show that UTGen outperforms other LLM-based baselines by 7.59% based on a metric measuring the presence of both error-revealing UT inputs and correct UT outputs.
arXiv Detail & Related papers (2025-02-03T18:51:43Z) - $\mathbb{USCD}$: Improving Code Generation of LLMs by Uncertainty-Aware Selective Contrastive Decoding [64.00025564372095]
Large language models (LLMs) have shown remarkable capabilities in code generation.
The effects of hallucinations (e.g., output noise) make it challenging for LLMs to generate high-quality code in one pass.
We propose a simple and effective textbfuncertainty-aware textbfselective textbfcontrastive textbfdecoding.
arXiv Detail & Related papers (2024-09-09T02:07:41Z) - Benchmarking Uncertainty Quantification Methods for Large Language Models with LM-Polygraph [83.90988015005934]
Uncertainty quantification (UQ) is a critical component of machine learning (ML) applications.
We introduce a novel benchmark that implements a collection of state-of-the-art UQ baselines.
We conduct a large-scale empirical investigation of UQ and normalization techniques across nine tasks, and identify the most promising approaches.
arXiv Detail & Related papers (2024-06-21T20:06:31Z) - TESTEVAL: Benchmarking Large Language Models for Test Case Generation [15.343859279282848]
We propose TESTEVAL, a novel benchmark for test case generation with large language models (LLMs)
We collect 210 Python programs from an online programming platform, LeetCode, and design three different tasks: overall coverage, targeted line/branch coverage, and targeted path coverage.
We find that generating test cases to cover specific program lines/branches/paths is still challenging for current LLMs.
arXiv Detail & Related papers (2024-06-06T22:07:50Z) - LLM-Powered Test Case Generation for Detecting Tricky Bugs [30.82169191775785]
AID generates test inputs and oracles targeting plausibly correct programs.
We evaluate AID on two large-scale datasets with tricky bugs: TrickyBugs and EvalPlus.
The evaluation results show that the recall, precision, and F1 score of AID outperform the state-of-the-art by up to 1.80x, 2.65x, and 1.66x, respectively.
arXiv Detail & Related papers (2024-04-16T06:20:06Z) - LEVER: Learning to Verify Language-to-Code Generation with Execution [64.36459105535]
We propose LEVER, a simple approach to improve language-to-code generation by learning to verify the generated programs with their execution results.
Specifically, we train verifiers to determine whether a program sampled from the LLMs is correct or not based on the natural language input, the program itself and its execution results.
LEVER consistently improves over the base code LLMs(4.6% to 10.9% with code-davinci) and achieves new state-of-the-art results on all of them.
arXiv Detail & Related papers (2023-02-16T18:23:22Z) - An Empirical Evaluation of Using Large Language Models for Automated
Unit Test Generation [3.9762912548964864]
This paper presents a large-scale empirical evaluation on the effectiveness of Large Language Models for automated unit test generation.
We implement our approach in TestPilot, a test generation tool for JavaScript that automatically generates unit tests for all API functions in an npm package.
We find that 92.8% of TestPilot's generated tests have no more than 50% similarity with existing tests.
arXiv Detail & Related papers (2023-02-13T17:13:41Z) - UniPC: A Unified Predictor-Corrector Framework for Fast Sampling of
Diffusion Models [92.43617471204963]
Diffusion probabilistic models (DPMs) have demonstrated a very promising ability in high-resolution image synthesis.
We develop a unified corrector (UniC) that can be applied after any existing DPM sampler to increase the order of accuracy.
We propose a unified predictor-corrector framework called UniPC for the fast sampling of DPMs.
arXiv Detail & Related papers (2023-02-09T18:59:48Z) - Fault-Aware Neural Code Rankers [64.41888054066861]
We propose fault-aware neural code rankers that can predict the correctness of a sampled program without executing it.
Our fault-aware rankers can significantly increase the pass@1 accuracy of various code generation models.
arXiv Detail & Related papers (2022-06-04T22:01:05Z) - Learning from Self-Sampled Correct and Partially-Correct Programs [96.66452896657991]
We propose to let the model perform sampling during training and learn from both self-sampled fully-correct programs and partially-correct programs.
We show that our use of self-sampled correct and partially-correct programs can benefit learning and help guide the sampling process.
Our proposed method improves the pass@k performance by 3.1% to 12.3% compared to learning from a single reference program with MLE.
arXiv Detail & Related papers (2022-05-28T03:31:07Z) - Design and Implementation of a Quantum Kernel for Natural Language
Processing [0.8702432681310401]
This thesis leverage the DisCoCat model to design a quantum-based kernel function that can be used by a support vector machine (SVM) for NLP tasks.
Two similarity measures were studied: (i) the transition amplitude approach and (ii) the SWAP test.
The explicit model from previous work was used to train word embeddings and achieved a testing accuracy of $93.09 pm 0.01$%.
arXiv Detail & Related papers (2022-05-13T00:45:46Z) - What is the Vocabulary of Flaky Tests? An Extended Replication [0.0]
We conduct an empirical study to assess the use of code identifiers to predict test flakiness.
We validated the performance of trained models using datasets with other flaky tests and from different projects.
arXiv Detail & Related papers (2021-03-23T16:42:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.