Related papers: LLMorpheus: Mutation Testing using Large Language Models

LLMorpheus: Mutation Testing using Large Language Models

URL: http://arxiv.org/abs/2404.09952v1
Date: Mon, 15 Apr 2024 17:25:14 GMT
Title: LLMorpheus: Mutation Testing using Large Language Models
Authors: Frank Tip, Jonathan Bell, Max Schäfer,
Abstract summary: This paper presents a technique where a Large Language Model (LLM) is prompted to suggest mutations by asking it what placeholders that have been inserted in source code could be replaced with. We find LLMorpheus to be capable of producing mutants that resemble existing bugs that cannot be produced by StrykerJS, a state-of-the-art mutation testing tool.
Score: 7.312170216336085
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In mutation testing, the quality of a test suite is evaluated by introducing faults into a program and determining whether the program's tests detect them. Most existing approaches for mutation testing involve the application of a fixed set of mutation operators, e.g., replacing a "+" with a "-" or removing a function's body. However, certain types of real-world bugs cannot easily be simulated by such approaches, limiting their effectiveness. This paper presents a technique where a Large Language Model (LLM) is prompted to suggest mutations by asking it what placeholders that have been inserted in source code could be replaced with. The technique is implemented in LLMorpheus, a mutation testing tool for JavaScript, and evaluated on 13 subject packages, considering several variations on the prompting strategy, and using several LLMs. We find LLMorpheus to be capable of producing mutants that resemble existing bugs that cannot be produced by StrykerJS, a state-of-the-art mutation testing tool. Moreover, we report on the running time, cost, and number of mutants produced by LLMorpheus, demonstrating its practicality.

Related papers

LLAMA: Multi-Feedback Smart Contract Fuzzing Framework with LLM-Guided Seed Generation [56.84049855266145]
We propose a Multi-feedback Smart Contract Fuzzing framework (LLAMA) that integrates evolutionary mutation strategies, and hybrid testing techniques.<n>LLAMA achieves 91% instruction coverage and 90% branch coverage, while detecting 132 out of 148 known vulnerabilities.<n>These results highlight LLAMA's effectiveness, adaptability, and practicality in real-world smart contract security testing scenarios.
arXiv Detail & Related papers (2025-07-16T09:46:58Z)
Mutation Testing via Iterative Large Language Model-Driven Scientific Debugging [10.334617290353192]
We evaluate whether Scientific computation can help Large Language Models (LLMs) to generate tests for mutants. LLMs consistently outperform Pynguin in generating tests with better fault detection and coverage. Importantly, we observe that the iterative refinement of test cases is important for achieving high-quality test suites.
arXiv Detail & Related papers (2025-03-11T08:47:13Z)
Fine-Tuning LLMs for Code Mutation: A New Era of Cyber Threats [0.9208007322096533]
This paper explores the application of Large Language Models in the context of code mutation. Traditionally, code mutation has been employed to increase software robustness in mission-critical applications. We propose a novel definition of code mutation training tailored for pre-trained LLM-based code synthesizers.
arXiv Detail & Related papers (2024-10-29T17:43:06Z)
FuzzCoder: Byte-level Fuzzing Test via Large Language Model [46.18191648883695]
We propose to adopt fine-tuned large language models (FuzzCoder) to learn patterns in the input files from successful attacks. FuzzCoder can predict mutation locations and strategies locations in input files to trigger abnormal behaviors of the program.
arXiv Detail & Related papers (2024-09-03T14:40:31Z)
An Exploratory Study on Using Large Language Models for Mutation Testing [32.91472707292504]
Large Language Models (LLMs) have shown great potential in code-related tasks but their utility in mutation testing remains unexplored. This paper investigates the performance of LLMs in generating effective mutations to their usability, fault detection potential, and relationship with real bugs. We find that compared to existing approaches, LLMs generate more diverse mutations that are behaviorally closer to real bugs.
arXiv Detail & Related papers (2024-06-14T08:49:41Z)
MMT: Mutation Testing of Java Bytecode with Model Transformation -- An Illustrative Demonstration [0.11470070927586014]
mutation testing is an approach to check the robustness of test suites. We propose a model-driven approach where mutations of Java bytecode can be flexibly defined by model transformation. The corresponding tool called MMT has been extended with advanced mutation operators for modifying object-oriented structures.
arXiv Detail & Related papers (2024-04-22T11:33:21Z)
An Empirical Evaluation of Manually Created Equivalent Mutants [54.02049952279685]
Less than 10 % of manually created mutants are equivalent. Surprisingly, our findings indicate that a significant portion of developers struggle to accurately identify equivalent mutants.
arXiv Detail & Related papers (2024-04-14T13:04:10Z)
DebugBench: Evaluating Debugging Capability of Large Language Models [80.73121177868357]
DebugBench is a benchmark for Large Language Models (LLMs) It covers four major bug categories and 18 minor types in C++, Java, and Python. We evaluate two commercial and four open-source models in a zero-shot scenario.
arXiv Detail & Related papers (2024-01-09T15:46:38Z)
Contextual Predictive Mutation Testing [17.832774161583036]
We introduce MutationBERT, an approach for predictive mutation testing that simultaneously encodes the source method mutation and test method. Thanks to its higher precision, MutationBERT saves 33% of the time spent by a prior approach on checking/verifying live mutants. We validate our input representation, and aggregation approaches for lifting predictions from the test matrix level to the test suite level, finding similar improvements in performance.
arXiv Detail & Related papers (2023-09-05T17:00:15Z)
Effective Test Generation Using Pre-trained Large Language Models and Mutation Testing [13.743062498008555]
We introduce MuTAP for improving the effectiveness of test cases generated by Large Language Models (LLMs) in terms of revealing bugs. MuTAP is capable of generating effective test cases in the absence of natural language descriptions of the Program Under Test (PUTs) Our results show that our proposed method is able to detect up to 28% more faulty human-written code snippets.
arXiv Detail & Related papers (2023-08-31T08:48:31Z)
Fuzzing for CPS Mutation Testing [3.512722797771289]
We propose a mutation testing approach that leverages fuzz testing, which has proved effective with C and C++ software. Our empirical evaluation shows that mutation testing based on fuzz testing kills a significantly higher proportion of live mutants than symbolic execution.
arXiv Detail & Related papers (2023-08-15T16:35:31Z)
FacTool: Factuality Detection in Generative AI -- A Tool Augmented Framework for Multi-Task and Multi-Domain Scenarios [87.12753459582116]
A wider range of tasks now face an increasing risk of containing factual errors when handled by generative models. We propose FacTool, a task and domain agnostic framework for detecting factual errors of texts generated by large language models.
arXiv Detail & Related papers (2023-07-25T14:20:51Z)
A Static Evaluation of Code Completion by Large Language Models [65.18008807383816]
Execution-based benchmarks have been proposed to evaluate functional correctness of model-generated code on simple programming problems. static analysis tools such as linters, which can detect errors without running the program, haven't been well explored for evaluating code generation models. We propose a static evaluation framework to quantify static errors in Python code completions, by leveraging Abstract Syntax Trees.
arXiv Detail & Related papers (2023-06-05T19:23:34Z)
Teaching Large Language Models to Self-Debug [62.424077000154945]
Large language models (LLMs) have achieved impressive performance on code generation. We propose Self- Debugging, which teaches a large language model to debug its predicted program via few-shot demonstrations.
arXiv Detail & Related papers (2023-04-11T10:43:43Z)
Fault-Aware Neural Code Rankers [64.41888054066861]
We propose fault-aware neural code rankers that can predict the correctness of a sampled program without executing it. Our fault-aware rankers can significantly increase the pass@1 accuracy of various code generation models.
arXiv Detail & Related papers (2022-06-04T22:01:05Z)
DeepMutants: Training neural bug detectors with contextual mutations [0.799536002595393]
Learning-based bug detectors promise to find bugs in large code bases by exploiting natural hints. Still, existing techniques tend to underperform when presented with realistic bugs. We propose a novel contextual mutation operator which dynamically injects natural and more realistic faults into code.
arXiv Detail & Related papers (2021-07-14T12:45:48Z)

This list is automatically generated from the titles and abstracts of the papers in this site.