Related papers: Test vs Mutant: Adversarial LLM Agents for Robust Unit Test Generation

Test vs Mutant: Adversarial LLM Agents for Robust Unit Test Generation

URL: http://arxiv.org/abs/2602.08146v2
Date: Tue, 10 Feb 2026 04:06:28 GMT
Title: Test vs Mutant: Adversarial LLM Agents for Robust Unit Test Generation
Authors: Pengyu Chang, Yixiong Fang, Silin Chen, Yuling Shi, Beijun Shen, Xiaodong Gu,
Abstract summary: Large language model (LLM)-based methods generate more human-readable tests but often suffer from low coverage and compilability.<n>We propose AdverTest, a novel adversarial framework for LLM-powered test case generation.<n>We show that our approach improves fault detection rates by 8.56% over the best existing LLM-based methods and by 63.30% over EvoSuite.
Score: 9.439427795905637
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Software testing is a critical, yet resource-intensive phase of the software development lifecycle. Over the years, various automated tools have been developed to aid in this process. Search-based approaches typically achieve high coverage but produce tests with low readability, whereas large language model (LLM)-based methods generate more human-readable tests but often suffer from low coverage and compilability. While the majority of research efforts have focused on improving test coverage and readability, little attention has been paid to enhancing the robustness of bug detection, particularly in exposing corner cases and vulnerable execution paths. To address this gap, we propose AdverTest, a novel adversarial framework for LLM-powered test case generation. AdverTest comprises two interacting agents: a test case generation agent (T) and a mutant generation agent (M). These agents engage in an adversarial loop, where M persistently creates new mutants "hacking" the blind spots of T's current test suite, while T iteratively refines its test cases to "kill" the challenging mutants produced by M. This interaction loop is guided by both coverage and mutation scores, enabling the system to co-evolve toward both high test coverage and bug detection capability. Experimental results in the Defects4J dataset show that our approach improves fault detection rates by 8.56% over the best existing LLM-based methods and by 63.30% over EvoSuite, while also improving line and branch coverage.

Related papers

Scaling Agentic Verifier for Competitive Coding [66.11758166379092]
Large language models (LLMs) have demonstrated strong coding capabilities but still struggle to solve competitive programming problems correctly in a single attempt.<n>Execution-based re-ranking offers a promising test-time scaling strategy, yet existing methods are constrained by either difficult test case generation or inefficient random input sampling.<n>We propose Agentic Verifier, an execution-based agent that actively reasons about program behaviors and searches for highly discriminative test inputs.
arXiv Detail & Related papers (2026-02-04T06:30:40Z)
DoVer: Intervention-Driven Auto Debugging for LLM Multi-Agent Systems [48.971606069204825]
DoVer is an intervention-driven debug framework for large language model (LLM)-based multi-agent systems.<n>It augments hypothesis generation with active verification through targeted interventions.<n>DoVer flips 18-28% of failed trials into successes, achieves up to 16% milestone progress, and validates or refutes 30-60% of failure hypotheses.
arXiv Detail & Related papers (2025-12-07T09:23:48Z)
ATGen: Adversarial Reinforcement Learning for Test Case Generation [78.48498301767079]
Large Language Models (LLMs) excel at code generation, yet their outputs often contain subtle bugs.<n>Existing test generation methods rely on static datasets.<n>We introduce ATGen, a framework that trains a test case generator via adversarial reinforcement learning.
arXiv Detail & Related papers (2025-10-16T12:49:25Z)
DetectAnyLLM: Towards Generalizable and Robust Detection of Machine-Generated Text Across Domains and Models [60.713908578319256]
We propose Direct Discrepancy Learning (DDL) to optimize the detector with task-oriented knowledge.<n>Built upon this, we introduce DetectAnyLLM, a unified detection framework that achieves state-of-the-art MGTD performance.<n>MIRAGE samples human-written texts from 10 corpora across 5 text-domains, which are then re-generated or revised using 17 cutting-edge LLMs.
arXiv Detail & Related papers (2025-09-15T10:59:57Z)
LLAMA: Multi-Feedback Smart Contract Fuzzing Framework with LLM-Guided Seed Generation [56.84049855266145]
We propose a Multi-feedback Smart Contract Fuzzing framework (LLAMA) that integrates evolutionary mutation strategies, and hybrid testing techniques.<n>LLAMA achieves 91% instruction coverage and 90% branch coverage, while detecting 132 out of 148 known vulnerabilities.<n>These results highlight LLAMA's effectiveness, adaptability, and practicality in real-world smart contract security testing scenarios.
arXiv Detail & Related papers (2025-07-16T09:46:58Z)
PALM: Synergizing Program Analysis and LLMs to Enhance Rust Unit Test Coverage [14.702182387149547]
This paper presents PALM, an approach that leverages large language models (LLMs) to enhance the generation of high-coverage unit tests.<n> PALM performs program analysis to identify branching conditions within functions, which are then combined into path constraints.<n>We implement the approach and evaluate it in 15 open-source Rust crates.
arXiv Detail & Related papers (2025-06-10T17:21:21Z)
Mutation-Guided Unit Test Generation with a Large Language Model [7.3946430511009735]
MUTGEN is a mutation-guided, LLM-based test generation approach.<n>It significantly outperforms both EvoSuite and vanilla prompt-based strategies in terms of mutation score.
arXiv Detail & Related papers (2025-06-03T14:47:22Z)
PRIMG : Efficient LLM-driven Test Generation Using Mutant Prioritization [0.0]
PRIMG (Prioritization and Refinement Integrated Mutation-driven Generation) is a novel framework for incremental and adaptive test case generation for Solidity smart contracts.<n> PRIMG integrates a mutation prioritization module, which employs a machine learning model trained on mutant subsumption graphs to predict the usefulness of surviving mutants.<n>The prioritization module consistently outperformed random mutant selection, enabling the generation of high-impact tests with reduced computational effort.
arXiv Detail & Related papers (2025-05-08T18:30:22Z)
TestART: Improving LLM-based Unit Testing via Co-evolution of Automated Generation and Repair Iteration [7.509927117191286]
Large language models (LLMs) have demonstrated remarkable capabilities in generating unit test cases.<n>We propose TestART, a novel unit test generation method.<n>TestART improves LLM-based unit testing via co-evolution of automated generation and repair iteration.
arXiv Detail & Related papers (2024-08-06T10:52:41Z)
Effective Test Generation Using Pre-trained Large Language Models and Mutation Testing [13.743062498008555]
We introduce MuTAP for improving the effectiveness of test cases generated by Large Language Models (LLMs) in terms of revealing bugs. MuTAP is capable of generating effective test cases in the absence of natural language descriptions of the Program Under Test (PUTs) Our results show that our proposed method is able to detect up to 28% more faulty human-written code snippets.
arXiv Detail & Related papers (2023-08-31T08:48:31Z)
MGTBench: Benchmarking Machine-Generated Text Detection [54.81446366272403]
This paper proposes the first benchmark framework for MGT detection against powerful large language models (LLMs) We show that a larger number of words in general leads to better performance and most detection methods can achieve similar performance with much fewer training samples. Our findings indicate that the model-based detection methods still perform well in the text attribution task.
arXiv Detail & Related papers (2023-03-26T21:12:36Z)
Noisy Adaptive Group Testing using Bayesian Sequential Experimental Design [63.48989885374238]
When the infection prevalence of a disease is low, Dorfman showed 80 years ago that testing groups of people can prove more efficient than testing people individually. Our goal in this paper is to propose new group testing algorithms that can operate in a noisy setting.
arXiv Detail & Related papers (2020-04-26T23:41:33Z)

This list is automatically generated from the titles and abstracts of the papers in this site.