Related papers: ATGen: Adversarial Reinforcement Learning for Test Case Generation

URL: http://arxiv.org/abs/2510.14635v1
Date: Thu, 16 Oct 2025 12:49:25 GMT
Title: ATGen: Adversarial Reinforcement Learning for Test Case Generation
Authors: Qingyao Li, Xinyi Dai, Weiwen Liu, Xiangyang Li, Yasheng Wang, Ruiming Tang, Yong Yu, Weinan Zhang
Abstract summary: Large Language Models (LLMs) excel at code generation, yet their outputs often contain subtle bugs. Existing test generation methods rely on static datasets. We introduce ATGen, a framework that trains a test case generator via adversarial reinforcement learning.
Score: 78.48498301767079
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) excel at code generation, yet their outputs often contain subtle bugs, for which effective test cases are a critical bottleneck. Existing test generation methods, whether based on prompting or supervised fine-tuning, rely on static datasets. This imposes a "fixed-difficulty ceiling", fundamentally limiting their ability to uncover novel or more complex bugs beyond their training scope. To overcome this, we introduce ATGen, a framework that trains a test case generator via adversarial reinforcement learning. ATGen pits a test generator against an adversarial code generator that continuously crafts harder bugs to evade the current policy. This dynamic loop creates a curriculum of increasing difficulty that challenges the current policy. The test generator is optimized via Reinforcement Learning (RL) to jointly maximize "Output Accuracy" and "Attack Success", enabling it to learn a progressively stronger policy that breaks the fixed-difficulty ceiling of static training. Extensive experiments demonstrate that ATGen significantly outperforms state-of-the-art baselines. We further validate its practical utility, showing it serves as both a more effective filter for Best-of-N inference and a higher-quality reward source for training code generation models. Our work establishes a new, dynamic paradigm for improving the reliability of LLM-generated code.

Related papers

MIST-RL: Mutation-based Incremental Suite Testing via Reinforcement Learning [19.054149750597933]
MIST-RL (Mutation-based Incremental Suite Testing via Reinforcement Learning) is a framework that shifts the focus to "scaling-by-utility". We introduce a novel incremental mutation reward combined with dynamic penalties, which incentivizes the model to discover new faults while it suppresses functionally equivalent assertions. Experiments on HumanEval+ and MBPP+ demonstrate that MIST-RL outperforms state-of-the-art baselines.
arXiv Detail & Related papers (2026-03-02T03:22:44Z)
TAROT: Test-driven and Capability-adaptive Curriculum Reinforcement Fine-tuning for Code Generation with Large Language Models [26.385183692191873]
Large Language Models (LLMs) are changing the coding paradigm, yet synthesizing algorithmically sophisticated and robust code still remains a critical challenge. We propose Test-driven and cApability-adaptive cuRriculum reinfOrcement fineTuning (TAROT) to address this need.
arXiv Detail & Related papers (2026-02-17T09:29:18Z)
CVeDRL: An Efficient Code Verifier via Difficulty-aware Reinforcement Learning [57.24524263804788]
Code verifiers play a critical role in post-verification for LLM-based code generation. Existing supervised fine-tuning methods suffer from data scarcity, high failure rates, and poor inference efficiency. We show that naive RL with only functionality rewards fails to generate effective unit tests for difficult branches and samples.
arXiv Detail & Related papers (2026-01-30T10:33:29Z)
RLAC: Reinforcement Learning with Adversarial Critic for Free-Form Generation Tasks [75.52891348667491]
Open-ended generation tasks require outputs to satisfy diverse and often implicit task-specific evaluation rubrics. The sheer number of relevant rubrics leads to prohibitively high verification costs and incomplete assessments of a response. We propose Reinforcement Learning with Adversarial Critic (RLAC), a post-training approach that addresses these challenges via dynamic rubric verification.
arXiv Detail & Related papers (2025-11-03T17:15:05Z)
Alignment with Fill-In-the-Middle for Enhancing Code Generation [56.791415642365415]
We propose a novel approach that splits code snippets into smaller, granular blocks, creating more diverse DPO pairs from the same test cases. Our approach demonstrates significant improvements in code generation tasks, as validated by experiments on benchmark datasets such as HumanEval (+), MBPP (+), APPS, LiveCodeBench, and BigCodeBench.
arXiv Detail & Related papers (2025-08-27T03:15:53Z)
VERIRL: Boosting the LLM-based Verilog Code Generation via Reinforcement Learning [32.974199255760944]
We introduce a reinforcement learning framework tailored for Verilog code generation. To tackle the problem of sparse and noisy reward signals, we propose a Trace-back based Rescore mechanism. To mitigate catastrophic forgetting and overfitting during RL fine-tuning, we introduce a sample-balanced weighting strategy.
arXiv Detail & Related papers (2025-08-25T20:20:44Z)
Learning to Solve and Verify: A Self-Play Framework for Code and Test Generation [69.62857948698436]
Recent advances in large language models (LLMs) have improved their performance on coding benchmarks. However, improvement is plateauing due to the exhaustion of readily available high-quality data. We propose Sol-Ver, a self-play solver-verifier framework that jointly improves a single model's code and test generation capacity.
arXiv Detail & Related papers (2025-02-20T18:32:19Z)
GenX: Mastering Code and Test Generation with Execution Feedback [7.225594526057816]
We propose a novel approach that concurrently trains a code generation model and a test generation model. We introduce two strategies for test and code data augmentation and a new scoring function for code and test ranking. The results demonstrate that our models, when iteratively trained with an increasing number of test cases and code solutions, outperform those trained on the original dataset.
arXiv Detail & Related papers (2024-12-18T03:18:21Z)
SWT-Bench: Testing and Validating Real-World Bug-Fixes with Code Agents [10.730852617039451]
We investigate the capability of LLM-based Code Agents to formalize user issues into test cases. We propose a novel benchmark based on popular GitHub repositories, containing real-world issues, ground-truth bug-fixes, and golden tests. We find that LLMs generally perform surprisingly well at generating relevant test cases, with Code Agents designed for code repair exceeding the performance of systems designed for test generation.
arXiv Detail & Related papers (2024-06-18T14:54:37Z)
Reinforcement Learning from Automatic Feedback for High-Quality Unit Test Generation [12.503002900186997]
Large Language Models (LLMs) have gained popularity for code generation, including the automated creation of test cases. LLMs are often trained on vast amounts of publicly available code, which may include test cases that do not adhere to best practices. We propose a novel technique called Reinforcement Learning from Static Quality Metrics (RLSQM).
arXiv Detail & Related papers (2023-10-03T18:48:31Z)
Text Generation with Efficient (Soft) Q-Learning [91.47743595382758]
Reinforcement learning (RL) offers a more flexible solution by allowing users to plug in arbitrary task metrics as reward. We introduce a new RL formulation for text generation from the soft Q-learning perspective. We apply the approach to a wide range of tasks, including learning from noisy/negative examples, adversarial attacks, and prompt generation.
arXiv Detail & Related papers (2021-06-14T18:48:40Z)

This list is automatically generated from the titles and abstracts of the papers on this site.