Klear-CodeTest: Scalable Test Case Generation for Code Reinforcement Learning
- URL: http://arxiv.org/abs/2508.05710v2
- Date: Thu, 11 Sep 2025 02:44:37 GMT
- Title: Klear-CodeTest: Scalable Test Case Generation for Code Reinforcement Learning
- Authors: Jia Fu, Xinyu Yang, Hongzhi Zhang, Yahui Liu, Jingyuan Zhang, Qi Wang, Fuzheng Zhang, Guorui Zhou
- Abstract summary: We present Klear-CodeTest, a comprehensive test case synthesis framework featuring rigorous verification to ensure quality and reliability of test cases. The proposed G-V framework generates comprehensive test cases including both regular and corner cases, enhancing test coverage and discriminative power for solution correctness assessment. In addition, we design a multi-layered security sandbox system optimized for online verification platforms, guaranteeing safe and reliable code execution.
- Score: 43.30900834053253
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Precise, correct feedback is crucial for effectively training large language models (LLMs) in code reinforcement learning. However, synthesizing high-quality test cases remains a profoundly challenging and unsolved problem. In this work, we present Klear-CodeTest, a comprehensive test case synthesis framework featuring rigorous verification to ensure quality and reliability of test cases. Our approach achieves broad coverage of programming problems via a novel Generator-Validation (G-V) framework, ensuring correctness through a consistency validation mechanism that verifies outputs against gold solutions. The proposed G-V framework generates comprehensive test cases including both regular and corner cases, enhancing test coverage and discriminative power for solution correctness assessment in code reinforcement learning. In addition, we design a multi-layered security sandbox system optimized for online verification platforms, guaranteeing safe and reliable code execution. Through comprehensive experiments, we demonstrate the effectiveness of our curated dataset, showing significant improvements in model performance and training stability. The source codes, curated dataset and sandbox system are available at: https://github.com/Kwai-Klear/CodeTest.
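The consistency validation idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each gold solution is available as a plain Python callable that maps a test input string to an output string, and that a generated input is kept only when every gold solution runs without error and all of them agree on the output. The names `validate_inputs`, `gold_a`, and `gold_b` are hypothetical; a real pipeline would presumably run the solutions as sandboxed processes with time and memory limits.

```python
from typing import Callable, List, Sequence, Tuple

# Assumption: a "solution" maps a raw test input to the program's output text.
Solution = Callable[[str], str]


def validate_inputs(
    candidate_inputs: Sequence[str],
    gold_solutions: Sequence[Solution],
) -> List[Tuple[str, str]]:
    """Keep (input, output) pairs on which all gold solutions run and agree."""
    validated: List[Tuple[str, str]] = []
    for test_input in candidate_inputs:
        outputs = set()
        try:
            for solve in gold_solutions:
                outputs.add(solve(test_input).strip())
        except Exception:
            # A gold solution crashed, so the generated input is treated as invalid.
            continue
        if len(outputs) == 1:
            # Every gold solution produced the same output: accept the test case.
            validated.append((test_input, outputs.pop()))
    return validated


# Two hypothetical gold solutions for "print the sum of the integers on one line".
def gold_a(raw: str) -> str:
    return str(sum(int(tok) for tok in raw.split()))


def gold_b(raw: str) -> str:
    total = 0
    for tok in raw.split():
        total += int(tok)
    return str(total)


if __name__ == "__main__":
    generated = ["1 2 3", "1 x", "-5 5"]  # "1 x" is a malformed input
    print(validate_inputs(generated, [gold_a, gold_b]))
    # prints [('1 2 3', '6'), ('-5 5', '0')]
```

Dropping inputs on which the gold solutions crash or disagree trades some coverage for label reliability, which is the property that matters most when the test cases are used as a reward signal.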
Related papers
- CodeHacker: Automated Test Case Generation for Detecting Vulnerabilities in Competitive Programming Solutions [8.163435280190027]
Existing benchmarks often lack coverage for subtle corner cases, allowing incorrect solutions to pass. CodeHacker generates adversarial test cases that expose latent vulnerabilities in program submissions. Experiments demonstrate that CodeHacker significantly improves the True Negative Rate (TNR) of existing datasets.
arXiv Detail & Related papers (2026-02-23T05:59:30Z)
- Consistency Meets Verification: Enhancing Test Generation Quality in Large Language Models Without Ground-Truth Solutions [1.9196411948992402]
We present ConVerTest, a novel two-stage pipeline for synthesizing reliable tests without requiring prior code implementations. Experiments on BIGCODEBENCH and LESS BASIC PYTHON PROBLEMS benchmarks demonstrate that ConVerTest improves test validity, line coverage, and mutation scores by up to 39%, 28%, and 18% respectively.
arXiv Detail & Related papers (2026-02-11T04:40:38Z)
- Secure Code Generation via Online Reinforcement Learning with Vulnerability Reward Model [60.60587869092729]
Large language models (LLMs) are increasingly used in software development, yet their tendency to generate insecure code remains a major barrier to real-world deployment. We propose SecCoderX, an online reinforcement learning framework for functionality-preserving secure code generation.
arXiv Detail & Related papers (2026-02-07T07:42:07Z)
- ATGen: Adversarial Reinforcement Learning for Test Case Generation [78.48498301767079]
Large Language Models (LLMs) excel at code generation, yet their outputs often contain subtle bugs. Existing test generation methods rely on static datasets. We introduce ATGen, a framework that trains a test case generator via adversarial reinforcement learning.
arXiv Detail & Related papers (2025-10-16T12:49:25Z)
- Rethinking Verification for LLM Code Generation: From Generation to Testing [44.46778801679273]
Large language models (LLMs) have recently achieved notable success in code-generation benchmarks such as HumanEval and LiveCodeBench. We propose new multi-dimensional metrics designed to rigorously quantify test-suite thoroughness. Experiments show that SAGA achieves a detection rate of 90.62% and a verifier accuracy of 32.58% on TCGBench.
arXiv Detail & Related papers (2025-07-09T14:58:47Z)
- CodeContests+: High-Quality Test Case Generation for Competitive Programming [14.602111331209203]
We introduce an agent system that creates high-quality test cases for competitive programming problems. We apply this system to the CodeContests dataset and propose a new version with improved test cases, named CodeContests+. The results indicate that CodeContests+ achieves significantly higher accuracy than CodeContests, particularly with a notably higher True Positive Rate (TPR).
arXiv Detail & Related papers (2025-06-06T07:29:01Z)
- Training Language Models to Generate Quality Code with Program Analysis Feedback [66.0854002147103]
Code generation with large language models (LLMs) is increasingly adopted in production but fails to ensure code quality. We propose REAL, a reinforcement learning framework that incentivizes LLMs to generate production-quality code.
arXiv Detail & Related papers (2025-05-28T17:57:47Z)
- KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding [49.56049319037421]
KodCode is a synthetic dataset that addresses the persistent challenge of acquiring high-quality, verifiable training data. It comprises question-solution-test triplets that are systematically validated via a self-verification procedure. This pipeline yields a large-scale, robust and diverse coding dataset.
arXiv Detail & Related papers (2025-03-04T19:17:36Z)
- CodeDPO: Aligning Code Models with Self Generated and Verified Source Code [52.70310361822519]
We propose CodeDPO, a framework that integrates preference learning into code generation to improve two key code preference factors: code correctness and efficiency. CodeDPO employs a novel dataset construction method, utilizing a self-generation-and-validation mechanism that simultaneously generates and evaluates code and test cases.
arXiv Detail & Related papers (2024-10-08T01:36:15Z)
- SWT-Bench: Testing and Validating Real-World Bug-Fixes with Code Agents [10.730852617039451]
We investigate the capability of LLM-based Code Agents to formalize user issues into test cases. We propose a novel benchmark based on popular GitHub repositories, containing real-world issues, ground-truth bug-fixes, and golden tests. We find that LLMs generally perform surprisingly well at generating relevant test cases, with Code Agents designed for code repair exceeding the performance of systems designed for test generation.
arXiv Detail & Related papers (2024-06-18T14:54:37Z)
- CodeT: Code Generation with Generated Tests [49.622590050797236]
We explore the use of pre-trained language models to automatically generate test cases. CodeT executes the code solutions using the generated test cases, and then chooses the best solution. We evaluate CodeT on five different pre-trained models with both HumanEval and MBPP benchmarks.
arXiv Detail & Related papers (2022-07-21T10:18:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.