HardTests: Synthesizing High-Quality Test Cases for LLM Coding
- URL: http://arxiv.org/abs/2505.24098v1
- Date: Fri, 30 May 2025 01:00:34 GMT
- Title: HardTests: Synthesizing High-Quality Test Cases for LLM Coding
- Authors: Zhongmou He, Yee Man Choi, Kexun Zhang, Jiabao Ji, Junting Zhou, Dejia Xu, Ivan Bercovich, Aidan Zhang, Lei Li
- Abstract summary: Verifiers play a crucial role in large language model (LLM) reasoning, needed by post-training techniques such as reinforcement learning. We propose HARDTESTGEN, a pipeline for high-quality test synthesis using LLMs.
- Score: 14.561428626993326
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Verifiers play a crucial role in large language model (LLM) reasoning, needed by post-training techniques such as reinforcement learning. However, reliable verifiers are hard to get for difficult coding problems, because a well-disguised wrong solution may only be detected by carefully human-written edge cases that are difficult to synthesize. To address this issue, we propose HARDTESTGEN, a pipeline for high-quality test synthesis using LLMs. With this pipeline, we curate a comprehensive competitive programming dataset HARDTESTS with 47k problems and synthetic high-quality tests. Compared with existing tests, HARDTESTGEN tests demonstrate precision that is 11.3 percentage points higher and recall that is 17.5 percentage points higher when evaluating LLM-generated code. For harder problems, the improvement in precision can be as large as 40 points. HARDTESTS also proves to be more effective for model training, measured by downstream code generation performance. We will open-source our dataset and synthesis pipeline at https://leililab.github.io/HardTests/.
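The abstract reports test quality as precision and recall when evaluating LLM-generated code, but does not spell out the definitions. The sketch below assumes the usual classification reading: a "positive" is a program that passes the tests, and ground truth is whether the program is actually correct. The `Verdict` type, the `precision_recall` function, and the sample data are hypothetical illustrations, not the paper's evaluation code.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    # Hypothetical record for one candidate program.
    truly_correct: bool  # assumed ground-truth label (e.g. from a trusted judge)
    passes_tests: bool   # whether the program passes the test suite under evaluation


def precision_recall(verdicts):
    """Return (precision, recall) of a test suite treated as a binary classifier.

    Assumed definitions (not stated in the abstract):
      precision = TP / (TP + FP): of the programs that pass, how many are correct
      recall    = TP / (TP + FN): of the correct programs, how many pass
    """
    tp = sum(v.truly_correct and v.passes_tests for v in verdicts)
    fp = sum((not v.truly_correct) and v.passes_tests for v in verdicts)
    fn = sum(v.truly_correct and (not v.passes_tests) for v in verdicts)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up sample: 3 true positives, 1 false positive, 1 false negative, 1 true negative.
sample = (
    [Verdict(True, True)] * 3
    + [Verdict(False, True)]
    + [Verdict(True, False)]
    + [Verdict(False, False)]
)
precision, recall = precision_recall(sample)
```

Under these assumptions, weak tests mainly hurt precision, because a well-disguised wrong solution slips through as a false positive. Tests with invalid inputs or wrong expected outputs mainly hurt recall, because they reject correct programs.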
Related papers
- CodeContests+: High-Quality Test Case Generation for Competitive Programming [14.602111331209203]
  We introduce an agent system that creates high-quality test cases for competitive programming problems. We apply this system to the CodeContests dataset and propose a new version with improved test cases, named CodeContests+. The results indicate that CodeContests+ achieves significantly higher accuracy than CodeContests, particularly with a notably higher True Positive Rate (TPR)
  (2025-06-06T07:29:01Z)
- KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding [49.56049319037421]
  KodCode is a synthetic dataset that addresses the persistent challenge of acquiring high-quality, verifiable training data. It comprises question-solution-test triplets that are systematically validated via a self-verification procedure. This pipeline yields a large-scale, robust and diverse coding dataset.
  (2025-03-04T19:17:36Z)
- Learning to Solve and Verify: A Self-Play Framework for Code and Test Generation [69.62857948698436]
  Recent advances in large language models (LLMs) have improved their performance on coding benchmarks. However, improvement is plateauing due to the exhaustion of readily available high-quality data. We propose Sol-Ver, a self-play solver-verifier framework that jointly improves a single model's code and test generation capacity.
  (2025-02-20T18:32:19Z)
- UnitCoder: Scalable Iterative Code Synthesis with Unit Test Guidance [65.01483640267885]
  Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks, yet code generation remains a major challenge. We introduce UnitCoder, a systematic pipeline leveraging model-generated unit tests to guide and validate the code generation process. Our work presents a scalable approach that leverages model-generated unit tests to guide the synthesis of high-quality code data from pre-training corpora.
  (2025-02-17T05:37:02Z)
- CLOVER: A Test Case Generation Benchmark with Coverage, Long-Context, and Verification [71.34070740261072]
  This paper presents a benchmark, CLOVER, to evaluate models' capabilities in generating and completing test cases. The benchmark is containerized for code execution across tasks, and we will release the code, data, and construction methodologies.
  (2025-02-12T21:42:56Z)
- CoT-based Synthesizer: Enhancing LLM Performance through Answer Synthesis [31.953858122298517]
  We propose a novel inference scaling strategy, CoT-based Synthesizer. It synthesizes superior answers by analyzing complementary information from multiple candidate responses. We show that our method significantly enhances performance, with gains of 11.8% for Llama3-8B and 10.3% for GPT-4o.
  (2025-01-03T06:50:06Z)
- Measuring the Influence of Incorrect Code on Test Generation [22.168699378889148]
  We show that tests generated for incorrect code experience a 47% worse bug detection rate. Improvements of +18% in accuracy, +4% coverage, and +34% in bug detection can be achieved by providing natural language code descriptions.
  (2024-09-14T15:17:34Z)
- Training on the Benchmark Is Not All You Need [52.01920740114261]
  We propose a simple and effective data leakage detection method based on the contents of multiple-choice options. Our method is able to work under gray-box conditions without access to model training data or weights. We evaluate the degree of data leakage of 35 mainstream open-source LLMs on four benchmark datasets.
  (2024-09-03T11:09:44Z)
- Large-scale, Independent and Comprehensive study of the power of LLMs for test case generation [11.517293765116307]
  Unit testing is essential for software reliability, yet manual test creation is time-consuming and often neglected. This study presents the first large-scale empirical evaluation of LLM-generated unit tests at the class level.
  (2024-06-28T20:38:41Z)
- Large Language Models as Test Case Generators: Performance Evaluation and Enhancement [3.5398126682962587]
  We study how well Large Language Models can generate high-quality test cases. We propose a multi-agent framework called TestChain that decouples the generation of test inputs and test outputs. Our results indicate that TestChain outperforms the baseline by a large margin.
  (2024-04-20T10:27:01Z)
- CodeT: Code Generation with Generated Tests [49.622590050797236]
  We explore the use of pre-trained language models to automatically generate test cases. CodeT executes the code solutions using the generated test cases, and then chooses the best solution. We evaluate CodeT on five different pre-trained models with both HumanEval and MBPP benchmarks.
  (2022-07-21T10:18:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.