How Many Code and Test Cases Are Enough? Evaluating Test Cases Generation from a Binary-Matrix Perspective
- URL: http://arxiv.org/abs/2510.08720v1
- Date: Thu, 09 Oct 2025 18:29:24 GMT
- Title: How Many Code and Test Cases Are Enough? Evaluating Test Cases Generation from a Binary-Matrix Perspective
- Authors: Xianzhen Luo, Jinyang Huang, Wenzhen Zheng, Qingfu Zhu, Mingzheng Xu, Yiheng Xu, Yuantao Fan, Libo Qin, Wanxiang Che,
- Abstract summary: evaluating test cases automatically generated by Large Language Models (LLMs) is a critical yet challenging task.<n>Existing benchmarks suffer from high computational costs, score inflation, and a bias towards trivial bugs over rare, critical faults.<n>We introduce a framework that formalizes benchmark construction as finding an optimal diagnostic basis in a binary code-test matrix.
- Score: 51.30005925128432
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Evaluating test cases automatically generated by Large Language Models (LLMs) is a critical yet challenging task. Existing benchmarks suffer from high computational costs, score inflation, and a bias towards trivial bugs over rare, critical faults. In this work, we ask two fundamental questions: (1) What is the minimal set of wrong codes sufficient to represent the entire error space? and (2) What is the minimal set of test cases needed to distinguish them? We introduce a framework that formalizes benchmark construction as finding an optimal diagnostic basis in a binary code-test matrix. The rank of this matrix specifies the minimal number of independent error patterns (wrong codes) and provides a tight upper bound on the number of test cases required for complete fault coverage. Our objective is to identify a basis of size equal to the matrix rank that maximizes internal diversity. To tackle this NP-hard problem, we propose WrongSelect, an efficient approximation algorithm to select maximally diverse wrong codes. Applying this framework to millions of competitive programming submissions, we construct TC-Bench, a compact, diverse, and inflation-resistant benchmark. Extensive experiments show that even the most advanced test case generation methods achieve only ~60% exclusion rates on TC-Bench, exposing a significant gap in their diagnostic power. Our dataset is available at: https://huggingface.co/datasets/Luoberta/TC-Bench and our code is at: https://github.com/Luowaterbi/TC-Bench.
Related papers
- CodeContests-O: Powering LLMs via Feedback-Driven Iterative Test Case Generation [71.42965967582147]
Existing approaches attempt to synthesize test cases using Large Language Models (LLMs)<n>We propose a $textbfFeedback-Bench Iterative Framework$ for comprehensive test case construction.<n>Our dataset achieves an average True Positive Rate (TPR) of $89.37%$ and True Negative Rate (TNR) of $90.89%$, significantly outperforming the CodeContests and CodeContests+ by margins of $4.32%$ and $9.37%$, respectively.
arXiv Detail & Related papers (2026-01-20T07:32:44Z) - Quantum-Guided Test Case Minimization for LLM-Based Code Generation [0.42970700836450487]
We introduce a framework based on Test-Driven Development (TDD) that transforms code specification into an engineering optimization task.<n>The framework first prompts an LLM to generate a test suite, then formulates the Test Case Minimization (TCM) problem as a Quadratic Unconstrained Binary Optimization (QUBO) model.<n>This work demonstrates a powerful synergy between generative AI and optimization in software engineering, highlighting the critical importance of precise model formulation.
arXiv Detail & Related papers (2025-11-19T18:00:44Z) - Benchmark Designers Should "Train on the Test Set" to Expose Exploitable Non-Visual Shortcuts [49.99400612296149]
We find that models can ace many benchmarks without strong visual understanding.<n>This is especially problematic for vision-centric benchmarks that are meant to require visual inputs.<n>We adopt a diagnostic principle for benchmark design: if a benchmark can be gamed, it will be gamed.
arXiv Detail & Related papers (2025-11-06T18:43:21Z) - SciML Agents: Write the Solver, Not the Solution [69.5021018644143]
We introduce two new datasets: a diagnostic dataset of adversarial "misleading" problems; and a large-scale benchmark of 1,000 diverse ODE tasks.<n>We evaluate open- and closed-source LLM models along two axes: (i) unguided versus guided prompting with domain-specific knowledge; and (ii) off-the-shelf versus fine-tuned variants.<n>Preliminary results indicate that careful prompting and fine-tuning can yield a specialized LLM agent capable of reliably solving simple ODE problems.
arXiv Detail & Related papers (2025-09-12T02:53:57Z) - KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding [49.56049319037421]
KodCode is a synthetic dataset that addresses the persistent challenge of acquiring high-quality, verifiable training data.<n>It comprises question-solution-test triplets that are systematically validated via a self-verification procedure.<n>This pipeline yields a large-scale, robust and diverse coding dataset.
arXiv Detail & Related papers (2025-03-04T19:17:36Z) - Simple and Provable Scaling Laws for the Test-Time Compute of Large Language Models [70.07661254213181]
We propose two algorithms that enjoy provable scaling laws for the test-time compute of large language models.<n>One is a two-stage knockout-style algorithm, where each candidate is evaluated by its average win rate against multiple opponents.<n>The other is a two-stage league-style algorithm, where each candidate is evaluated by its average win rate against multiple opponents.
arXiv Detail & Related papers (2024-11-29T05:29:47Z) - B4: Towards Optimal Assessment of Plausible Code Solutions with Plausible Tests [16.19318541132026]
We show that within a Bayesian framework, the optimal selection strategy can be defined based on the posterior probability of the observed passing states between solutions and tests.
We propose an efficient approach for approximating this optimal (yet uncomputable) strategy, where the approximation error is bounded by the correctness of prior knowledge.
arXiv Detail & Related papers (2024-09-13T10:22:08Z) - NaturalCodeBench: Examining Coding Performance Mismatch on HumanEval and Natural User Prompts [31.783388267874738]
We propose NaturalCodeBench (NCB), a challenging code benchmark designed to mirror the complexity and variety of scenarios in real coding tasks.
NCB comprises 402 high-quality problems in Python and Java, meticulously selected from natural user queries from online coding services.
Our systematic experiments on 39 LLMs find that performance gaps on NCB between models with close HumanEval scores could still be significant.
arXiv Detail & Related papers (2024-05-07T17:52:51Z) - Test2Vec: An Execution Trace Embedding for Test Case Prioritization [12.624724734296342]
Execution traces of test cases can be a good alternative to abstract their behavior for automated testing tasks.
We propose a novel embedding approach, Test2Vec, that maps test execution traces to a latent space.
Results show that our proposed TP improves best alternatives by 41.80% in terms of the median normalized rank of the first failing test case.
arXiv Detail & Related papers (2022-06-28T20:38:36Z) - Measuring Coding Challenge Competence With APPS [54.22600767666257]
We introduce APPS, a benchmark for code generation.
Our benchmark includes 10,000 problems, which range from having simple one-line solutions to being substantial algorithmic challenges.
Recent models such as GPT-Neo can pass approximately 15% of the test cases of introductory problems.
arXiv Detail & Related papers (2021-05-20T17:58:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.