Fugu-MT Paper Translation (Abstract): How Many Code and Test Cases Are Enough? Evaluating Test Cases Generation from a Binary-Matrix Perspective

Paper overview: How Many Code and Test Cases Are Enough? Evaluating Test Cases Generation from a Binary-Matrix Perspective

arxiv url: http://arxiv.org/abs/2510.08720v1
Date: Thu, 09 Oct 2025 18:29:24 GMT
Status: Translation complete
Last updated in system: 2025-10-14 00:38:47.488504
Title: How Many Code and Test Cases Are Enough? Evaluating Test Cases Generation from a Binary-Matrix Perspective
Title (reference translation): How Many Code and Test Cases Are Enough? Evaluating Test Case Generation from a Binary-Matrix Perspective
Authors: Xianzhen Luo, Jinyang Huang, Wenzhen Zheng, Qingfu Zhu, Mingzheng Xu, Yiheng Xu, Yuantao Fan, Libo Qin, Wanxiang Che
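Abstract summary: Evaluating test cases automatically generated by Large Language Models (LLMs) is a very difficult task. Existing benchmarks suffer from high computational costs, score inflation, and a bias towards trivial bugs over rare, critical faults. This paper proposes a framework that formalizes benchmark construction as finding an optimal diagnostic basis in a binary code-test matrix.
Reference score (independently computed attention score): 51.30005925128432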
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Evaluating test cases automatically generated by Large Language Models (LLMs) is a critical yet challenging task. Existing benchmarks suffer from high computational costs, score inflation, and a bias towards trivial bugs over rare, critical faults. In this work, we ask two fundamental questions: (1) What is the minimal set of wrong codes sufficient to represent the entire error space? and (2) What is the minimal set of test cases needed to distinguish them? We introduce a framework that formalizes benchmark construction as finding an optimal diagnostic basis in a binary code-test matrix. The rank of this matrix specifies the minimal number of independent error patterns (wrong codes) and provides a tight upper bound on the number of test cases required for complete fault coverage. Our objective is to identify a basis of size equal to the matrix rank that maximizes internal diversity. To tackle this NP-hard problem, we propose WrongSelect, an efficient approximation algorithm to select maximally diverse wrong codes. Applying this framework to millions of competitive programming submissions, we construct TC-Bench, a compact, diverse, and inflation-resistant benchmark. Extensive experiments show that even the most advanced test case generation methods achieve only ~60% exclusion rates on TC-Bench, exposing a significant gap in their diagnostic power. Our dataset is available at: https://huggingface.co/datasets/Luoberta/TC-Bench and our code is at: https://github.com/Luowaterbi/TC-Bench.
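Abstract (reference translation): Evaluating test cases automatically generated by Large Language Models (LLMs) is a critical yet challenging task. Existing benchmarks suffer from high computational costs, score inflation, and a bias towards trivial bugs over rare, critical faults. The paper asks two questions: (1) What is the minimal set of wrong codes sufficient to represent the entire error space? and (2) What is the minimal set of test cases needed to distinguish them? It proposes a framework that formalizes benchmark construction as finding an optimal diagnostic basis in a binary code-test matrix. The rank of this matrix specifies the minimal number of independent error patterns (wrong codes) and gives a tight upper bound on the number of test cases required for complete fault coverage. The objective is to identify a basis whose size equals the matrix rank and whose internal diversity is maximal. To tackle this NP-hard problem, the authors propose WrongSelect, an efficient approximation algorithm that selects maximally diverse wrong codes. Applying the framework to millions of competitive programming submissions, they construct TC-Bench, a compact, diverse, and inflation-resistant benchmark. Extensive experiments show that even the most advanced test case generation methods achieve only about 60% exclusion rates on TC-Bench, exposing a significant gap in their diagnostic power. The dataset is available at https://huggingface.co/datasets/Luoberta/TC-Bench and the code at https://github.com/Luowaterbi/TC-Bench.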
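
To make the binary-matrix view in the abstract concrete, here is a minimal sketch in Python. It is not the authors' WrongSelect algorithm, whose details the abstract does not give. The assumptions are all ours: rows are wrong codes, columns are test cases, an entry of 1 means the wrong code fails that test, rank is taken over the rationals, and diversity is measured by Hamming distance. The helper names (rank, hamming, greedy_diverse_basis) and the toy matrix are hypothetical.

```python
from fractions import Fraction


def rank(rows):
    """Rank of a 0/1 matrix over the rationals (exact Gaussian elimination)."""
    m = [[Fraction(x) for x in r] for r in rows]
    n_cols = len(m[0]) if m else 0
    rk = 0
    for col in range(n_cols):
        # Find a row at or below position rk with a nonzero entry in this column.
        pivot = next((i for i in range(rk, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[rk], m[pivot] = m[pivot], m[rk]
        # Eliminate this column from every other row.
        for i in range(len(m)):
            if i != rk and m[i][col] != 0:
                f = m[i][col] / m[rk][col]
                m[i] = [a - f * b for a, b in zip(m[i], m[rk])]
        rk += 1
    return rk


def hamming(a, b):
    """Number of test cases on which two wrong codes behave differently."""
    return sum(x != y for x, y in zip(a, b))


def greedy_diverse_basis(matrix):
    """Greedy heuristic (an assumption, not WrongSelect): pick rank(matrix)
    linearly independent rows, each time taking the candidate whose minimum
    Hamming distance to the rows already chosen is largest."""
    target = rank(matrix)
    chosen = []
    while len(chosen) < target:
        best, best_score = None, -1
        for i, row in enumerate(matrix):
            if i in chosen:
                continue
            # Skip rows that would not increase the rank of the selection.
            if rank([matrix[j] for j in chosen] + [row]) != len(chosen) + 1:
                continue
            # For the first pick, fall back to the number of failed tests.
            score = min((hamming(row, matrix[j]) for j in chosen),
                        default=sum(row))
            if score > best_score:
                best, best_score = i, score
        chosen.append(best)
    return chosen


# Toy code-test matrix: 5 wrong codes x 4 test cases (hypothetical data).
matrix = [
    [1, 0, 0, 1],
    [1, 0, 0, 1],  # duplicate failure pattern of row 0
    [0, 1, 0, 0],
    [1, 1, 0, 1],  # sum of rows 0 and 2, so linearly dependent on them
    [0, 0, 1, 1],
]

basis = greedy_diverse_basis(matrix)
print("rank:", rank(matrix))
print("selected wrong codes:", basis)
```

The sketch recomputes the rank for every candidate, which is fine for a toy matrix but would not scale to millions of submissions. An incremental elimination would be the natural next step.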

Related papers