Fugu-MT paper translation (abstract): PBT-Bench: Benchmarking AI Agents on Property-Based Testing

Paper overview: PBT-Bench: Benchmarking AI Agents on Property-Based Testing

arxiv url: http://arxiv.org/abs/2605.15229v2
Date: Wed, 20 May 2026 00:07:46 GMT
Status: translation complete
Last updated in system: 2026-05-21 19:19:56.201833
Title: PBT-Bench: Benchmarking AI Agents on Property-Based Testing
Title (reference translation): PBT-Bench: Benchmarking AI Agents on Property-Based Testing
Authors: Lucas Jing, Xinqi Wang, Liao Zhang, Simon S. Du
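Abstract summary: PBT-Bench is a benchmark of 100 property-based testing problems across 40 real Python libraries. Each problem injects one or more semantic bugs (365 in total, mean 3.65 per problem), designed so that default-strategy random inputs almost never trigger them. Bug recall under the PBT-guided prompt ranges from 42.1% to 83.4%; under the open-ended baseline, it ranges from 31.4% to 76.7%.
Reference score (independently computed attention measure): 29.035258104995204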
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Existing code benchmarks measure whether an agent can produce any test that reproduces a known bug, or whether it can produce a patch that fixes a described issue. Neither isolates the distinct skill of property-based testing: deriving a semantic invariant from documentation, and then constructing an input-generation strategy precise enough to make a random search reveal the violation. We introduce PBT-Bench, a benchmark of 100 curated property-based testing problems across 40 real Python libraries. Each problem injects one or more semantic bugs (365 in total, mean 3.65 per problem) designed so that default-strategy random inputs almost never trigger them; the agent must read the library's documentation, identify the relevant invariant, and specify a Hypothesis @given strategy that concentrates mass in the trigger region. Bugs are stratified across three difficulty levels (L1-L3) spanning single-constraint boundary bugs to stateful, cross-function protocol violations. We evaluate eight contemporary LLMs under two prompting regimes (open-ended baseline vs. explicit Hypothesis scaffolding) for three independent runs per configuration. Bug recall under the PBT-guided prompt ranges from 42.1% to 83.4% across models; under the open-ended baseline, from 31.4% to 76.7%. Hypothesis scaffolding lifts mid-capability models by over 20 percentage points, but yields smaller gains for the strongest models, with two exceptions showing degradation, suggesting the structured prompt can interfere with certain model behaviours rather than complementing them. The hardest bugs prove model-specific: different architectures fail on different problems, leaving persistent gaps that no single model closes. We release the benchmark, harness, and full evaluation corpus to support downstream work on documentation-grounded semantic reasoning.
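Abstract (reference translation): Existing code benchmarks measure whether an agent can generate a test that reproduces a known bug, or whether it can generate a patch that fixes a described issue. Neither isolates the skill specific to property-based testing: deriving a semantic invariant from documentation, and then building an input-generation strategy precise enough for random search to expose the violation. PBT-Bench is a benchmark of 100 property-based testing problems across 40 real Python libraries. Each problem injects one or more semantic bugs (365 in total, mean 3.65 per problem), designed so that default-strategy random inputs almost never trigger them; the agent must read the library's documentation, identify the relevant invariant, and specify a Hypothesis @given strategy that concentrates mass in the trigger region. Bugs are stratified into three difficulty levels (L1-L3), from single-constraint boundary bugs to stateful, cross-function protocol violations. Eight contemporary LLMs are evaluated under two prompting regimes (open-ended baseline vs. explicit Hypothesis scaffolding), with three independent runs per configuration. Bug recall under the PBT-guided prompt ranges from 42.1% to 83.4%; under the open-ended baseline, from 31.4% to 76.7%. Hypothesis scaffolding lifts mid-capability models by more than 20 percentage points, but the gains for the strongest models are smaller, and two exceptions show degradation, suggesting that the structured prompt can interfere with certain model behaviours rather than complementing them. The hardest bugs are model-specific: different architectures fail on different problems, leaving persistent gaps that no single model closes. The benchmark, harness, and full evaluation corpus are released to support downstream work on documentation-grounded semantic reasoning.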
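
The abstract's central idea, that a bug can sit in a trigger region which default random inputs almost never reach, can be illustrated with a small stdlib-only sketch. It does not use Hypothesis or any PBT-Bench problem: the `encode`/`decode` pair, the injected bug at 2**16, and both generators are hypothetical and invented purely for illustration. In Hypothesis, the targeted generator would correspond to a custom strategy passed to @given.

```python
import random
from typing import Callable, Optional


def encode(n: int) -> bytes:
    """Hypothetical length-prefixed integer encoder with an injected bug."""
    length = max(1, (n.bit_length() + 7) // 8)
    if n == 2**16:
        # Injected bug (assumption for this sketch): off-by-one in the
        # length computation exactly at the 2-byte/3-byte boundary.
        length -= 1
    return bytes([length]) + (n % 256**length).to_bytes(length, "big")


def decode(data: bytes) -> int:
    """Inverse of encode: read the length prefix, then the payload."""
    length = data[0]
    return int.from_bytes(data[1:1 + length], "big")


def round_trips(n: int) -> bool:
    """The invariant under test: decode(encode(n)) == n."""
    return decode(encode(n)) == n


def uniform_gen(rng: random.Random) -> int:
    """Stand-in for a default strategy: uniform over 32-bit integers."""
    return rng.randrange(2**32)


def boundary_gen(rng: random.Random) -> int:
    """Targeted generator: concentrates mass near powers of two."""
    return 2 ** rng.randint(1, 31) + rng.randint(-2, 2)


def search(gen: Callable[[random.Random], int],
           trials: int, seed: int) -> Optional[int]:
    """Return the first input that violates the invariant, or None."""
    rng = random.Random(seed)
    for _ in range(trials):
        n = gen(rng)
        if not round_trips(n):
            return n
    return None


uniform_hit = search(uniform_gen, trials=5000, seed=0)
targeted_hit = search(boundary_gen, trials=5000, seed=0)
print("uniform generator found:", uniform_hit)
print("boundary generator found:", targeted_hit)
```

Under the uniform generator each trial has a 1 in 2**32 chance of hitting the single failing input, while the boundary generator hits it with probability 1/155 per trial, so the same trial budget goes from practically never finding the bug to practically always finding it.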

Related papers