Fugu-MT Paper Translation (Abstract): Reproducible Automated Program Repair Is Hard -- Experiences With the Defects4J Dataset

Paper summary: Reproducible Automated Program Repair Is Hard -- Experiences With the Defects4J Dataset

arxiv url: http://arxiv.org/abs/2604.26674v1
Date: Wed, 29 Apr 2026 13:46:13 GMT
Status: Translation complete
Last updated in system: 2026-04-30 15:59:36.425426
Title: Reproducible Automated Program Repair Is Hard -- Experiences With the Defects4J Dataset
Authors: Adam Krafczyk, Klaus Schmid,
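Abstract summary: The paper investigates Defects4J, a widely used benchmark with over 1,800 citations according to Google Scholar. When the test suites are executed under strict reproducibility requirements for APR settings, 180 (21.6 %) of the defects turn out to be unsuitable for evaluation experiments. An additional 59 (7.1 %) defects have under-specified test suites: deleting a single statement from the code base makes all test cases pass, although the human-written patch does not only delete code.
Reference score (independently computed attention score): 0.45835414225547183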
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: In the research of automated program repair (APR), benchmark datasets consisting of known defects in combination with test suites that indicate the defects are of high importance. They allow for an evidence-based comparison of different APR approaches. In our own work on APR we found significant challenges when working with widely used defect datasets, which go beyond mere repeatability of defects via test cases. We summarize these identified challenges and related lessons learned to bring them to the attention of the APR community and quantify the potential impact of them. In particular, we investigate the widely used benchmark Defects4J, which has according to Google Scholar over 1,800 citations. It consists of 835 defects from 17 open-source Java projects; a hand-curated collection of defects, test suites that clearly indicate the defect, and human patches where any unrelated changes are removed. We find that, when executing the test suites with strict requirements for reproducibility in APR settings (beyond merely reproducing the defect via test cases), 180 (21.6 %) of the defects are not suitable for evaluation experiments. Further, we find that an additional 59 (7.1 %) defects have test suites that are obviously under-specified, as deleting a single statement from the code base makes all test cases pass, although the human-written patch does not only delete code. Our contributions are: a systematic collection of requirements for defect datasets for APR beyond traditional reproducibility of defects, a description of practical experiences and quantitative analysis of problems with the Defects4J dataset, as well as an implementation of an evaluation framework for APR tools for Java programs. This evaluation framework does stricter checking for indications of inadequate test suites, to avoid otherwise unnoticed problems in the test suite, such as flaky tests.
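The percentages quoted in the abstract can be checked against the stated dataset size. The following is a minimal sketch, not part of the paper: the variable and function names are illustrative, and the combined figure assumes the two groups are disjoint, as the word "additional" suggests.

```python
# Counts taken from the abstract; the names are illustrative assumptions.
TOTAL_DEFECTS = 835     # defects in Defects4J, as stated in the abstract
UNSUITABLE = 180        # do not meet the strict reproducibility requirements
UNDER_SPECIFIED = 59    # deleting a single statement makes all tests pass


def share(count: int, total: int = TOTAL_DEFECTS) -> float:
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


# Assumption: the two groups do not overlap ("an additional 59").
affected = UNSUITABLE + UNDER_SPECIFIED

print(f"unsuitable:      {share(UNSUITABLE)} %")
print(f"under-specified: {share(UNDER_SPECIFIED)} %")
print(f"combined:        {affected} defects, {share(affected)} %")
```

Under that assumption, 239 of the 835 defects fall into one of the two problem groups.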

Related papers