Fugu-MT Paper Translation (Abstract): Unveiling Practical Shortcomings of Patch Overfitting Detection Techniques

Paper Overview: Unveiling Practical Shortcomings of Patch Overfitting Detection Techniques

arxiv url: http://arxiv.org/abs/2603.11262v1
Date: Wed, 11 Mar 2026 19:45:41 GMT
Status: Translation complete
System update date: 2026-03-13 14:46:25.616464
Title: Unveiling Practical Shortcomings of Patch Overfitting Detection Techniques
Authors: David Williams, Ioakim Avraam, Aldeida Aleti, Matias Martinez, Justyna Petke, Federica Sarro
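Abstract summary: This work presents the first comprehensive benchmarking study of several patch overfitting detection (POD) methods in a practical scenario. It benchmarks six state-of-the-art POD approaches against two baselines based on random sampling. Simple random selection outperforms all POD tools for 71% to 96% of cases, depending on the POD tool.
Reference score (independently computed attention score): 14.046456257175238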
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Automated Program Repair (APR) can reduce the time developers spend debugging, allowing them to focus on other aspects of software development. Automatically generated bug patches are typically validated through software testing. However, this method can lead to patch overfitting, i.e., generating patches that pass the given tests but are still incorrect. Patch correctness assessment (also known as overfitting detection) techniques have been proposed to identify patches that overfit. However, prior work often assessed the effectiveness of these techniques in isolation and on datasets that do not reflect the distribution of correct-to-overfitting patches that would be generated by APR tools in typical use; thus, we still do not know their effectiveness in practice. This work presents the first comprehensive benchmarking study of several patch overfitting detection (POD) methods in a practical scenario. To this end, we curate datasets that reflect realistic assumptions (i.e., patches produced by tools run under the same experimental conditions). Next, we use these data to benchmark six state-of-the-art POD approaches -- spanning static analysis, dynamic testing, and learning-based approaches -- against two baselines based on random sampling (one from prior work and one proposed herein). Our results are striking: Simple random selection outperforms all POD tools for 71% to 96% of cases, depending on the POD tool. This suggests two main takeaways: (1) current POD tools offer limited practical benefit, highlighting the need for novel techniques; (2) any POD tool must be benchmarked on realistic data and against random sampling to prove its practical effectiveness. To this end, we encourage the APR community to continue improving POD techniques and to adopt our proposed methodology for practical benchmarking; we make our data and code available to facilitate such adoption.
