Fugu-MT Paper Translation (Summary): PR-Aware Automated Unit Test Generation: Challenges and Opportunities

Paper summary: PR-Aware Automated Unit Test Generation: Challenges and Opportunities

arxiv url: http://arxiv.org/abs/2605.25285v1
Date: Sun, 24 May 2026 22:38:04 GMT
Status: Translation complete
Last updated in system: 2026-05-26 19:50:19.063028
Title: PR-Aware Automated Unit Test Generation: Challenges and Opportunities
Title (reference translation): Challenges and Opportunities in PR-Aware Automated Unit Test Generation
Authors: Vahid Haratian, Atakan Akar, Berk Çakar, Eray Tüzün
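Abstract summary: This study evaluates two approaches: EvoSuite, a leading search-based tool, and GPT-4o, one of the widely used large language models (LLMs). The work highlights a critical gap in tooling and calls for high-performance test generators tailored to the incremental nature of modern software development.
Reference score (independently computed attention measure): 1.3241176321860364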
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Automated test generation has a substantial body of work, yet most studies focus on generating tests for complete software units, such as classes, and rely on metrics such as code coverage for assessment. In contrast, modern software development primarily evolves through small, targeted changes introduced in pull requests (PRs). Despite this, the crucial task of generating tests specifically for these PRs has been overlooked, and the performance of state-of-the-art tools for this purpose remains unknown. This study evaluates two distinct approaches for PR-aware test generation: EvoSuite, a leading search-based tool, and GPT-4o, one of the widely used large language models (LLMs). To measure their effectiveness at validating PR-specific changes, we assess their ability to generate fail-to-pass (F2P) test cases, meaning tests that fail on the code before the change and pass on the code after the change. Our evaluation shows that EvoSuite outperformed GPT-4o, producing at least one F2P test for a significantly higher percentage of PRs (36 percent vs. 13 percent). The performance of GPT-4o was significantly hampered by a high rate of compilation errors (63 percent), whereas only 2 percent of EvoSuite's generated tests failed to run. Despite EvoSuite's relative success, our findings indicate that both tools are largely ineffective for this task, as they failed to generate any meaningful change-capturing tests for the large majority of the PRs (64 percent). Although both generators could not achieve a high F2P ratio in our evaluation, and EvoSuite outperformed GPT-4o, we believe that agentic code generation methods may have significant potential for this task. Ultimately, our work highlights a critical gap in tooling and calls for the development of high-performance test generators tailored to the incremental nature of modern software development.
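Abstract (reference translation): Automated test generation has a substantial body of work, but most studies focus on generating tests for complete software units, such as classes, and rely on metrics such as code coverage for assessment. In contrast, modern software development evolves primarily through small, targeted changes introduced in pull requests (PRs). Nevertheless, the important task of generating tests specifically for these PRs has been overlooked, and the performance of state-of-the-art tools for this purpose remains unknown. This study evaluates two distinct approaches: EvoSuite, a leading search-based tool, and GPT-4o, one of the widely used large language models (LLMs). To assess their effectiveness at validating PR-specific changes, it measures their ability to generate fail-to-pass (F2P) test cases, that is, tests that fail on the code before the change and pass on the code after it. The evaluation shows that EvoSuite outperformed GPT-4o, producing at least one F2P test for a significantly higher percentage of PRs (36% vs. 13%). GPT-4o's performance was significantly hampered by a high rate of compilation errors (63%), whereas only 2% of EvoSuite's generated tests failed to run. Despite EvoSuite's relative success, both tools are largely ineffective for this task, since they failed to generate any meaningful change-capturing test for the large majority of PRs (64%). Although neither generator achieved a high F2P ratio and EvoSuite outperformed GPT-4o, the authors believe that agentic code generation methods may have significant potential for this task. Ultimately, the work highlights a critical gap in tooling and calls for high-performance test generators tailored to the incremental nature of modern software development.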
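
The fail-to-pass (F2P) criterion described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' evaluation harness: the names GeneratedTestRun, Outcome, is_fail_to_pass, and pr_f2p_rate are hypothetical, and the choice to treat a test that does not compile or run as "not F2P" is an assumption made here for illustration, since the abstract does not say how such cases are counted.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    PASS = "pass"
    FAIL = "fail"
    ERROR = "error"  # assumption: test did not compile or could not be run


@dataclass(frozen=True)
class GeneratedTestRun:
    """Hypothetical record: one generated test executed on both PR versions."""
    pr_id: str
    test_name: str
    before: Outcome  # outcome on the code before the change
    after: Outcome   # outcome on the code after the change


def is_fail_to_pass(run: GeneratedTestRun) -> bool:
    # F2P as defined in the abstract: fails before the change, passes after it.
    # Assumption: ERROR on either side does not count as F2P.
    return run.before is Outcome.FAIL and run.after is Outcome.PASS


def pr_f2p_rate(runs: list[GeneratedTestRun], all_pr_ids: set[str]) -> float:
    """Fraction of PRs for which at least one generated test is F2P."""
    if not all_pr_ids:
        return 0.0
    prs_with_f2p = {r.pr_id for r in runs if is_fail_to_pass(r)}
    return len(prs_with_f2p & all_pr_ids) / len(all_pr_ids)


if __name__ == "__main__":
    # Toy data only; these are not results from the paper.
    runs = [
        GeneratedTestRun("pr-1", "testA", Outcome.FAIL, Outcome.PASS),
        GeneratedTestRun("pr-2", "testB", Outcome.PASS, Outcome.PASS),
        GeneratedTestRun("pr-3", "testC", Outcome.ERROR, Outcome.ERROR),
    ]
    print(pr_f2p_rate(runs, {"pr-1", "pr-2", "pr-3", "pr-4"}))
```

A PR-level rate of this kind corresponds to the "at least one F2P test per PR" figure quoted in the abstract. A test that passes on both versions, such as testB above, may still raise coverage but does not capture the change, which is why coverage alone is a weak measure for this task.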


Related papers