Fugu-MT paper translation (abstract): Evaluating LLM-Based Test Generation Under Software Evolution

Paper summary: Evaluating LLM-Based Test Generation Under Software Evolution

arxiv url: http://arxiv.org/abs/2603.23443v1
Date: Tue, 24 Mar 2026 17:14:18 GMT
Status: translation complete
Last updated in system: 2026-03-25 19:53:37.605929
Title: Evaluating LLM-Based Test Generation Under Software Evolution
Title (reference translation): Evaluation of LLM-based test generation under software evolution
Authors: Sabaat Haroon, Mohammad Taha Khan, Muhammad Ali Gulzar
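Abstract summary: Large language models (LLMs) are increasingly used for automated unit test generation. The paper presents a large-scale empirical study of LLM-based test generation under program changes.
Reference score (attention score computed independently by this site): 7.140756378584939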
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) are increasingly used for automated unit test generation. However, it remains unclear whether these tests reflect genuine reasoning about program behavior or simply reproduce superficial patterns learned during training. If the latter dominates, LLM-generated tests may exhibit weaknesses such as reduced coverage, missed regressions, and undetected faults. Understanding how LLMs generate tests and how those tests respond to code evolution is therefore essential. We present a large-scale empirical study of LLM-based test generation under program changes. Using an automated mutation-driven framework, we analyze how generated tests react to semantic-altering changes (SAC) and semantic-preserving changes (SPC) across eight LLMs and 22,374 program variants. LLMs achieve strong baseline results, reaching 79% line coverage and 76% branch coverage with fully passing test suites on the original programs. However, performance degrades as programs evolve. Under SACs, the pass rate of newly generated tests drops to 66%, and branch coverage declines to 60%. More than 99% of failing SAC tests pass on the original program while executing the modified region, indicating residual alignment with the original behavior rather than adaptation to updated semantics. Performance also declines under SPCs despite unchanged functionality: pass rates fall to 79% and branch coverage to 69%. Although SPC edits preserve semantics, they often introduce larger syntactic changes, leading to instability in generated test suites. Models generate more new tests while discarding many baseline tests, suggesting sensitivity to lexical changes rather than true semantic impact. Overall, our results indicate that current LLM-based test generation relies heavily on surface-level cues and struggles to maintain regression awareness as programs evolve.
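Abstract (reference translation): Large language models (LLMs) are increasingly used to generate unit tests automatically. It is still unclear, however, whether these tests reflect genuine reasoning about program behavior or merely reproduce surface patterns learned during training. If the latter dominates, LLM-generated tests may show weaknesses such as reduced coverage, missed regressions, and undetected faults. It is therefore essential to understand how LLMs generate tests and how those tests respond to code evolution. The paper presents a large-scale empirical study of LLM-based test generation under program changes. Using an automated mutation-driven framework, it analyzes how generated tests react to semantic-altering changes (SAC) and semantic-preserving changes (SPC) across eight LLMs and 22,374 program variants. On the original programs, LLMs reach 79% line coverage and 76% branch coverage with fully passing test suites. Performance degrades as programs evolve. Under SACs, the pass rate of newly generated tests drops to 66% and branch coverage to 60%. More than 99% of failing SAC tests pass on the original program while executing the modified region, which indicates that they remain aligned with the original behavior rather than adapting to the updated semantics. Performance also declines under SPCs even though functionality is unchanged: the pass rate falls to 79% and branch coverage to 69%. SPC edits preserve semantics but often introduce larger syntactic changes, which makes the generated test suites unstable. Models generate more new tests while discarding many baseline tests, suggesting sensitivity to lexical changes rather than to true semantic impact. Overall, the results indicate that current LLM-based test generation relies heavily on surface-level cues and struggles to maintain regression awareness as programs evolve.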
