Fugu-MT Paper Translation (Abstract): Assertion-Aware Test Code Summarization with Large Language Models

Paper abstract: Assertion-Aware Test Code Summarization with Large Language Models

arxiv url: http://arxiv.org/abs/2511.06227v1
Date: Sun, 09 Nov 2025 04:58:32 GMT
Status: Translation complete
System update date: 2025-11-11 21:18:44.822753
Title: Assertion-Aware Test Code Summarization with Large Language Models
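Title (reference translation): Assertion-Aware Test Code Summarization Using Large Language Models
Authors: Anamul Haque Mollah, Ahmed Aljohani, Hyunsook Do
Abstract summary: Unit tests often lack concise summaries that convey test intent. This paper presents a benchmark of 91 real-world Java test cases paired with developer-written summaries.
Reference score (independently computed attention score): 0.0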
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Unit tests often lack concise summaries that convey test intent, especially in auto-generated or poorly documented codebases. Large Language Models (LLMs) offer a promising solution, but their effectiveness depends heavily on how they are prompted. Unlike generic code summarization, test-code summarization poses distinct challenges because test methods validate expected behavior through assertions rather than implementing functionality. This paper presents a new benchmark of 91 real-world Java test cases paired with developer-written summaries and conducts a controlled ablation study to investigate how test code-related components, such as the method under test (MUT), assertion messages, and assertion semantics, affect the performance of LLM-generated test summaries. We evaluate four code LLMs (Codex, Codestral, DeepSeek, and Qwen-Coder) across seven prompt configurations using n-gram metrics (BLEU, ROUGE-L, METEOR), semantic similarity (BERTScore), and LLM-based evaluation. Results show that prompting with assertion semantics improves summary quality by an average of 0.10 points (2.3%) over full MUT context (4.45 vs. 4.35) while requiring fewer input tokens. Codex and Qwen-Coder achieve the highest alignment with human-written summaries, while DeepSeek underperforms despite high lexical overlap. The replication package is publicly available at https://doi.org/10.5281/zenodo.17067550
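Abstract (reference translation): Unit tests often lack concise summaries that convey test intent. Large language models (LLMs) offer a promising solution, but their effectiveness depends heavily on how they are prompted. Unlike general code summarization, test code summarization poses distinct challenges because test methods verify expected behavior through assertions. This paper presents a new benchmark of 91 real-world Java test cases paired with developer-written summaries and examines how test code-related components, such as the method under test (MUT), assertion messages, and assertion semantics, affect the performance of LLM-generated test summaries. We evaluate four LLMs (Codex, Codestral, DeepSeek, and Qwen-Coder) across seven prompt configurations using n-gram metrics (BLEU, ROUGE-L, METEOR), semantic similarity (BERTScore), and LLM-based evaluation. The results show that prompting with assertion semantics improves summary quality by an average of 0.10 points (2.3%) over full MUT context (4.45 vs. 4.35) while requiring fewer input tokens. Codex and Qwen-Coder achieve the highest alignment with human-written summaries, while DeepSeek underperforms despite high lexical overlap. The replication package is available at https://doi.org/10.5281/zenodo.17067550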
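
The abstract describes prompts that vary which test-related components are supplied to the model (the MUT, assertion messages, assertion semantics). The following is a minimal Python sketch of that general idea, not the authors' implementation: the function names `extract_assertions` and `build_prompt`, the prompt wording, and the sample Java test are all hypothetical, and the paper's seven actual configurations are not listed in the abstract.

```python
import re
from typing import List, Optional

# Matches the start of a JUnit-style assertion call such as assertEquals( or assertTrue(.
ASSERT_CALL = re.compile(r"\bassert\w*\s*\(")


def extract_assertions(test_source: str) -> List[str]:
    """Return each assertion call found in a Java test method as a string.

    Uses a simple balanced-parenthesis scan. Limitation (assumed acceptable for
    a sketch): parentheses inside string literals are not handled specially.
    """
    found = []
    for match in ASSERT_CALL.finditer(test_source):
        depth = 0
        # match.end() - 1 is the index of the opening parenthesis.
        for i in range(match.end() - 1, len(test_source)):
            ch = test_source[i]
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
                if depth == 0:
                    found.append(test_source[match.start(): i + 1])
                    break
    return found


def build_prompt(
    test_source: str,
    mut_source: Optional[str] = None,
    include_assertions: bool = True,
) -> str:
    """Assemble a summarization prompt from optional components (hypothetical wording)."""
    parts = [
        "Summarize the intent of this Java unit test in one sentence.",
        "",
        "Test method:",
        test_source.strip(),
    ]
    if mut_source is not None:
        parts += ["", "Method under test (MUT):", mut_source.strip()]
    if include_assertions:
        assertions = extract_assertions(test_source)
        if assertions:
            parts += ["", "Assertions to explain:"]
            parts += [f"- {a}" for a in assertions]
    return "\n".join(parts)


# Hypothetical sample test, invented for illustration only.
SAMPLE_TEST = """
@Test
public void testAddIgnoresNull() {
    Cart cart = new Cart();
    cart.add(null);
    assertEquals("null items must be skipped", 0, cart.size());
    assertTrue(cart.isEmpty());
}
"""

if __name__ == "__main__":
    print(build_prompt(SAMPLE_TEST))
```

Toggling `mut_source` and `include_assertions` gives a rough analogue of an ablation over prompt components: an assertion-only prompt omits the MUT body entirely, which is one plausible reason such a prompt could need fewer input tokens than one carrying the full MUT context.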

Related papers