Fugu-MT paper translation (abstract): Evaluation of LLM-Based Software Engineering Tools: Practices, Challenges, and Future Directions

Paper overview: Evaluation of LLM-Based Software Engineering Tools: Practices, Challenges, and Future Directions

arxiv url: http://arxiv.org/abs/2604.24621v1
Date: Mon, 27 Apr 2026 15:51:22 GMT
Status: Translation complete
Last updated in system: 2026-04-28 17:12:08.129294
Title: Evaluation of LLM-Based Software Engineering Tools: Practices, Challenges, and Future Directions
Authors: Utku Boran Torun, Veli Karakaya, Ali Babar, Eray Tüzün
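Abstract summary: Large Language Models (LLMs) are increasingly embedded in software engineering (SE) tools. The paper discusses why reliable evaluation is essential for trust, adoption, and meaningful assessment of LLM-based tools.
Reference score (attention score computed independently by this site): 1.9774267722954466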
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Large Language Models (LLMs) are increasingly embedded in software engineering (SE) tools, powering applications such as code generation, automated code review, and bug triage. As these LLM-based AI for Software Engineering (AI4SE) systems transition from experimental prototypes to widely deployed tools, the question of what it means to evaluate their behavior reliably has become both critical and unanswered. Unlike traditional SE or machine learning systems, LLM-based tools often produce open-ended, natural language outputs, admit multiple valid answers, and exhibit non-deterministic behavior across runs. These characteristics fundamentally challenge long-standing evaluation assumptions such as the existence of a single ground truth, deterministic outputs, and objective correctness. In this paper, we examine LLM evaluation as a general, task-dependent concept through the lens of SE tasks. We discuss why reliable evaluation is essential for trust, adoption, and meaningful assessment of LLM-based tools, summarize the current state of evaluation practices, and highlight their limitations in realistic AI4SE settings. We then identify key challenges facing current approaches, including the absence of stable ground truth, subjectivity and multi-dimensional quality, evaluation instability due to non-determinism, limitations of automated and model-based evaluation, and fragmentation of evaluation practices. Finally, we outline future directions aimed at advancing LLM evaluation toward more robust, scalable, and trustworthy methodologies, to stimulate discussion on principled evaluation practices that can keep pace with the growing role of LLMs in SE.

Related papers