Fugu-MT Paper Translation (Summary): Reproducible, Explainable, and Effective Evaluations of Agentic AI for Software Engineering

Paper summary: Reproducible, Explainable, and Effective Evaluations of Agentic AI for Software Engineering

arxiv url: http://arxiv.org/abs/2604.01437v1
Date: Wed, 01 Apr 2026 22:24:08 GMT
Status: Translation complete
Last updated in system: 2026-04-03 14:21:10.081342
Title: Reproducible, Explainable, and Effective Evaluations of Agentic AI for Software Engineering
Authors: Jingyue Li, André Storhaug
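Abstract summary: The large language models (LLMs) that underpin Agentic AI often function as black boxes, making it difficult to justify the superiority of Agentic AI approaches over baselines. This study analyzes 18 papers published or accepted by ICSE 2026, ICSE 2025, FSE 2025, ASE 2025, and ISSTA 2025. The analysis identifies prevailing approaches to evaluating Agentic AI for SE and their limitations, both in current research and potential future studies.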
Reference score (independently computed attention score): 4.411658619208916
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: With the advancement of Agentic AI, researchers are increasingly leveraging autonomous agents to address challenges in software engineering (SE). However, the large language models (LLMs) that underpin these agents often function as black boxes, making it difficult to justify the superiority of Agentic AI approaches over baselines. Furthermore, missing information in the evaluation design description frequently renders the reproduction of results infeasible. To synthesize current evaluation practices for Agentic AI in SE, this study analyzes 18 papers on the topic, published or accepted by ICSE 2026, ICSE 2025, FSE 2025, ASE 2025, and ISSTA 2025. The analysis identifies prevailing approaches and their limitations in evaluating Agentic AI for SE, both in current research and potential future studies. To address these shortcomings, this position paper proposes a set of guidelines and recommendations designed to empower reproducible, explainable, and effective evaluations of Agentic AI in software engineering. In particular, we recommend that Agentic AI researchers make their Thought-Action-Result (TAR) trajectories and LLM interaction data, or summarized versions of these artifacts, publicly accessible. Doing so will enable subsequent studies to more effectively analyze the strengths and weaknesses of different Agentic AI approaches. To demonstrate the feasibility of such comparisons, we present a proof-of-concept case study that illustrates how TAR trajectories can support systematic analysis across approaches.

Related papers