Fugu-MT Paper Translation (Abstract): LLM-as-a-Judge for Software Engineering: Literature Review, Vision, and the Road Ahead

Paper overview: LLM-as-a-Judge for Software Engineering: Literature Review, Vision, and the Road Ahead

arxiv url: http://arxiv.org/abs/2510.24367v1
Date: Tue, 28 Oct 2025 12:44:54 GMT
Status: Translation complete
System update date: 2025-10-29 15:35:37.115294
Title: LLM-as-a-Judge for Software Engineering: Literature Review, Vision, and the Road Ahead
Title (reference translation): LLM-as-a-Judge for Software Engineering: Literature Review, Vision, and the Road Ahead
Authors: Junda He, Jieke Shi, Terry Yue Zhuo, Christoph Treude, Jiamou Sun, Zhenchang Xing, Xiaoning Du, David Lo
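Abstract summary: This paper aims to support the community in advancing LLM-as-a-Judge for evaluating software artifacts. We envision these frameworks as reliable, robust, and scalable human surrogates capable of consistent, multi-faceted artifact evaluation by 2030.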
Reference score (independently computed attention score): 27.124885915455426
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The rapid integration of Large Language Models (LLMs) into software engineering (SE) has revolutionized tasks like code generation, producing a massive volume of software artifacts. This surge has exposed a critical bottleneck: the lack of scalable, reliable methods to evaluate these outputs. Human evaluation is costly and time-consuming, while traditional automated metrics like BLEU fail to capture nuanced quality aspects. In response, the LLM-as-a-Judge paradigm - using LLMs for automated evaluation - has emerged. This approach leverages the advanced reasoning of LLMs, offering a path toward human-like nuance at automated scale. However, LLM-as-a-Judge research in SE is still in its early stages. This forward-looking SE 2030 paper aims to steer the community toward advancing LLM-as-a-Judge for evaluating LLM-generated software artifacts. We provide a literature review of existing SE studies, analyze their limitations, identify key research gaps, and outline a detailed roadmap. We envision these frameworks as reliable, robust, and scalable human surrogates capable of consistent, multi-faceted artifact evaluation by 2030. Our work aims to foster research and adoption of LLM-as-a-Judge frameworks, ultimately improving the scalability of software artifact evaluation.
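Abstract (reference translation): The rapid integration of large language models (LLMs) into software engineering (SE) has revolutionized tasks such as code generation and produced a large volume of software artifacts. This surge has exposed a serious bottleneck: the lack of scalable, reliable methods for evaluating these outputs. Traditional automated metrics such as BLEU fail to capture subtle aspects of quality. The LLM-as-a-Judge paradigm - using LLMs for automated evaluation - has emerged. This approach leverages the advanced reasoning of LLMs and offers a path toward human-like nuance at automated scale. However, LLM-as-a-Judge research in SE is still at an early stage. This forward-looking SE 2030 paper aims to support the community in advancing LLM-as-a-Judge for evaluating LLM-generated software artifacts. We review the literature of existing SE studies, analyze their limitations, identify key research gaps, and outline a detailed roadmap. We envision these frameworks as reliable, robust, and scalable human surrogates capable of consistent, multi-faceted artifact evaluation by 2030. Our work aims to promote research on and adoption of LLM-as-a-Judge frameworks, ultimately improving the scalability of software artifact evaluation.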

Related papers