Fugu-MT paper translation (abstract): Can LLMs Replace Human Evaluators? An Empirical Study of LLM-as-a-Judge in Software Engineering

Paper overview: Can LLMs Replace Human Evaluators? An Empirical Study of LLM-as-a-Judge in Software Engineering

arxiv url: http://arxiv.org/abs/2502.06193v1
Date: Mon, 10 Feb 2025 06:49:29 GMT
Status: translation complete
Last updated in system: 2025-02-11 18:57:50.924714
Title: Can LLMs Replace Human Evaluators? An Empirical Study of LLM-as-a-Judge in Software Engineering
Authors: Ruiqi Wang, Jiyu Guo, Cuiyun Gao, Guodong Fan, Chun Yong Chong, Xin Xia
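Abstract summary: Large language models (LLMs) have been deployed to tackle various software engineering (SE) tasks such as code generation. The Pass@k metric requires extensive unit tests and configured environments, and is not suitable for evaluating LLM-generated text. Conventional metrics like BLEU, which measure only lexical rather than semantic similarity, have come under scrutiny.
Reference score (site-computed attention score): 18.766132076075365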
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recently, large language models (LLMs) have been deployed to tackle various software engineering (SE) tasks like code generation, significantly advancing the automation of SE tasks. However, assessing the quality of LLM-generated code and text remains challenging. The commonly used Pass@k metric necessitates extensive unit tests and configured environments, demands a high labor cost, and is not suitable for evaluating LLM-generated text. Conventional metrics like BLEU, which measure only lexical rather than semantic similarity, have also come under scrutiny. In response, a new trend has emerged to employ LLMs for automated evaluation, known as LLM-as-a-judge. These LLM-as-a-judge methods are claimed to better mimic human assessment than conventional metrics without relying on high-quality reference answers. Nevertheless, their exact human alignment in SE tasks remains unexplored. In this paper, we empirically explore LLM-as-a-judge methods for evaluating SE tasks, focusing on their alignment with human judgments. We select seven LLM-as-a-judge methods that utilize general-purpose LLMs, alongside two LLMs specifically fine-tuned for evaluation. After generating and manually scoring LLM responses on three recent SE datasets of code translation, code generation, and code summarization, we prompt these methods to evaluate each response. Finally, we compare the scores generated by these methods with human evaluation. The results indicate that output-based methods reach the highest Pearson correlations with human scores, 81.32 in code translation and 68.51 in code generation, achieving near-human evaluation and noticeably outperforming ChrF++, one of the best conventional metrics, which reaches 34.23 and 64.92. Such output-based methods prompt LLMs to output judgments directly, and exhibit more balanced score distributions that resemble human score patterns. Finally, we provide...
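The comparison step in the abstract, correlating scores from a judging method with human scores, can be illustrated with a minimal Python sketch. The score lists below are hypothetical and are not data from the paper. The `pearson` helper is an illustrative plain-Python implementation, not the authors' code. The multiplication by 100 assumes that the reported figures (such as 81.32) are correlation coefficients scaled to a percentage, which the abstract does not state explicitly.

```python
import math


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores for five responses (not taken from the paper).
human_scores = [5, 3, 4, 1, 2]
judge_scores = [4, 3, 5, 1, 2]

r = pearson(human_scores, judge_scores)
# Assumption: scale by 100 to match the style of the figures in the abstract.
print(f"Pearson correlation x 100: {100 * r:.2f}")  # prints 90.00
```

A higher value means the judging method ranks and spaces responses more like the human annotators do. A constant score list makes the coefficient undefined, which is why the sketch raises an error in that case.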
