Fugu-MT Paper Translation (Abstract): On the Effectiveness of LLM-as-a-judge for Code Generation and Summarization

Paper overview: On the Effectiveness of LLM-as-a-judge for Code Generation and Summarization

arxiv url: http://arxiv.org/abs/2507.16587v1
Date: Tue, 22 Jul 2025 13:40:26 GMT
Status: Translation complete
Last updated in system: 2025-07-23 21:34:14.14034
Title: On the Effectiveness of LLM-as-a-judge for Code Generation and Summarization
Authors: Giuseppe Crupi, Rosalia Tufano, Alejandro Velasco, Antonio Mastropaolo, Denys Poshyvanyk, Gabriele Bavota
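Abstract summary: Large language models have recently been used as judges for complex natural language processing tasks such as Q&A. This paper studies the effectiveness of LLMs-as-a-judge for two code-related tasks: code generation and code summarization.
Reference score (attention score computed independently by this site): 54.965787768076254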
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models have recently been exploited as judges for complex natural language processing tasks, such as Q&A. The basic idea is to delegate to an LLM the assessment of the "quality" of the output provided by an automated technique for tasks for which: (i) quantitative metrics would only tell part of the story, and (ii) a large-scale human-based evaluation would be too expensive. LLMs-as-a-judge, if proven effective for a specific task, can also unlock new possibilities for automation, with several LLMs proposing a solution for a given instance of the task and others judging and deciding which output is the best one to show the user. We study the effectiveness of LLMs-as-a-judge for two code-related tasks, namely code generation and code summarization. The rationale for choosing these tasks is two-fold. First, quantitative metrics are usually not enough for the assessment of code summarizers/generators. For example, it is well documented that metrics such as BLEU are quite weak proxies for the quality of the generated summaries. Second, even state-of-the-art techniques still struggle with handling complex instances of these tasks, making them good candidates for benefiting from more advanced solutions envisioning collaboration among LLMs. For code generation, we check whether eight LLMs are able to judge the correctness of 1,405 Java methods and 1,281 Python functions generated by the same LLMs or implemented by humans. For code summarization, we compare the judgments of five LLMs to those provided by nine humans for ~1.2k summaries, related to both Java and Python functions. Our findings show that GPT-4-turbo is the best LLM in terms of judging capabilities for both tasks, with "smaller" LLMs featuring tens of billions of parameters not being able to cope with judging tasks. However, even the best-performing LLM frequently misjudges the correctness of the code and summary quality.
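To make the code-generation setup in the abstract more concrete, the following minimal sketch compares a judge's binary correctness verdicts against ground-truth labels (for example, obtained by running tests). The function names and the sample data are hypothetical, and Cohen's kappa is used only as one common agreement measure; it is an assumption of this sketch, not necessarily the statistic reported in the paper.

```python
from collections import Counter
from typing import Sequence


def confusion_counts(truth: Sequence[bool], verdicts: Sequence[bool]) -> Counter:
    """Count (ground_truth, judge_verdict) pairs."""
    if len(truth) != len(verdicts):
        raise ValueError("truth and verdicts must have the same length")
    return Counter(zip(truth, verdicts))


def cohen_kappa(truth: Sequence[bool], verdicts: Sequence[bool]) -> float:
    """Chance-corrected agreement between two binary label sequences."""
    counts = confusion_counts(truth, verdicts)
    n = len(truth)
    if n == 0:
        raise ValueError("at least one labelled item is required")
    # Fraction of items on which the judge matches the ground truth.
    observed = (counts[(True, True)] + counts[(False, False)]) / n
    # Agreement expected by chance, from each side's rate of "correct" labels.
    p_truth = sum(truth) / n
    p_judge = sum(verdicts) / n
    expected = p_truth * p_judge + (1 - p_truth) * (1 - p_judge)
    if expected == 1.0:
        return 1.0  # both sequences are constant and identical
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Hypothetical data: True means "the code is correct".
    truth = [True, True, False, False, True, False]
    verdicts = [True, True, True, False, True, True]
    counts = confusion_counts(truth, verdicts)
    print("wrong code accepted by the judge:", counts[(False, True)])
    print("correct code rejected by the judge:", counts[(True, False)])
    print("kappa:", round(cohen_kappa(truth, verdicts), 3))
```

In this toy data the judge accepts two incorrect implementations, so raw agreement (4 of 6) overstates its reliability once chance agreement is discounted.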

Related papers

An Empirical Study of Many-to-Many Summarization with Large Language Models [82.10000188179168]
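Large language models (LLMs) have shown strong multilingual abilities, giving them the potential to perform many-to-many summarization (M2MS) in real applications. This work is a systematic study of the M2MS ability of LLMs.
Reference translation (metadata) (2025-05-19T11:18:54Z)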
Can LLMs Replace Human Evaluators? An Empirical Study of LLM-as-a-Judge in Software Engineering [18.766132076075365]
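Large language models (LLMs) have been deployed to tackle various software engineering (SE) tasks such as code generation. The Pass@k metric requires extensive unit tests and configured environments, and is not suitable for evaluating LLM-generated text. Conventional metrics such as BLEU, which measure only lexical rather than semantic similarity, have come under scrutiny.
Reference translation (metadata) (2025-02-10T06:49:29Z)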
Scoring with Large Language Models: A Study on Measuring Empathy of Responses in Dialogues [3.2162648244439684]
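This work develops a framework for investigating how effective large language models are at measuring and scoring the empathy of responses in dialogues. The strategy is to approximate the performance of state-of-the-art and fine-tuned LLMs with explicit, explainable features. The results show that using embeddings alone achieves performance close to that of generic LLMs.
Reference translation (metadata) (2024-12-28T20:37:57Z)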
SkillAggregation: Reference-free LLM-Dependent Aggregation [14.46141987797362]
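Large language models (LLMs) are increasingly used to evaluate NLP tasks. Recent work suggests using multiple LLMs as judges to improve performance. This work focuses on aggregating predictions from multiple systems when no reference labels are available.
Reference translation (metadata) (2024-10-14T07:13:47Z)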
Source Code Summarization in the Era of Large Language Models [23.715005053430957]
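Large language models (LLMs) have significantly improved performance on code-related tasks. This paper presents a systematic and comprehensive study of code summarization in the era of LLMs.
Reference translation (metadata) (2024-07-09T05:48:42Z)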
BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions [72.56339136017759]
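BigCodeBench is a benchmark that challenges large language models (LLMs) to invoke multiple function calls as tools across 1,140 fine-grained tasks drawn from 139 libraries and 7 domains. The evaluation shows that LLMs cannot yet follow complex instructions to use function calls precisely, with scores of up to 60%, well below the human performance of 97%. The work also proposes a natural-language-oriented variant called BigCodeBench-Instruct.
Reference translation (metadata) (2024-06-22T15:52:04Z)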
ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code [76.84199699772903]
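ML-Bench is a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks. To evaluate both large language models (LLMs) and AI agents, two setups are employed: ML-LLM-Bench, which assesses the text-to-code conversion of LLMs within a predefined deployment environment, and ML-Agent-Bench, which tests autonomous agents on end-to-end task execution within a Linux sandbox environment.
Reference translation (metadata) (2023-11-16T12:03:21Z)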
Large Language Model-Aware In-Context Learning for Code Generation [75.68709482932903]
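Large language models (LLMs) have shown impressive in-context learning (ICL) ability in code generation. This work proposes LAIL (LLM-Aware In-context Learning), a novel learning-based selection approach for code generation.
Reference translation (metadata) (2023-10-15T06:12:58Z)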
CodeApex: A Bilingual Programming Evaluation Benchmark for Large Language Models [43.655927559990616]
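This work proposes CodeApex, a benchmark dataset focusing on the programming comprehension, code generation, and code correction abilities of LLMs. Twelve widely used LLMs, including both general-purpose and specialized models, are evaluated. GPT-4 shows the best programming ability, achieving accuracies of 69%, 54%, and 66%, respectively.
Reference translation (metadata) (2023-09-05T04:12:01Z)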
LLM-Pruner: On the Structural Pruning of Large Language Models [65.02607075556742]
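Large language models (LLMs) have shown remarkable capabilities in language understanding and generation. This work tackles LLM compression under two constraints: being task-agnostic and minimizing reliance on the original training dataset. The method, named LLM-Pruner, adopts structural pruning that selectively removes non-critical coupled structures.
Reference translation (metadata) (2023-05-19T12:10:53Z)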
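One of the summaries in the list above mentions the Pass@k metric and its need for unit tests. As a rough illustration, the sketch below implements the commonly used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n samples are generated per problem and c of them pass the tests. The function name `pass_at_k` and the sample numbers are hypothetical, and nothing here is taken from the listed papers' own code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples passes.

    n: number of generated samples for one problem
    c: number of those samples that pass all unit tests
    k: number of samples the user is allowed to draw
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples fail, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical problem: 10 samples generated, 3 pass the tests.
    for k in (1, 5, 10):
        print(f"pass@{k} =", pass_at_k(n=10, c=3, k=k))
```

The per-problem values are normally averaged over all problems in a benchmark; that step is omitted here.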

The related papers list is generated automatically from the titles and abstracts of papers on this site.
