Fugu-MT Paper Translation (Abstract): Mediocrity is the key for LLM as a Judge Anchor Selection

Paper overview: Mediocrity is the key for LLM as a Judge Anchor Selection

arxiv url: http://arxiv.org/abs/2603.16848v1
Date: Tue, 17 Mar 2026 17:54:08 GMT
Status: Translation complete
Last updated in system: 2026-03-18 17:42:07.464776
Title: Mediocrity is the key for LLM as a Judge Anchor Selection
Title (reference translation): Mediocrity as Anchor Selection for LLMs
Authors: Shachar Don-Yehiya, Asaf Yehudai, Leshem Choshen, Omri Abend
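Abstract summary: The impact of anchor selection on the reliability of results remains largely unexplored. A poor anchor can dramatically reduce correlation with human rankings. The paper provides guidelines for selecting informative anchors to ensure reliable and efficient evaluation.
Reference score (independently computed attention measure): 28.656244246729184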
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The "LLM-as-a-judge" paradigm has become a standard method for evaluating open-ended generation. To address the quadratic scalability costs of pairwise comparisons, popular benchmarks like Arena-Hard and AlpacaEval compare all models against a single anchor. However, despite its widespread use, the impact of anchor selection on the reliability of the results remains largely unexplored. In this work, we systematically investigate the effect of anchor selection by evaluating 22 different anchors on the Arena-Hard-v2.0 dataset. We find that the choice of anchor is critical: a poor anchor can dramatically reduce correlation with human rankings. We identify that common anchor choices (best-performing and worst-performing models) make poor anchors. Because these extreme anchors are consistently better or worse than all other models, they are seldom indicative of the relative ranking of the models. We further quantify the effect size of anchor selection, showing it is comparable to the selection of a judge model. We conclude with actionable recommendations. First, we conduct a power analysis, and compute sufficient benchmark sizes for anchor-based evaluation, finding that standard benchmark sizes are insufficient for pairwise evaluation and fail to distinguish between competitive models reliably. Second, we provide guidelines for selecting informative anchors to ensure reliable and efficient evaluation practices.
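Abstract (reference translation): The "LLM-as-a-judge" paradigm has become a standard way to evaluate open-ended generation. To address the quadratic scaling cost of pairwise comparisons, popular benchmarks such as Arena-Hard and AlpacaEval compare all models against a single anchor. However, despite its widespread use, the effect of anchor selection on the reliability of the results remains unexplored. This work systematically examines the effect of anchor selection by evaluating 22 anchors on the Arena-Hard-v2.0 dataset. Anchor choice matters: a poor anchor can dramatically reduce correlation with human rankings. Common anchor choices (the best-performing and worst-performing models) make poor anchors. Because these extreme anchors are consistently better or worse than the other models, they rarely indicate the relative ranking of the models. The authors further quantify the effect size of anchor selection and show that it is comparable to the choice of judge model. They close with actionable recommendations. First, they run a power analysis and compute benchmark sizes sufficient for anchor-based evaluation, finding that standard benchmark sizes are insufficient for pairwise evaluation and cannot reliably distinguish competitive models. Second, they provide guidelines for selecting informative anchors to ensure reliable and efficient evaluation.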
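
The abstract's two quantitative points, the quadratic cost of all-pairs comparison and the low statistical power of extreme anchors, can be illustrated with a small sketch. This is not the paper's method: it assumes a Bradley-Terry-style win probability (a logistic function of a hypothetical "skill" gap), treats judgments as independent, and uses the textbook two-proportion z-test sample-size formula. The function names, skill values, and model count are all made up for illustration.

```python
from math import ceil, exp, sqrt
from statistics import NormalDist


def comparisons_all_pairs(n_models: int) -> int:
    """Model pairs to judge per prompt when every model meets every other."""
    return n_models * (n_models - 1) // 2


def comparisons_with_anchor(n_models: int) -> int:
    """Model pairs to judge per prompt when each model only meets one anchor."""
    return n_models - 1


def win_rate(model_skill: float, anchor_skill: float) -> float:
    """ASSUMPTION: Bradley-Terry-style win probability of a model vs the anchor."""
    return 1.0 / (1.0 + exp(-(model_skill - anchor_skill)))


def prompts_needed(skill_a: float, skill_b: float, anchor_skill: float,
                   alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate prompts per model needed to separate two models by their
    win rates against the same anchor (two-sided two-proportion z-test,
    independent samples assumed)."""
    p_a = win_rate(skill_a, anchor_skill)
    p_b = win_rate(skill_b, anchor_skill)
    if p_a == p_b:
        raise ValueError("models have identical win rates; no sample size suffices")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p_a + p_b) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p_a * (1 - p_a) + p_b * (1 - p_b))) ** 2
    return ceil(numerator / (p_a - p_b) ** 2)


if __name__ == "__main__":
    # Hypothetical count of 20 models: all-pairs vs single-anchor cost per prompt.
    print(comparisons_all_pairs(20), comparisons_with_anchor(20))
    # Two close models (skills 0.0 and 0.2) judged against anchors of varying skill.
    for anchor in (0.1, 2.0, 4.0):
        print(anchor, prompts_needed(0.0, 0.2, anchor_skill=anchor))
```

Under these assumptions, an anchor far stronger than both models pushes both win rates toward zero, so the gap between them shrinks faster than the sampling noise does and the required number of prompts grows. An anchor of middling skill keeps the win rates near 0.5, where the same skill gap is easiest to detect.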

Related papers