Fugu-MT Paper Translation (Abstract): Self-Preference Bias in Rubric-Based Evaluation of Large Language Models

Paper overview: Self-Preference Bias in Rubric-Based Evaluation of Large Language Models

arxiv url: http://arxiv.org/abs/2604.06996v1
Date: Wed, 08 Apr 2026 12:13:53 GMT
Status: Translation complete
System update date: 2026-04-09 17:30:51.510428
Title: Self-Preference Bias in Rubric-Based Evaluation of Large Language Models
Title (reference translation): Self-Preference Bias in Rubric-Based Evaluation of Large Language Models
Authors: José Pombal, Ricardo Rei, André F. T. Martins
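Abstract summary: This work is the first study of self-preference bias (SPB) in rubric-based evaluation. It shows that SPB persists even when the evaluation criteria are entirely objective. It also analyzes the factors that drive SPB under subjective rubrics, finding that negative rubrics, extreme rubric lengths, and subjective topics such as emergency referrals are particularly susceptible.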
Reference score (independently computed attention score): 24.994793163290737
License: http://creativecommons.org/licenses/by/4.0/
Abstract: LLM-as-a-judge has become the de facto approach for evaluating LLM outputs. However, judges are known to exhibit self-preference bias (SPB): they tend to favor outputs produced by themselves or by models from their own family. This skews evaluations and, thus, hinders model development, especially in settings of recursive self-improvement. We present the first study of SPB in rubric-based evaluation, an increasingly popular benchmarking paradigm where judges issue binary verdicts on individual evaluation criteria, instead of assigning holistic scores or rankings. Using IFEval, a benchmark with programmatically verifiable rubrics, we show that SPB persists even when evaluation criteria are entirely objective: among rubrics where generators fail, judges can be up to 50% more likely to incorrectly mark them as satisfied when the output is their own. We also find that, similarly to other evaluation paradigms, ensembling multiple judges helps mitigate SPB, but without fully eliminating it. On HealthBench, a medical chat benchmark with subjective rubrics, we observe that SPB skews model scores by up to 10 points, a potentially decisive margin when ranking frontier models. We analyze the factors that drive SPB in this setting, finding that negative rubrics, extreme rubric lengths, and subjective topics like emergency referrals are particularly susceptible.
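Abstract (reference translation): LLM-as-a-judge has become the de facto approach for evaluating LLM outputs. However, judges are known to exhibit self-preference bias (SPB): they tend to favor outputs produced by themselves or by models from their own family. This skews evaluations and hinders model development, especially in settings of recursive self-improvement. This paper presents the first study of SPB in rubric-based evaluation, a benchmarking paradigm in which judges issue binary verdicts on individual evaluation criteria instead of assigning holistic scores or rankings. Using IFEval, whose rubrics are programmatically verifiable, it shows that SPB persists even when the evaluation criteria are entirely objective: among rubrics that generators fail, judges can be up to 50% more likely to incorrectly mark them as satisfied when the output is their own. As in other evaluation paradigms, ensembling multiple judges mitigates SPB but does not fully eliminate it. On HealthBench, a medical chat benchmark with subjective rubrics, SPB skews model scores by up to 10 points, a potentially decisive margin when ranking frontier models. An analysis of the factors that drive SPB in this setting finds that negative rubrics, extreme rubric lengths, and subjective topics such as emergency referrals are particularly susceptible.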
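Note (illustrative sketch): The abstract does not spell out how "up to 50% more likely" is computed, so the following is only one plausible reading, not the authors' method. It assumes each record holds a judge name, a generator name, a programmatic ground-truth flag, and the judge's binary verdict; it treats "own output" as an exact model-name match (the paper also considers model families); and it uses a simple majority vote as a stand-in for judge ensembling. All field names, model names, and toy verdicts are hypothetical and are not taken from the paper.

```python
def false_positive_rate(records, own):
    """Share of failed rubrics that the judge wrongly marked as satisfied.

    own=True keeps records where the judge graded its own output,
    own=False keeps records where it graded another model's output.
    """
    failed = [
        r for r in records
        if not r["truth"] and (r["judge"] == r["generator"]) == own
    ]
    if not failed:
        return None  # no failed rubrics in this group
    return sum(r["verdict"] for r in failed) / len(failed)


def relative_spb(records):
    """Relative increase in false positives on own outputs (assumed metric)."""
    fpr_self = false_positive_rate(records, own=True)
    fpr_other = false_positive_rate(records, own=False)
    if fpr_self is None or not fpr_other:
        return None  # undefined without a non-zero baseline
    return fpr_self / fpr_other - 1.0


def majority_vote(verdicts):
    """Ensemble several judges' binary verdicts by strict majority."""
    return sum(verdicts) * 2 > len(verdicts)


def make(judge, generator, verdicts):
    """Build toy records for rubrics the generator actually failed."""
    return [
        {"judge": judge, "generator": generator, "truth": False, "verdict": v}
        for v in verdicts
    ]


# Made-up verdicts: model_a wrongly passes 3 of 4 of its own failed rubrics,
# but only 2 of 4 failed rubrics written by model_b.
records = (
    make("model_a", "model_a", [True, True, True, False])
    + make("model_a", "model_b", [True, True, False, False])
)

print(false_positive_rate(records, own=True))   # 0.75
print(false_positive_rate(records, own=False))  # 0.5
print(relative_spb(records))                    # 0.5
print(majority_vote([True, False, False]))      # False
```

On this toy data the judge's false-positive rate is 0.75 on its own outputs versus 0.5 on another model's, a relative increase of 0.5, which is what "50% more likely" would mean under this reading.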

List of related papers