Fugu-MT Paper Translation (Abstract): The Silent Judge: Unacknowledged Shortcut Bias in LLM-as-a-Judge

Paper overview: The Silent Judge: Unacknowledged Shortcut Bias in LLM-as-a-Judge

arxiv url: http://arxiv.org/abs/2509.26072v1
Date: Tue, 30 Sep 2025 10:48:08 GMT
Status: Translation complete
Last updated in system: 2025-10-01 14:45:00.106164
Title: The Silent Judge: Unacknowledged Shortcut Bias in LLM-as-a-Judge
Title (reference translation): The Not-Guilty Verdict of LLM-as-a-Judge
Authors: Arash Marioriyad, Mohammad Hossein Rohban, Mahdieh Soleymani Baghshah
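Abstract summary: Large language models (LLMs) are increasingly deployed as automatic judges that evaluate system outputs in tasks such as summarization, dialogue, and creative writing. The paper shows that current LLM judges fail both to base their verdicts solely on response quality and to acknowledge the factors shaping their decisions, relying instead on shortcuts introduced in the prompt.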
Reference score (independently computed attention level): 17.555073770285095
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Large language models (LLMs) are increasingly deployed as automatic judges to evaluate system outputs in tasks such as summarization, dialogue, and creative writing. A faithful judge should base its verdicts solely on response quality and explicitly acknowledge the factors shaping its decision. We show that current LLM judges fail on both counts by relying on shortcuts introduced in the prompt. Our study uses two evaluation datasets: ELI5, a benchmark for long-form question answering, and LitBench, a recent benchmark for creative writing. Both datasets provide pairwise comparisons, where the evaluator must choose which of two responses is better. From each dataset we construct 100 pairwise judgment tasks and employ two widely used models, GPT-4o and Gemini-2.5-Flash, as evaluators in the role of LLM-as-a-judge. For each pair, we assign superficial cues to the responses, provenance cues indicating source identity (Human, Expert, LLM, or Unknown) and recency cues indicating temporal origin (Old, 1950 vs. New, 2025), while keeping the rest of the prompt fixed. Results reveal consistent verdict shifts: both models exhibit a strong recency bias, systematically favoring new responses over old, as well as a clear provenance hierarchy (Expert > Human > LLM > Unknown). These biases are especially pronounced in GPT-4o and in the more subjective and open-ended LitBench domain. Crucially, cue acknowledgment is rare: justifications almost never reference the injected cues, instead rationalizing decisions in terms of content qualities. These findings demonstrate that current LLM-as-a-judge systems are shortcut-prone and unfaithful, undermining their reliability as evaluators in both research and deployment.
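Abstract (reference translation): Large language models (LLMs) are increasingly deployed as automatic judges that evaluate system outputs in tasks such as summarization, dialogue, and creative writing. A faithful judge must base its verdicts solely on response quality and explicitly acknowledge the factors that shape its decision. The paper shows that current LLM judges fail on both counts by relying on shortcuts introduced in the prompt. Two evaluation datasets are used: ELI5, a benchmark for long-form question answering, and LitBench, a recent benchmark for creative writing. Both datasets provide pairwise comparisons in which the evaluator must choose the better of two responses. From each dataset, 100 pairwise judgment tasks are constructed, and two widely used models, GPT-4o and Gemini-2.5-Flash, serve as evaluators in the LLM-as-a-judge role. For each pair, the responses are assigned provenance cues indicating source identity (Human, Expert, LLM, or Unknown) and recency cues indicating temporal origin (Old, 1950 vs. New, 2025), while the rest of the prompt is kept fixed. Both models systematically favor new responses over old ones and also show a clear provenance hierarchy (Expert > Human > LLM > Unknown). These biases are especially pronounced in GPT-4o and in the more subjective, open-ended LitBench domain. Justifications almost never reference the injected cues and instead rationalize decisions in terms of content quality. These results show that current LLM-as-a-judge systems are shortcut-prone and unfaithful, which undermines their reliability in both research and deployment.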
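The cue-injection protocol described in the abstract could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the bracketed cue format, the prompt wording, and the helper names `label_response`, `build_prompt`, `verdict_shift_rate`, and `acknowledgment_rate` are all hypothetical, and the judge model call is left out entirely. Only the cue values (Human, Expert, LLM, Unknown; Old 1950 vs. New 2025) come from the abstract.

```python
from typing import Optional

# Cue values taken from the abstract; how they are rendered is an assumption.
PROVENANCE = ("Human", "Expert", "LLM", "Unknown")
RECENCY = {"Old": 1950, "New": 2025}


def label_response(text: str, provenance: Optional[str] = None,
                   recency: Optional[str] = None) -> str:
    """Prefix a response with hypothetical cue lines; the content is unchanged."""
    lines = []
    if provenance is not None:
        if provenance not in PROVENANCE:
            raise ValueError(f"unknown provenance cue: {provenance}")
        lines.append(f"[Source: {provenance}]")
    if recency is not None:
        if recency not in RECENCY:
            raise ValueError(f"unknown recency cue: {recency}")
        lines.append(f"[Written: {RECENCY[recency]}]")
    lines.append(text)
    return "\n".join(lines)


def build_prompt(question: str, response_a: str, response_b: str) -> str:
    """Assemble a pairwise judging prompt (the wording is an assumption)."""
    return (
        "Which response is better? Reply with 'A' or 'B' and justify.\n\n"
        f"Question: {question}\n\n"
        f"Response A:\n{response_a}\n\n"
        f"Response B:\n{response_b}\n"
    )


def verdict_shift_rate(baseline: list[str], cued: list[str]) -> float:
    """Fraction of pairs whose verdict differs between the two conditions."""
    if not baseline or len(baseline) != len(cued):
        raise ValueError("verdict lists must be non-empty and of equal length")
    changed = sum(b != c for b, c in zip(baseline, cued))
    return changed / len(baseline)


def acknowledgment_rate(justifications: list[str], cue_terms: list[str]) -> float:
    """Fraction of justifications mentioning any cue term (crude substring match)."""
    if not justifications:
        raise ValueError("justifications must be non-empty")
    terms = [t.lower() for t in cue_terms]
    hits = sum(any(t in j.lower() for t in terms) for j in justifications)
    return hits / len(justifications)
```

In use, one would collect a judge's verdicts on the same pairs with and without cues, pass both lists to `verdict_shift_rate`, and pass the judge's written justifications to `acknowledgment_rate`. The substring match is deliberately simple and would over-count words such as "expertly"; a real analysis would likely need a more careful check.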

Related papers