Fugu-MT paper translation (abstract): SLMJury: Can Small Language Models Judge as Well as Large Ones?

Paper abstract: SLMJury: Can Small Language Models Judge as Well as Large Ones?

arxiv url: http://arxiv.org/abs/2606.07810v1
Date: Fri, 05 Jun 2026 19:38:15 GMT
Status: translation complete
Last updated in system: 2026-06-09 14:42:05.460547
Title: SLMJury: Can Small Language Models Judge as Well as Large Ones?
Title (reference translation): SLMJury: Can small language models judge as well as large ones?
Authors: Anish Laddha, Nitesh Pradhan, Gaurav Srivastava
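Abstract summary: We introduce SLMJury, a framework for evaluating small language models (SLMs) as judges across two paradigms. We benchmark 16 SLM judges (0.6B-14B parameters) from four model families across ten benchmarks. Reliable automated evaluation does not require large proprietary models, yet no single SLM dominates.
Reference score (independently computed attention score): 1.5990700377571574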
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) are widely used as judges for evaluating model outputs, but their high cost, latency, and opacity limit scalability. We introduce SLMJury, a framework for evaluating small language models (SLMs) as judges across two paradigms: closed-ended binary correctness and open-ended quality scoring. We benchmark 16 SLM judges (0.6B-14B parameters) from four model families across ten benchmarks: eight closed-ended tasks spanning mathematical, scientific, and general reasoning (N=64,824 judgments per configuration), plus SummEval and MT-Bench for summarization and conversational scoring. We formalize judging as a budget-conditioned function and study five dimensions. Four findings emerge. (1) The overthinking effect is domain-dependent: for most judges quick 10-token verdicts match or beat extended reasoning on mathematical judging (by 2-7% where they help), while reasoning wins on general tasks by up to 23%. (2) Domain generalization separates model families, with math-to-general accuracy gaps ranging from under 10% to nearly 40%. (3) Closed-ended and open-ended judging draw on different capabilities: the best binary judge (Phi-4) drops to rank 9 on MT-Bench, while reasoning-trained models invert this ordering. (4) Under the Reflect-Critique-Refine (RCR) debate protocol, multi-agent debate degrades accuracy across all tested configurations, whereas the top judges resist six adversarial personas with <=0.55% variance. Reliable automated evaluation does not require large proprietary models, yet no single SLM dominates. The leaderboard is available at https://anishh15.github.io/SLMJury/, and our framework code and pip package are publicly available at https://github.com/anishh15/SLMJury and https://pypi.org/project/slmjury/.
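Abstract (reference translation): Large language models (LLMs) are widely used as judges of model outputs, but their high cost, latency, and opacity limit scalability. We introduce SLMJury, a framework for evaluating small language models (SLMs) as judges under two paradigms: closed-ended binary correctness and open-ended quality scoring. We benchmark 16 SLM judges (0.6B-14B parameters) from four model families on ten benchmarks: eight closed-ended tasks covering mathematical, scientific, and general reasoning (N=64,824 judgments per configuration), plus SummEval and MT-Bench for summarization and conversational scoring. We formalize judging as a budget-conditioned function and study five dimensions. Four findings emerge. (1) The overthinking effect is domain-dependent: for most judges, quick 10-token verdicts match or beat extended reasoning on mathematical judging (by 2-7% where they help), while reasoning wins on general tasks by up to 23%. (2) Domain generalization separates model families, with math-to-general accuracy gaps ranging from under 10% to nearly 40%. (3) Closed-ended and open-ended judging draw on different capabilities: the best binary judge (Phi-4) drops to rank 9 on MT-Bench, while reasoning-trained models invert this ordering. (4) Under the Reflect-Critique-Refine (RCR) debate protocol, multi-agent debate degrades accuracy in every tested configuration, whereas the top judges resist six adversarial personas with <=0.55% variance. Reliable automated evaluation does not require large proprietary models, yet no single SLM dominates. The leaderboard is available at https://anishh15.github.io/SLMJury/, and the framework code and pip package are publicly available at https://github.com/anishh15/SLMJury and https://pypi.org/project/slmjury/.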
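The abstract says judging is formalized as a budget-conditioned function but does not give the formalization. The Python sketch below is one possible reading, not the paper's definition and not the slmjury package API: a binary judge whose verdict depends on the token budget passed to a text generator. The names `Generate`, `parse_verdict`, `judge`, and `stub_generate` are hypothetical, and the stub stands in for a real SLM.

```python
from typing import Callable, Optional

# Hypothetical generator signature (an assumption): (prompt, max_tokens) -> text.
Generate = Callable[[str, int], str]


def parse_verdict(text: str) -> Optional[bool]:
    """Return True/False for the last CORRECT/INCORRECT word, or None if absent."""
    verdict = None
    for token in text.split():
        word = token.strip(".,:;!").upper()
        if word == "CORRECT":
            verdict = True
        elif word == "INCORRECT":
            verdict = False
    return verdict


def judge(generate: Generate, question: str, answer: str, budget: int) -> Optional[bool]:
    """Budget-conditioned binary judge: the verdict is a function of (input, budget)."""
    prompt = (
        f"Question: {question}\nAnswer: {answer}\n"
        "Is the answer CORRECT or INCORRECT?"
    )
    return parse_verdict(generate(prompt, budget))


def stub_generate(prompt: str, max_tokens: int) -> str:
    """Stand-in for an SLM: fixed reasoning text cut off after max_tokens words."""
    full = "Checking the arithmetic: 2 + 2 equals 4. Verdict: CORRECT"
    return " ".join(full.split()[:max_tokens])


# With a 3-word budget the stub is cut off before it reaches a verdict;
# with a 10-word budget the verdict is present and can be parsed.
short_verdict = judge(stub_generate, "What is 2 + 2?", "4", budget=3)
long_verdict = judge(stub_generate, "What is 2 + 2?", "4", budget=10)
```

In this toy setup the same input yields no verdict at a small budget and a verdict at a larger one, which is the kind of budget dependence that finding (1) compares across domains. A real judge would replace `stub_generate` with a call to an actual model.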
