Fugu-MT Paper Translation (Abstract): LLM-as-a-Judge is Bad, Based on AI Attempting the Exam Qualifying for the Member of the Polish National Board of Appeal

Paper overview: LLM-as-a-Judge is Bad, Based on AI Attempting the Exam Qualifying for the Member of the Polish National Board of Appeal

arxiv url: http://arxiv.org/abs/2511.04205v1
Date: Thu, 06 Nov 2025 09:11:20 GMT
Status: Translation complete
Last updated in system: 2025-11-07 20:17:53.374094
Title: LLM-as-a-Judge is Bad, Based on AI Attempting the Exam Qualifying for the Member of the Polish National Board of Appeal
Authors: Michał Karp, Anna Kubaszewska, Magdalena Król, Robert Król, Aleksander Smywiński-Pohl, Mateusz Szymański, Witold Wydmański
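Abstract summary: The paper describes the structure of the exam, which includes a knowledge test on public procurement law and a written judgment. Several LLMs were tested in closed-book and various Retrieval-Augmented Generation settings. The results show that the models achieved satisfactory scores on the knowledge test but did not reach the passing threshold in the practical written part.
Reference score (independently computed attention measure): 34.008574054602356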
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This study provides an empirical assessment of whether current large language models (LLMs) can pass the official qualifying examination for membership in Poland's National Appeal Chamber (Krajowa Izba Odwoławcza). The authors examine two related ideas: using LLMs as actual exam candidates and applying the 'LLM-as-a-judge' approach, in which model-generated answers are automatically evaluated by other models. The paper describes the structure of the exam, which includes a multiple-choice knowledge test on public procurement law and a written judgment, and presents the hybrid information recovery and extraction pipeline built to support the models. Several LLMs (including GPT-4.1, Claude 4 Sonnet and Bielik-11B-v2.6) were tested in closed-book and various Retrieval-Augmented Generation settings. The results show that although the models achieved satisfactory scores in the knowledge test, none met the passing threshold in the practical written part, and the evaluations of the 'LLM-as-a-judge' often diverged from the judgments of the official examining committee. The authors highlight key limitations: susceptibility to hallucinations, incorrect citation of legal provisions, weaknesses in logical argumentation, and the need for close collaboration between legal experts and technical teams. The findings indicate that, despite rapid technological progress, current LLMs cannot yet replace human judges or independent examiners in Polish public procurement adjudication.

Related papers