Fugu-MT paper translation (abstract): Magis-Bench: Evaluating LLMs on Magistrate-Level Legal Tasks

Paper abstract: Magis-Bench: Evaluating LLMs on Magistrate-Level Legal Tasks

arxiv url: http://arxiv.org/abs/2605.08437v1
Date: Fri, 08 May 2026 20:00:14 GMT
Status: Translation complete
Last updated in system: 2026-05-12 23:28:49.643578
Title: Magis-Bench: Evaluating LLMs on Magistrate-Level Legal Tasks
Title (reference translation): Magis-Bench: Evaluating LLMs on Magistrate-Level Legal Tasks
Authors: Ramon Pires, Thales Sales Almeida, Celio Larcher Junior, Giovana Bonás, Hugo Abonizio, Marcos Piau, Roseval Malaquias Junior, Thiago Laitz, Rodrigo Nogueira
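Abstract summary: Magis-Bench is a benchmark for evaluating LLMs on magistrate-level writing tasks. It comprises 74 questions from eight examinations conducted between 2023 and 2025. The authors evaluate 23 LLMs using an LLM-as-a-judge methodology.
Reference score (independently computed attention score): 8.678622777553263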
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Existing benchmarks for legal AI focus primarily on tasks where LLMs must produce legal arguments or documents, yet the capacity to judge such arguments -- weighing competing claims, applying doctrine to facts, and rendering reasoned decisions -- is arguably as fundamental to a well-functioning legal system as advocacy itself. We introduce Magis-Bench, a benchmark for evaluating LLMs on magistrate-level writing tasks derived from recent Brazilian competitive examinations for judicial positions. Magis-Bench comprises 74 questions from eight examinations conducted between 2023 and 2025, including discursive legal analysis questions with multi-turn structure and practical exercises requiring the composition of complete civil and criminal judicial sentences. We evaluate 23 state-of-the-art LLMs using an LLM-as-a-judge methodology with four independent frontier models as evaluators. Our results show strong inter-judge agreement (Kendall's $W = 0.984$; pairwise Kendall's $\tau \ge 0.897$), with Google's Gemini-3-Pro-Preview achieving the highest average score (6.97/10), followed by Gemini-3-Flash-Preview (6.67) and Claude-4.5-Opus (6.46). Even the best-performing models score below 70% of the maximum, indicating that judicial-level legal reasoning and writing remain challenging for current LLMs. We release the complete benchmark, model outputs, and evaluation code to support further research on legal AI capabilities.
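Abstract (reference translation): Existing legal AI benchmarks focus mainly on tasks in which LLMs must produce legal arguments or documents, yet the ability to judge such arguments (weighing competing claims, applying doctrine to facts, and rendering reasoned decisions) is arguably as fundamental to a well-functioning legal system as advocacy itself. Magis-Bench is a benchmark for evaluating LLMs on magistrate-level writing tasks drawn from recent Brazilian competitive examinations for judicial positions. It comprises 74 questions from eight examinations held between 2023 and 2025, including discursive legal analysis questions with multi-turn structure and practical exercises that require composing complete civil and criminal judicial sentences. We evaluate 23 state-of-the-art LLMs using an LLM-as-a-judge methodology with four independent frontier models as evaluators. The results show strong inter-judge agreement (Kendall's $W = 0.984$; pairwise Kendall's $\tau \ge 0.897$), with Google's Gemini-3-Pro-Preview achieving the highest average score (6.97/10), followed by Gemini-3-Flash-Preview (6.67) and Claude-4.5-Opus (6.46). Even the best-performing models score below 70% of the maximum, indicating that judicial-level legal reasoning and writing remain challenging for current LLMs. We release the complete benchmark, model outputs, and evaluation code to support further research on legal AI capabilities.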
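
Note on the agreement statistics: the abstract reports Kendall's $W$ and pairwise Kendall's $\tau$ across the four judge models but does not show how they are computed. The following is a minimal sketch of the standard tie-free formulas, not the authors' evaluation code. The function names and the `judge_scores` values are hypothetical and chosen only for illustration; the sketch assumes each judge assigns distinct scores to the models.

```python
from itertools import combinations


def to_ranks(scores):
    """Rank scores from 1 (lowest) to n (highest); assumes no tied scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


def kendalls_w(score_matrix):
    """Kendall's W for m judges (rows) scoring n models (columns), no ties."""
    m = len(score_matrix)
    n = len(score_matrix[0])
    rank_rows = [to_ranks(row) for row in score_matrix]
    # Sum of the ranks each model received across all judges.
    rank_sums = [sum(row[j] for row in rank_rows) for j in range(n)]
    mean_sum = m * (n + 1) / 2
    s = sum((r - mean_sum) ** 2 for r in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


def kendall_tau(a, b):
    """Kendall's tau-a between two judges' score lists, no ties."""
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        sign = (a[i] - a[j]) * (b[i] - b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(a) * (len(a) - 1) / 2
    return (concordant - discordant) / pairs


# Hypothetical scores: 4 judges (rows) x 4 models (columns). Not from the paper.
judge_scores = [
    [6.9, 6.6, 6.4, 5.1],
    [7.1, 6.8, 6.3, 5.0],
    [6.8, 6.7, 6.5, 4.9],
    [7.0, 6.4, 6.6, 5.2],
]

w = kendalls_w(judge_scores)
taus = [kendall_tau(a, b) for a, b in combinations(judge_scores, 2)]
print(f"Kendall's W = {w:.3f}, minimum pairwise tau = {min(taus):.3f}")
```

$W$ ranges from 0 (no agreement) to 1 (identical rankings from every judge), so the reported $W = 0.984$ means the four judges ranked the 23 models almost identically. If judges can assign tied scores, a tie-corrected variant of both statistics would be needed.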

Related papers