Fugu-MT Paper Translation (Abstract): Gavel: Agent Meets Checklist for Evaluating LLMs on Long-Context Legal Summarization

Paper overview: Gavel: Agent Meets Checklist for Evaluating LLMs on Long-Context Legal Summarization

arxiv url: http://arxiv.org/abs/2601.04424v1
Date: Wed, 07 Jan 2026 22:08:17 GMT
Status: Translation complete
Last updated in system: 2026-01-09 17:01:52.939886
Title: Gavel: Agent Meets Checklist for Evaluating LLMs on Long-Context Legal Summarization
Title (reference translation): Gavel: Agent Meets Checklist for Evaluating LLMs on Long-Context Legal Summarization
Authors: Yao Dou, Wei Xu
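Abstract summary: Large language models (LLMs) now support contexts of up to 1M tokens, but their effectiveness on complex long-context tasks remains unclear. This work studies multi-document legal case summarization, where a single case spans many documents totaling 100K-500K tokens. It introduces Gavel-Ref, a reference-based evaluation framework that performs multi-value checklist evaluation over 26 items.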
Reference score (independently computed attention score): 10.935436958494245
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) now support contexts of up to 1M tokens, but their effectiveness on complex long-context tasks remains unclear. In this paper, we study multi-document legal case summarization, where a single case often spans many documents totaling 100K-500K tokens. We introduce Gavel-Ref, a reference-based evaluation framework with multi-value checklist evaluation over 26 items, as well as residual fact and writing-style evaluations. Using Gavel-Ref, we go beyond the single aggregate scores reported in prior work and systematically evaluate 12 frontier LLMs on 100 legal cases ranging from 32K to 512K tokens, primarily from 2025. Our results show that even the strongest model, Gemini 2.5 Pro, achieves only around 50 of $S_{\text{Gavel-Ref}}$, highlighting the difficulty of the task. Models perform well on simple checklist items (e.g., filing date) but struggle on multi-value or rare ones such as settlements and monitor reports. As LLMs continue to improve and may surpass human-written summaries -- making human references less reliable -- we develop Gavel-Agent, an efficient and autonomous agent scaffold that equips LLMs with six tools to navigate and extract checklists directly from case documents. With Qwen3, Gavel-Agent reduces token usage by 36% while resulting in only a 7% drop in $S_{\text{checklist}}$ compared to end-to-end extraction with GPT-4.1.
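Abstract (reference translation): Large language models (LLMs) now support contexts of up to 1M tokens, but their effectiveness on complex long-context tasks remains unclear. This paper studies multi-document legal case summarization, in which a single case often spans many documents totaling 100K-500K tokens. It introduces Gavel-Ref, a reference-based evaluation framework that performs multi-value checklist evaluation over 26 items, along with residual fact and writing-style evaluations. Using Gavel-Ref, the authors go beyond the single aggregate scores reported in prior work and systematically evaluate 12 frontier LLMs on 100 legal cases ranging from 32K to 512K tokens, primarily from 2025. The results show that even the strongest model, Gemini 2.5 Pro, achieves only around 50 in $S_{\text{Gavel-Ref}}$, which highlights the difficulty of the task. Models perform well on simple checklist items (e.g., filing date) but struggle with multi-value or rare items such as settlements and monitor reports. Because LLMs continue to improve and may come to surpass human-written summaries, making human references less reliable, the authors develop Gavel-Agent, an efficient and autonomous agent scaffold that equips LLMs with six tools to navigate case documents and extract checklists directly from them. With Qwen3, Gavel-Agent reduces token usage by 36% at the cost of only a 7% drop in $S_{\text{checklist}}$ compared to end-to-end extraction with GPT-4.1.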

