Fugu-MT Paper Translation (Abstract): LegalCiteBench: Evaluating Citation Reliability in Legal Language Models

Paper overview: LegalCiteBench: Evaluating Citation Reliability in Legal Language Models

arxiv url: http://arxiv.org/abs/2605.10186v1
Date: Mon, 11 May 2026 08:37:54 GMT
Status: Translation complete
Last updated in system: 2026-05-12 23:28:50.65286
Title: LegalCiteBench: Evaluating Citation Reliability in Legal Language Models
Title (reference translation): LegalCiteBench: Evaluating Citation Reliability in Legal Language Models
Authors: Sijia Chen, Hang Yin, Shunfan Zhou,
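Abstract summary: LegalCiteBench is a benchmark for studying closed-book citation recovery, citation verification, and case matching in legal language models. The benchmark covers five citation-centric tasks: citation retrieval, citation completion, citation error detection, case matching, and case verification and correction.
Reference score (independently computed attention score): 14.281332347684872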
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) are increasingly integrated into legal drafting and research workflows, where incorrect citations or fabricated precedents can cause serious professional harm. Existing legal benchmarks largely emphasize statutory reasoning, contract understanding, or general legal question answering, but they do not directly study a central common-law failure mode: when asked to provide case authorities without external grounding, models may return plausible-looking but incorrect citations or cases. We introduce LegalCiteBench, a benchmark for studying closed-book citation recovery, citation verification, and case matching in legal language models. LegalCiteBench contains approximately 24K evaluation instances constructed from 1,000 real U.S. judicial opinions from the Case Law Access Project. The benchmark covers five citation-centric tasks: citation retrieval, citation completion, citation error detection, case matching, and case verification and correction. Across 21 LLMs, exact citation recovery remains highly challenging in this closed-book setting: even the strongest models score below 7/100 on citation retrieval and completion. Within the evaluated models, scale and legal-domain pretraining provide limited gains and do not resolve this difficulty. Models also frequently provide concrete but incorrect or low-overlap authorities under our evaluation protocol, with Misleading Answer Rates (MAR) exceeding 94% for 20 of 21 evaluated models on retrieval-heavy tasks. A prompt-only abstention experiment shows that explicit uncertainty instructions reduce some confident fabrication but do not improve citation correctness. LegalCiteBench is intended as a diagnostic framework for studying authority generation failures, verification behavior, and abstention when external grounding is absent, incomplete, or bypassed.
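Abstract (reference translation): Large language models (LLMs) are increasingly integrated into legal drafting and research workflows, where incorrect citations or fabricated precedents can cause serious professional harm. Existing legal benchmarks focus mainly on statutory reasoning, contract understanding, or general legal question answering, and do not directly study a central common-law failure mode: a model asked for case authorities without external grounding may return plausible-looking but incorrect citations or cases. This paper introduces LegalCiteBench, a benchmark for closed-book citation recovery, citation verification, and case matching in legal language models. LegalCiteBench contains approximately 24K evaluation instances built from 1,000 real U.S. judicial opinions from the Case Law Access Project. The benchmark covers five citation-centric tasks: citation retrieval, citation completion, citation error detection, case matching, and case verification and correction. Across 21 LLMs, exact citation recovery is very difficult in this closed-book setting: even the strongest models score below 7/100 on citation retrieval and completion. Among the evaluated models, scale and legal-domain pretraining provide limited gains and do not resolve this difficulty. Under the evaluation protocol, models also frequently provide concrete but incorrect or low-overlap authorities, with Misleading Answer Rates (MAR) exceeding 94% for 20 of the 21 evaluated models on retrieval-heavy tasks. A prompt-only abstention experiment shows that explicit uncertainty instructions reduce some confident fabrication but do not improve citation correctness. LegalCiteBench is intended as a diagnostic framework for studying authority generation failures, verification behavior, and abstention when external grounding is absent, incomplete, or bypassed.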


Related papers