Fugu-MT Paper Translation (Abstract): TW-LegalBench: Measuring Taiwanese Legal Understanding

Paper summary: TW-LegalBench: Measuring Taiwanese Legal Understanding

arxiv url: http://arxiv.org/abs/2606.18699v1
Date: Wed, 17 Jun 2026 05:25:21 GMT
Status: Translation complete
Last updated in system: 2026-06-18 17:16:51.016666
Title: TW-LegalBench: Measuring Taiwanese Legal Understanding
Title (reference translation): TW-LegalBench: Measuring Taiwanese Legal Understanding
Authors: Fei-Yueh Chen, Chun Huang Lin, Chan Wei Hsu, Kuan Hsuan Yeh, Zih-Ching Chen, Kuan-Ming Chen, Patrick Chung-Chia Huang
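Abstract summary: Large language models (LLMs) have shown impressive capabilities across diverse tasks, yet their performance on jurisdiction-specific legal reasoning remains underexplored. This paper presents TW-LegalBench, which draws on the rich, publicly available official corpus of Taiwan's legal system to fill the gap in evaluating LLMs on Taiwanese law.
Reference score (independently computed attention measure): 1.8953545128709688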
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) have shown impressive capabilities across diverse tasks, yet their performance on jurisdiction-specific legal reasoning remains underexplored. We present TW-LegalBench that utilizes Taiwanese legal system's rich official corpus open to the public to fill the gap in evaluating LLMs on Taiwanese law, among common-law benchmarks that focus on English sources and civil-law benchmarks focusing on sources of Simplified Chinese. TW-LegalBench comprises three task types: (1) over 16,000 multiple-choice questions (MCQs) across five years of official examinations in 18 professional domains; (2) 117 open-ended essay questions (OEQs) from examinations for legal professionals with official scoring rubrics; and (3) more than 14,000 legal judgment prediction (LJP) instances covering hundreds of crime categories. We evaluate 13 LLMs using accuracy for MCQs, a decomposed LLM-as-Judge framework based on the scoring rubric points for OEQs, and metrics for sentencing accuracy and statute citation for LJP. Our results reveal that top-performing models exceed the passing threshold for qualified lawyers (passing rate: 11%) but fall short of that for judges and prosecutors (passing rate: 1~2%). For LJP, while models demonstrate reasonable verdict type accuracy and sentence prediction capability, they struggle to cite exact legal articles. These findings highlight that reliable legal text generation remains challenging for LLMs, even though their performance on qualification examinations approaches human level.
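Abstract (reference translation): Large language models (LLMs) have shown impressive capabilities across diverse tasks, yet their performance on jurisdiction-specific legal reasoning remains underexplored. This paper presents TW-LegalBench, which draws on the rich, publicly available official corpus of Taiwan's legal system to fill the gap in evaluating LLMs on Taiwanese law. TW-LegalBench comprises three task types: (1) over 16,000 multiple-choice questions (MCQs) from five years of official examinations across 18 professional domains; (2) 117 open-ended essay questions (OEQs) from examinations for legal professionals, with official scoring rubrics; and (3) more than 14,000 legal judgment prediction (LJP) instances covering hundreds of crime categories. We evaluate 13 LLMs using accuracy for MCQs, a decomposed LLM-as-Judge framework based on scoring-rubric points for OEQs, and metrics for sentencing accuracy and statute citation for LJP. The results show that top-performing models exceed the passing threshold for qualified lawyers (passing rate: 11%) but fall short of the threshold for judges and prosecutors (passing rate: 1~2%). For LJP, models show reasonable verdict-type accuracy and sentence prediction capability, but they struggle to cite the exact legal articles. These findings indicate that reliable legal text generation remains challenging for LLMs, even as their performance on qualification examinations approaches human level.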

Related papers