PLawBench: A Rubric-Based Benchmark for Evaluating LLMs in Real-World Legal Practice
- URL: http://arxiv.org/abs/2601.16669v2
- Date: Wed, 28 Jan 2026 12:26:58 GMT
- Title: PLawBench: A Rubric-Based Benchmark for Evaluating LLMs in Real-World Legal Practice
- Authors: Yuzhen Shi, Huanghai Liu, Yiran Hu, Gaojie Song, Xinran Xu, Yubo Ma, Tianyi Tang, Li Zhang, Qingjing Chen, Di Feng, Wenbo Lv, Weiheng Wu, Kexin Yang, Sen Yang, Wei Wang, Rongyao Shi, Yuanyang Qiu, Yuemeng Qi, Jingwen Zhang, Xiaoyu Sui, Yifan Chen, Yi Zhang, An Yang, Bowen Yu, Dayiheng Liu, Junyang Lin, Weixing Shen, Bing Zhao, Charles L. A. Clarke, Hu Wei
- Abstract summary: We introduce PLawBench, a practical benchmark for evaluating large language models (LLMs) in legal practice scenarios. PLawBench comprises 850 questions across 13 practical legal scenarios, with each question accompanied by expert-designed evaluation rubrics. Using an LLM-based evaluator aligned with human expert judgments, we evaluate 10 state-of-the-art LLMs.
- Score: 67.71760070255425
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As large language models (LLMs) are increasingly applied to legal domain-specific tasks, evaluating their ability to perform legal work in real-world settings has become essential. However, existing legal benchmarks rely on simplified and highly standardized tasks, failing to capture the ambiguity, complexity, and reasoning demands of real legal practice. Moreover, prior evaluations often adopt coarse, single-dimensional metrics and do not explicitly assess fine-grained legal reasoning. To address these limitations, we introduce PLawBench, a Practical Law Benchmark designed to evaluate LLMs in realistic legal practice scenarios. Grounded in real-world legal workflows, PLawBench models the core processes of legal practitioners through three task categories: public legal consultation, practical case analysis, and legal document generation. These tasks assess a model's ability to identify legal issues and key facts, perform structured legal reasoning, and generate legally coherent documents. PLawBench comprises 850 questions across 13 practical legal scenarios, with each question accompanied by expert-designed evaluation rubrics, resulting in approximately 12,500 rubric items for fine-grained assessment. Using an LLM-based evaluator aligned with human expert judgments, we evaluate 10 state-of-the-art LLMs. Experimental results show that none achieves strong performance on PLawBench, revealing substantial limitations in the fine-grained legal reasoning capabilities of current LLMs and highlighting important directions for future evaluation and development of legal LLMs. Data is available at: https://github.com/skylenage/PLawbench.
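The rubric-based protocol the abstract describes (expert-designed rubric items per question, each checked by an LLM evaluator aligned with human experts) can be pictured with a short sketch. Everything below is a hypothetical illustration under assumed names and data formats (`RubricItem`, `judge_item`, the YES/NO judging prompt); it is not the released PLawBench evaluation code.

```python
# Hypothetical sketch of rubric-based scoring in the spirit of PLawBench:
# each question carries expert-designed rubric items, and an LLM judge
# decides, item by item, whether a model's answer satisfies each one.
# The data format and function names here are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class RubricItem:
    criterion: str  # e.g. "Identifies the limitation-period issue"
    weight: float   # relative importance of this rubric item


def judge_item(answer: str, item: RubricItem) -> bool:
    """Ask an LLM judge whether `answer` satisfies one rubric item.

    Stubbed here; in practice this would call the LLM-based evaluator
    that the paper reports was aligned with human expert judgments.
    """
    prompt = (
        "You are a legal evaluation expert. Reply YES or NO.\n"
        f"Criterion: {item.criterion}\n"
        f"Candidate answer:\n{answer}\n"
        "Does the answer satisfy the criterion?"
    )
    reply = "YES"  # placeholder standing in for a real LLM call on `prompt`
    return reply.strip().upper().startswith("YES")


def score_answer(answer: str, rubric: list[RubricItem]) -> float:
    """Weighted fraction of rubric items satisfied, in [0, 1]."""
    total = sum(item.weight for item in rubric)
    earned = sum(item.weight for item in rubric if judge_item(answer, item))
    return earned / total if total else 0.0


if __name__ == "__main__":
    rubric = [
        RubricItem("Identifies the governing statute", 2.0),
        RubricItem("Applies the facts to each legal element", 3.0),
    ]
    print(f"rubric score: {score_answer('sample model answer ...', rubric):.2f}")
```

With roughly 12,500 rubric items over 850 questions, a per-item judgment of this kind yields the fine-grained scores the paper reports, rather than a single coarse metric per answer.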
Related papers
- LegalOne: A Family of Foundation Models for Reliable Legal Reasoning [54.57434222018289]
We present LegalOne, a family of foundational models specifically tailored for the Chinese legal domain. LegalOne is developed through a comprehensive three-phase pipeline designed to master legal reasoning. We publicly release the LegalOne weights and the LegalKit evaluation framework to advance the field of Legal AI.
arXiv Detail & Related papers (2026-01-31T10:18:32Z) - Evaluation of Large Language Models in Legal Applications: Challenges, Methods, and Future Directions [34.91946661563455]
Large language models (LLMs) are being increasingly integrated into legal applications. This survey identifies key challenges in evaluating LLMs for legal tasks grounded in real-world legal practice.
arXiv Detail & Related papers (2026-01-21T18:51:37Z) - Chinese Labor Law Large Language Model Benchmark [11.552694592413303]
We present LabourLawLLM, a large language model tailored to Chinese labor law. We also introduce LabourLawBench, a benchmark covering diverse labor-law tasks. Experiments show that LabourLawLLM consistently outperforms general-purpose and existing legal-specific LLMs.
arXiv Detail & Related papers (2026-01-15T01:27:29Z) - LexGenius: An Expert-Level Benchmark for Large Language Models in Legal General Intelligence [74.05988707492058]
Legal general intelligence (GI) refers to artificial intelligence (AI) that encompasses legal understanding, reasoning, and decision-making. Existing benchmarks are result-oriented and fail to systematically evaluate the legal intelligence of large language models (LLMs). We propose LexGenius, an expert-level Chinese legal benchmark for evaluating legal GI in LLMs.
arXiv Detail & Related papers (2025-12-04T08:48:02Z) - CLaw: Benchmarking Chinese Legal Knowledge in Large Language Models - A Fine-grained Corpus and Reasoning Analysis [13.067377421250557]
Large Language Models (LLMs) are increasingly tasked with analyzing legal texts and citing relevant statutes. This paper introduces CLaw, a novel benchmark specifically engineered to meticulously evaluate LLMs on Chinese legal knowledge and its application in reasoning.
arXiv Detail & Related papers (2025-09-25T14:19:51Z) - GLARE: Agentic Reasoning for Legal Judgment Prediction [60.13483016810707]
Legal judgment prediction (LJP) has become increasingly important in the legal field. Existing large language models (LLMs) suffer from insufficient reasoning due to a lack of legal knowledge. We introduce GLARE, an agentic legal reasoning framework that dynamically acquires key legal knowledge by invoking different modules.
arXiv Detail & Related papers (2025-08-22T13:38:12Z) - LEXam: Benchmarking Legal Reasoning on 340 Law Exams [76.3521146499006]
We introduce LEXam, a novel benchmark derived from 340 law exams spanning 116 law school courses across a range of subjects and degree levels. The dataset comprises 4,886 law exam questions in English and German, including 2,841 long-form, open-ended questions and 2,045 multiple-choice questions. Our results underscore the effectiveness of the dataset in differentiating between models with varying capabilities.
arXiv Detail & Related papers (2025-05-19T08:48:12Z) - LegalAgentBench: Evaluating LLM Agents in Legal Domain [53.70993264644004]
LegalAgentBench is a benchmark specifically designed to evaluate LLM Agents in the Chinese legal domain. LegalAgentBench includes 17 corpora from real-world legal scenarios and provides 37 tools for interacting with external knowledge.
arXiv Detail & Related papers (2024-12-23T04:02:46Z) - InternLM-Law: An Open Source Chinese Legal Large Language Model [72.2589401309848]
InternLM-Law is a specialized LLM tailored for addressing diverse legal queries related to Chinese laws.
We meticulously construct a dataset in the Chinese legal domain, encompassing over 1 million queries.
InternLM-Law achieves the highest average performance on LawBench, outperforming state-of-the-art models, including GPT-4, on 13 out of 20 subtasks.
arXiv Detail & Related papers (2024-06-21T06:19:03Z) - LAiW: A Chinese Legal Large Language Models Benchmark [17.66376880475554]
General and legal domain LLMs have demonstrated strong performance in various tasks of LegalAI.
We build LAiW, the first Chinese legal LLM benchmark grounded in the logic of legal practice.
arXiv Detail & Related papers (2023-10-09T11:19:55Z) - LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models [15.98468948605927]
LegalBench is a benchmark consisting of 162 tasks covering six different types of legal reasoning.
This paper describes LegalBench, presents an empirical evaluation of 20 open-source and commercial LLMs, and illustrates the types of research explorations LegalBench enables.
arXiv Detail & Related papers (2023-08-20T22:08:03Z)