CLaw: Benchmarking Chinese Legal Knowledge in Large Language Models - A Fine-grained Corpus and Reasoning Analysis
- URL: http://arxiv.org/abs/2509.21208v1
- Date: Thu, 25 Sep 2025 14:19:51 GMT
- Title: CLaw: Benchmarking Chinese Legal Knowledge in Large Language Models - A Fine-grained Corpus and Reasoning Analysis
- Authors: Xinzhe Xu, Liang Zhao, Hongshen Xu, Chen Chen,
- Abstract summary: Large Language Models (LLMs) are increasingly tasked with analyzing legal texts and citing relevant statutes. This paper introduces CLaw, a novel benchmark specifically engineered to meticulously evaluate LLMs on Chinese legal knowledge and its application in reasoning.
- Score: 13.067377421250557
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) are increasingly tasked with analyzing legal texts and citing relevant statutes, yet their reliability is often compromised by general pre-training that ingests legal texts without specialized focus, obscuring the true depth of their legal knowledge. This paper introduces CLaw, a novel benchmark specifically engineered to meticulously evaluate LLMs on Chinese legal knowledge and its application in reasoning. CLaw comprises two key components: (1) a comprehensive, fine-grained corpus of all 306 Chinese national statutes, segmented to the subparagraph level and incorporating precise historical revision timestamps for rigorous recall evaluation (64,849 entries), and (2) a challenging set of 254 case-based reasoning instances derived from materials curated by China's Supreme Court to assess the practical application of legal knowledge. Our empirical evaluation reveals that most contemporary LLMs significantly struggle to faithfully reproduce legal provisions. As accurate retrieval and citation of legal provisions form the basis of legal reasoning, this deficiency critically undermines the reliability of their responses. We contend that achieving trustworthy legal reasoning in LLMs requires a robust synergy of accurate knowledge retrieval--potentially enhanced through supervised fine-tuning (SFT) or retrieval-augmented generation (RAG)--and strong general reasoning capabilities. This work provides an essential benchmark and critical insights for advancing domain-specific LLM reasoning, particularly within the complex legal sphere.
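Because the recall evaluation hinges on reproducing statutory text at the subparagraph level under the correct revision, the minimal Python sketch below shows one plausible way such a corpus entry could be represented and scored for verbatim recall. The field names, the whitespace-stripping normalization, and the exact-match criterion are illustrative assumptions, not the schema or metric actually used by CLaw.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ProvisionEntry:
    """One subparagraph-level entry of a statutory corpus.

    Field names are illustrative assumptions; the paper specifies only that
    entries are segmented to the subparagraph level and carry historical
    revision timestamps.
    """
    law_name: str       # statute name, e.g. "中华人民共和国民法典"
    article: str        # article identifier, e.g. "第一百八十八条"
    subparagraph: str   # subparagraph identifier within the article
    revision_date: str  # ISO date of the revision this text belongs to
    text: str           # authoritative provision text for that revision


def normalize(s: str) -> str:
    """Strip all whitespace so formatting differences do not mask content errors
    (a hypothetical normalization; the paper's exact matching rule is not given here)."""
    return "".join(s.split())


def verbatim_recall(gold: List[ProvisionEntry], predictions: List[str]) -> float:
    """Fraction of provisions the model reproduces verbatim after normalization."""
    assert len(gold) == len(predictions)
    if not gold:
        return 0.0
    hits = sum(normalize(p) == normalize(g.text) for g, p in zip(gold, predictions))
    return hits / len(gold)


if __name__ == "__main__":
    entry = ProvisionEntry(
        law_name="示例法",
        article="第一条",
        subparagraph="(一)",
        revision_date="2021-01-01",
        text="示例条文内容。",
    )
    print(verbatim_recall([entry], ["示例条文内容。"]))  # -> 1.0
```

Under this sketch, a fine-tuned or retrieval-augmented model is credited only when its output matches the gold provision character-for-character after normalization, which mirrors the paper's emphasis on faithful reproduction of provisions rather than paraphrase.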
Related papers
- LegalOne: A Family of Foundation Models for Reliable Legal Reasoning [54.57434222018289]
We present LegalOne, a family of foundational models specifically tailored for the Chinese legal domain. LegalOne is developed through a comprehensive three-phase pipeline designed to master legal reasoning. We publicly release the LegalOne weights and the LegalKit evaluation framework to advance the field of Legal AI.
arXiv Detail & Related papers (2026-01-31T10:18:32Z) - PLawBench: A Rubric-Based Benchmark for Evaluating LLMs in Real-World Legal Practice [67.71760070255425]
We introduce PLawBench, a practical benchmark for evaluating large language models (LLMs) in legal practice scenarios. PLawBench comprises 850 questions across 13 practical legal scenarios, with each question accompanied by expert-designed evaluation rubrics. Using an LLM-based evaluator aligned with human expert judgments, we evaluate 10 state-of-the-art LLMs.
arXiv Detail & Related papers (2026-01-23T11:36:10Z) - Chinese Labor Law Large Language Model Benchmark [11.552694592413303]
We present LabourLawLLM, a large language model tailored to Chinese labor law. We also introduce LabourLawBench, a benchmark covering diverse labor-law tasks. Experiments show that LabourLawLLM consistently outperforms general-purpose and existing legal-specific LLMs.
arXiv Detail & Related papers (2026-01-15T01:27:29Z) - Evaluating Legal Reasoning Traces with Legal Issue Tree Rubrics [49.3262123849242]
We introduce LEGIT (LEGal Issue Trees), a novel large-scale (24K instances) expert-level legal reasoning dataset. We convert court judgments into hierarchical trees of opposing parties' arguments and the court's conclusions, which serve as rubrics for evaluating the issue coverage and correctness of the reasoning traces.
arXiv Detail & Related papers (2025-11-30T18:32:43Z) - ClaimGen-CN: A Large-scale Chinese Dataset for Legal Claim Generation [56.79698529022327]
Legal claims refer to the plaintiff's demands in a case and are essential to guiding judicial reasoning and case resolution. This paper explores the problem of legal claim generation based on the given case's facts. We construct ClaimGen-CN, the first dataset for the Chinese legal claim generation task.
arXiv Detail & Related papers (2025-08-24T07:19:25Z) - GLARE: Agentic Reasoning for Legal Judgment Prediction [60.13483016810707]
Legal judgment prediction (LJP) has become increasingly important in the legal field. Existing large language models (LLMs) suffer from insufficient reasoning due to a lack of legal knowledge. We introduce GLARE, an agentic legal reasoning framework that dynamically acquires key legal knowledge by invoking different modules.
arXiv Detail & Related papers (2025-08-22T13:38:12Z) - Evaluating the Role of Large Language Models in Legal Practice in India [0.0]
The integration of Artificial Intelligence into the legal profession raises significant questions about the capacity of Large Language Models to perform key legal tasks. I empirically evaluate how well LLMs, such as GPT, Claude, and Llama, perform key legal tasks in the Indian context. I conclude that while LLMs can augment certain legal tasks, human expertise remains essential for nuanced reasoning and the precise application of law.
arXiv Detail & Related papers (2025-08-13T11:04:48Z) - LEXam: Benchmarking Legal Reasoning on 340 Law Exams [61.344330783528015]
LEXam is a novel benchmark derived from 340 law exams spanning 116 law school courses across a range of subjects and degree levels. The dataset comprises 4,886 law exam questions in English and German, including 2,841 long-form, open-ended questions and 2,045 multiple-choice questions.
arXiv Detail & Related papers (2025-05-19T08:48:12Z) - Legal Evalutions and Challenges of Large Language Models [42.51294752406578]
We use the OpenAI o1 model as a case study to evaluate the performance of large models in applying legal provisions.
We compare current state-of-the-art LLMs, including open-source, closed-source, and legal-specific models trained specifically for the legal domain.
arXiv Detail & Related papers (2024-11-15T12:23:12Z) - Developing a Pragmatic Benchmark for Assessing Korean Legal Language Understanding in Large Language Models [7.797885529152412]
Large language models (LLMs) have demonstrated remarkable performance in the legal domain.
However, their efficacy remains limited for non-standardized tasks and tasks in languages other than English.
This underscores the need for careful evaluation of LLMs within each legal system before application.
arXiv Detail & Related papers (2024-10-11T11:41:02Z) - InternLM-Law: An Open Source Chinese Legal Large Language Model [72.2589401309848]
InternLM-Law is a specialized LLM tailored for addressing diverse legal queries related to Chinese laws.
We meticulously construct a dataset in the Chinese legal domain, encompassing over 1 million queries.
InternLM-Law achieves the highest average performance on LawBench, outperforming state-of-the-art models, including GPT-4, on 13 out of 20 subtasks.
arXiv Detail & Related papers (2024-06-21T06:19:03Z) - Automating IRAC Analysis in Malaysian Contract Law using a Semi-Structured Knowledge Base [22.740895683854568]
This paper introduces LegalSemi, a benchmark specifically curated for legal scenario analysis. LegalSemi comprises 54 legal scenarios, each rigorously annotated by legal experts, based on the comprehensive IRAC (Issue, Rule, Application, Conclusion) framework from Malaysian Contract Law. A series of experiments was conducted to assess the usefulness of LegalSemi for IRAC analysis.
arXiv Detail & Related papers (2024-06-19T04:59:09Z) - LAiW: A Chinese Legal Large Language Models Benchmark [17.66376880475554]
General-purpose and legal-domain LLMs have demonstrated strong performance on various LegalAI tasks.
We are the first to build a Chinese legal LLM benchmark, LAiW, based on the logic of legal practice.
arXiv Detail & Related papers (2023-10-09T11:19:55Z)