LEXam: Benchmarking Legal Reasoning on 340 Law Exams
- URL: http://arxiv.org/abs/2505.12864v5
- Date: Thu, 23 Oct 2025 19:18:23 GMT
- Title: LEXam: Benchmarking Legal Reasoning on 340 Law Exams
- Authors: Yu Fan, Jingwei Ni, Jakob Merane, Yang Tian, Yoan Hermstrüwer, Yinya Huang, Mubashara Akhtar, Etienne Salimbeni, Florian Geering, Oliver Dreyer, Daniel Brunner, Markus Leippold, Mrinmaya Sachan, Alexander Stremitzer, Christoph Engel, Elliott Ash, Joel Niklaus
- Abstract summary: We introduce LEXam, a novel benchmark derived from 340 law exams spanning 116 law school courses across a range of subjects and degree levels. The dataset comprises 4,886 law exam questions in English and German, including 2,841 long-form, open-ended questions and 2,045 multiple-choice questions. Our results underscore the effectiveness of the dataset in differentiating between models with varying capabilities.
- Score: 76.3521146499006
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Long-form legal reasoning remains a key challenge for large language models (LLMs) in spite of recent advances in test-time scaling. To address this, we introduce LEXam, a novel benchmark derived from 340 law exams spanning 116 law school courses across a range of subjects and degree levels. The dataset comprises 4,886 law exam questions in English and German, including 2,841 long-form, open-ended questions and 2,045 multiple-choice questions. Besides reference answers, the open questions are also accompanied by explicit guidance outlining the expected legal reasoning approach, such as issue spotting, rule recall, or rule application. Our evaluation shows that both open-ended and multiple-choice questions present significant challenges for current LLMs; in particular, the models struggle with open questions that require structured, multi-step legal reasoning. Moreover, our results underscore the effectiveness of the dataset in differentiating between models with varying capabilities. Deploying an ensemble LLM-as-a-Judge paradigm with rigorous human expert validation, we demonstrate how model-generated reasoning steps can be evaluated consistently and accurately, closely aligning with human expert assessments. Our evaluation setup provides a scalable method to assess legal reasoning quality beyond simple accuracy metrics. We have open-sourced our code at https://github.com/LEXam-Benchmark/LEXam and released our data at https://huggingface.co/datasets/LEXam-Benchmark/LEXam. Project page: https://lexam-benchmark.github.io.
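The ensemble LLM-as-a-Judge setup mentioned in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the judge names, the 0-to-1 score scale, the plain mean used for aggregation, and the sample numbers are hypothetical and are not taken from the LEXam code or paper. The sketch averages per-answer scores from several judges and measures how closely the ensemble tracks human expert scores with a Pearson correlation.

```python
from math import sqrt
from statistics import mean


def ensemble_scores(judge_scores):
    """Average each answer's score across all judges (hypothetical aggregation)."""
    per_answer = zip(*judge_scores.values())
    return [mean(scores) for scores in per_answer]


def pearson(xs, ys):
    """Pearson correlation between two equally long score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical scores from three judge models on three open-ended answers.
judge_scores = {
    "judge_a": [0.8, 0.4, 0.6],
    "judge_b": [0.6, 0.2, 0.9],
    "judge_c": [0.7, 0.3, 0.6],
}
# Hypothetical human expert scores for the same three answers.
human_scores = [0.75, 0.25, 0.65]

ensemble = ensemble_scores(judge_scores)
agreement = pearson(ensemble, human_scores)
print("ensemble scores:", ensemble)
print("agreement with human experts:", round(agreement, 3))
```

Averaging is only one possible way to combine judges; a median or a majority vote over discretized grades would fit the same structure.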
Related papers
- PLawBench: A Rubric-Based Benchmark for Evaluating LLMs in Real-World Legal Practice [67.71760070255425]
We introduce PLawBench, a practical benchmark for evaluating large language models (LLMs) in legal practice scenarios. PLawBench comprises 850 questions across 13 practical legal scenarios, with each question accompanied by expert-designed evaluation rubrics. Using an LLM-based evaluator aligned with human expert judgments, we evaluate 10 state-of-the-art LLMs.
arXiv Detail & Related papers (2026-01-23T11:36:10Z)
- LexGenius: An Expert-Level Benchmark for Large Language Models in Legal General Intelligence [74.05988707492058]
Legal general intelligence (GI) refers to artificial intelligence (AI) that encompasses legal understanding, reasoning, and decision-making. Existing benchmarks are result-oriented and fail to systematically evaluate the legal intelligence of large language models (LLMs). We propose LexGenius, an expert-level Chinese legal benchmark for evaluating legal GI in LLMs.
arXiv Detail & Related papers (2025-12-04T08:48:02Z)
- Are LLMs Court-Ready? Evaluating Frontier Models on Indian Legal Reasoning [0.5308136763388956]
We use India's public legal examinations as a transparent proxy. Our benchmark assembles objective screens from top national and state exams. We also include a lawyer-graded, paired-blinded study of long-form answers from the Supreme Court's Advocate-on-Record exam.
arXiv Detail & Related papers (2025-10-19T10:04:29Z)
- KoBLEX: Open Legal Question Answering with Multi-hop Reasoning [12.122913185860634]
We introduce a Korean Benchmark for Legal EXplainable QA (KoBLEX). KoBLEX is designed to evaluate provision-grounded, multi-hop legal reasoning. We also propose a method called Parametric provision-guided Selection Retrieval (ParSeR).
arXiv Detail & Related papers (2025-09-01T10:07:00Z)
- MMReason: An Open-Ended Multi-Modal Multi-Step Reasoning Benchmark for MLLMs Toward AGI [59.196131618912005]
Reasoning plays a crucial role in advancing Multimodal Large Language Models (MLLMs). Existing MLLM benchmarks often fall short in precisely and comprehensively evaluating long-chain reasoning abilities. We introduce MMReason, a new benchmark designed to precisely and comprehensively evaluate MLLM long-chain reasoning capability.
arXiv Detail & Related papers (2025-06-30T07:14:38Z)
- RLJP: Legal Judgment Prediction via First-Order Logic Rule-enhanced with Large Language Models [58.69183479148083]
Legal Judgment Prediction (LJP) is a pivotal task in legal AI. Existing LJP models integrate judicial precedents and legal knowledge for high performance. But they neglect legal reasoning logic, a critical component of legal judgments requiring rigorous logical analysis. This paper proposes a rule-enhanced legal judgment prediction framework based on first-order logic (FOL) formalism and comparative learning (CL).
arXiv Detail & Related papers (2025-05-27T14:50:21Z)
- AnnoCaseLaw: A Richly-Annotated Dataset For Benchmarking Explainable Legal Judgment Prediction [56.797874973414636]
AnnoCaseLaw is a first-of-its-kind dataset of 471 meticulously annotated U.S. Appeals Court negligence cases. Our dataset lays the groundwork for more human-aligned, explainable Legal Judgment Prediction models. Results demonstrate that LJP remains a formidable task, with application of legal precedent proving particularly difficult.
arXiv Detail & Related papers (2025-02-28T19:14:48Z)
- LegalBench.PT: A Benchmark for Portuguese Law [17.554201334646056]
We present LegalBench.PT, the first comprehensive legal benchmark covering key areas of Portuguese law. We first collect long-form questions and answers from real law exams, and then use GPT-4o to convert them into multiple-choice, true/false, and matching formats.
arXiv Detail & Related papers (2025-02-22T21:07:12Z)
- Evaluating LLM-based Approaches to Legal Citation Prediction: Domain-specific Pre-training, Fine-tuning, or RAG? A Benchmark and an Australian Law Case Study [9.30538764385435]
Large Language Models (LLMs) have demonstrated strong potential across legal tasks, yet the problem of legal citation prediction remains under-explored. We introduce the AusLaw Citation Benchmark, a real-world dataset comprising 55k Australian legal instances and 18,677 unique citations. We then conduct a systematic benchmarking across a range of solutions. Results show that neither general nor law-specific LLMs suffice as stand-alone solutions, with performance near zero.
arXiv Detail & Related papers (2024-12-09T07:46:14Z)
- Legal Evalutions and Challenges of Large Language Models [42.51294752406578]
We use the OPENAI o1 model as a case study to evaluate the performance of large models in applying legal provisions.
We compare current state-of-the-art LLMs, including open-source, closed-source, and legal-specific models trained specifically for the legal domain.
arXiv Detail & Related papers (2024-11-15T12:23:12Z)
- LiveBench: A Challenging, Contamination-Limited LLM Benchmark [93.57775429120488]
We release LiveBench, the first benchmark that contains frequently-updated questions from recent information sources. We evaluate many prominent closed-source models, as well as dozens of open-source models ranging from 0.5B to 405B in size. Questions are added and updated on a monthly basis, and we release new tasks and harder versions of tasks over time.
arXiv Detail & Related papers (2024-06-27T16:47:42Z)
- LLM vs. Lawyers: Identifying a Subset of Summary Judgments in a Large UK Case Law Dataset [0.0]
This study addresses the gap in the literature working with large legal corpora about how to isolate cases, in our case summary judgments, from a large corpus of UK court decisions.
We use the Cambridge Law Corpus of 356,011 UK court decisions and determine that the large language model achieves a weighted F1 score of 0.94 versus 0.78 for keywords.
We identify and extract 3,102 summary judgment cases, enabling us to map their distribution across various UK courts over a temporal span.
arXiv Detail & Related papers (2024-03-04T10:13:30Z)
- A Comprehensive Evaluation of Large Language Models on Legal Judgment Prediction [60.70089334782383]
Large language models (LLMs) have demonstrated great potential for domain-specific applications.
Recent disputes over GPT-4's law evaluation raise questions concerning their performance in real-world legal tasks.
We design practical baseline solutions based on LLMs and test on the task of legal judgment prediction.
arXiv Detail & Related papers (2023-10-18T07:38:04Z)
- Large Language Models can Learn Rules [106.40747309894236]
We present Hypotheses-to-Theories (HtT), a framework that learns a rule library for reasoning with large language models (LLMs). Experiments on relational reasoning, numerical reasoning and concept learning problems show that HtT improves existing prompting methods. The learned rules are also transferable to different models and to different forms of the same problem.
arXiv Detail & Related papers (2023-10-10T23:07:01Z)
- Interpretable Long-Form Legal Question Answering with Retrieval-Augmented Large Language Models [10.834755282333589]
The Long-form Legal Question Answering (LLeQA) dataset comprises 1,868 expert-annotated legal questions in the French language.
Our experimental results demonstrate promising performance on automatic evaluation metrics.
As one of the only comprehensive, expert-annotated long-form LQA datasets, LLeQA has the potential to not only accelerate research towards resolving a significant real-world issue, but also act as a rigorous benchmark for evaluating NLP models in specialized domains.
arXiv Detail & Related papers (2023-09-29T08:23:19Z)
- The Legal Argument Reasoning Task in Civil Procedure [2.079168053329397]
We present a new NLP task and dataset from the domain of U.S. civil procedure.
Each instance of the dataset consists of a general introduction to the case, a particular question, and a possible solution argument.
arXiv Detail & Related papers (2022-11-05T17:41:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.