Not ready for the bench: LLM legal interpretation is unstable and out of step with human judgments
- URL: http://arxiv.org/abs/2510.25356v1
- Date: Wed, 29 Oct 2025 10:21:25 GMT
- Title: Not ready for the bench: LLM legal interpretation is unstable and out of step with human judgments
- Authors: Abhishek Purushothama, Junghyun Min, Brandon Waldon, Nathan Schneider
- Abstract summary: Recent scholarship has proposed that legal practitioners add large language models (LLMs) to their interpretive toolkit. This work offers an empirical argument against LLM interpretation as recently practiced by legal scholars and federal judges. Our investigation in English shows that models do not provide stable interpretive judgments.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Legal interpretation frequently involves assessing how a legal text, as understood by an 'ordinary' speaker of the language, applies to the set of facts characterizing a legal dispute in the U.S. judicial system. Recent scholarship has proposed that legal practitioners add large language models (LLMs) to their interpretive toolkit. This work offers an empirical argument against LLM interpretation as recently practiced by legal scholars and federal judges. Our investigation in English shows that models do not provide stable interpretive judgments: varying the question format can lead the model to wildly different conclusions. Moreover, the models show weak to moderate correlation with human judgment, with large variance across model and question variant, suggesting that it is dangerous to give much credence to the conclusions produced by generative AI.
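The instability claim lends itself to a simple quantitative check: pose the same interpretive question in several formats and measure how often the verdicts agree. The sketch below uses invented verdicts as stand-ins for model outputs; the example questions and the `pairwise_agreement` helper are illustrative assumptions, not the paper's actual protocol.

```python
from itertools import combinations

def pairwise_agreement(verdicts):
    """Fraction of question-variant pairs that yield the same verdict."""
    pairs = list(combinations(verdicts, 2))
    if not pairs:
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)

# Hypothetical verdicts for one interpretive question asked in three
# formats (yes/no, multiple-choice, open-ended). A stable model would
# score 1.0 on every question.
verdicts_by_question = {
    "Is a drone a 'vehicle' under the park statute?": ["yes", "no", "no"],
    "Is a bicycle a 'vehicle' under the park statute?": ["yes", "yes", "yes"],
}

for question, verdicts in verdicts_by_question.items():
    print(f"{pairwise_agreement(verdicts):.2f}  {question}")
```

An agreement score well below 1.0 on many questions would reproduce, in miniature, the paper's finding that question format alone can flip a model's conclusion.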
Related papers
- LegalOne: A Family of Foundation Models for Reliable Legal Reasoning [54.57434222018289]
We present LegalOne, a family of foundational models specifically tailored for the Chinese legal domain. LegalOne is developed through a comprehensive three-phase pipeline designed to master legal reasoning. We publicly release the LegalOne weights and the LegalKit evaluation framework to advance the field of Legal AI.
arXiv Detail & Related papers (2026-01-31T10:18:32Z) - Large Language Models' Complicit Responses to Illicit Instructions across Socio-Legal Contexts [54.15982476754607]
Large language models (LLMs) are now deployed at unprecedented scale, assisting millions of users in daily tasks. This study defines complicit facilitation as the provision of guidance or support that enables illicit user instructions. Using real-world legal cases and established legal frameworks, we construct an evaluation benchmark spanning 269 illicit scenarios and 50 illicit intents.
arXiv Detail & Related papers (2025-11-25T16:01:31Z) - GLARE: Agentic Reasoning for Legal Judgment Prediction [60.13483016810707]
Legal judgment prediction (LJP) has become increasingly important in the legal field. Existing large language models (LLMs) suffer from insufficient reasoning due to a lack of legal knowledge. We introduce GLARE, an agentic legal reasoning framework that dynamically acquires key legal knowledge by invoking different modules.
arXiv Detail & Related papers (2025-08-22T13:38:12Z) - Conditioning Large Language Models on Legal Systems? Detecting Punishable Hate Speech [3.4300974012019148]
This paper examines different approaches to conditioning Large Language Models (LLMs) at multiple levels of abstraction in legal systems to detect potentially punishable hate speech. We focus on the task of classifying whether a specific social media post falls under the criminal offense of incitement to hatred as prescribed by the German Criminal Code. The results show that there is still a significant performance gap between models and legal experts in the legal assessment of hate speech, regardless of the level of abstraction with which the models were conditioned.
arXiv Detail & Related papers (2025-06-03T15:50:27Z) - LEXam: Benchmarking Legal Reasoning on 340 Law Exams [76.3521146499006]
We introduce LEXam, a novel benchmark derived from 340 law exams spanning 116 law school courses across a range of subjects and degree levels. The dataset comprises 4,886 law exam questions in English and German, including 2,841 long-form, open-ended questions and 2,045 multiple-choice questions. Our results underscore the effectiveness of the dataset in differentiating between models with varying capabilities.
arXiv Detail & Related papers (2025-05-19T08:48:12Z) - Artificial Intelligence and Legal Analysis: Implications for Legal Education and the Profession [0.0]
This article reports the results of a study examining the ability of legal and nonlegal Large Language Models to perform legal analysis. The results show that LLMs can conduct basic IRAC analysis, but are limited by brief responses lacking detail, an inability to commit to answers, false confidence, and hallucinations.
arXiv Detail & Related papers (2025-02-04T19:50:48Z) - Automating Legal Interpretation with LLMs: Retrieval, Generation, and Evaluation [27.345475442620746]
ATRIE consists of a legal concept interpreter and a legal concept interpretation evaluator. The quality of our interpretations is comparable to those written by legal experts, with superior comprehensiveness and readability. Although there remains a slight gap in accuracy, it can already assist legal practitioners in improving the efficiency of legal interpretation.
arXiv Detail & Related papers (2025-01-03T10:11:38Z) - Legal Evalutions and Challenges of Large Language Models [42.51294752406578]
We use the OpenAI o1 model as a case study to evaluate the performance of large models in applying legal provisions.
We compare current state-of-the-art LLMs, including open-source, closed-source, and legal-specific models trained specifically for the legal domain.
arXiv Detail & Related papers (2024-11-15T12:23:12Z) - Evaluating the Correctness of Inference Patterns Used by LLMs for Judgment [53.17596274334017]
We evaluate the correctness of the detailed inference patterns of an LLM behind its seemingly correct outputs. Experiments show that even when the language generation results appear correct, a significant portion of the inference patterns used by the LLM for the legal judgment may represent misleading or irrelevant logic.
arXiv Detail & Related papers (2024-10-06T08:33:39Z) - Leveraging Large Language Models for Relevance Judgments in Legal Case Retrieval [16.29803062332164]
We propose a few-shot approach where large language models assist in generating expert-aligned relevance judgments. The proposed approach decomposes the judgment process into several stages, mimicking the workflow of human annotators. It also ensures interpretable data labeling, providing transparency and clarity in the relevance assessment process.
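The staged workflow described above can be sketched as a simple prompt chain, where each stage's note feeds the next. `ask_model` is a placeholder for a real LLM call, and the stage wording is invented for illustration, not taken from the paper.

```python
def ask_model(prompt: str) -> str:
    # Placeholder: a real implementation would call an LLM API here.
    return f"[model answer to: {prompt[:40]}...]"

# Hypothetical stages mimicking a human annotator's workflow.
STAGES = [
    "Summarize the key facts of the query case.",
    "Summarize the key facts of the candidate case.",
    "List the legal issues the two cases share.",
    "Given the notes above, rate relevance from 1 (unrelated) to 4 (on point).",
]

def judge_relevance(query_case: str, candidate_case: str) -> list[str]:
    """Run the staged judgment; returns the full transcript of notes."""
    notes = [f"Query case: {query_case}", f"Candidate case: {candidate_case}"]
    for stage in STAGES:
        prompt = "\n".join(notes) + "\n\nTask: " + stage
        notes.append(ask_model(prompt))
    return notes  # the final note is the relevance rating
```

Keeping every intermediate note in the transcript is what makes the labeling interpretable: a reviewer can inspect which stage went wrong rather than auditing a single opaque score.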
arXiv Detail & Related papers (2024-03-27T09:46:56Z) - Towards Explainability in Legal Outcome Prediction Models [64.00172507827499]
We argue that precedent is a natural way of facilitating explainability for legal NLP models.
By developing a taxonomy of legal precedent, we are able to compare human judges and neural models.
We find that while the models learn to predict outcomes reasonably well, their use of precedent is unlike that of human judges.
arXiv Detail & Related papers (2024-03-25T15:15:41Z) - Legal Syllogism Prompting: Teaching Large Language Models for Legal Judgment Prediction [0.6091702876917281]
Legal syllogism prompting (LoT) is a simple prompting method to teach large language models for legal judgment prediction.
LoT teaches only that in the legal syllogism the major premise is law, the minor premise is the fact, and the conclusion is judgment.
Our results show that LLMs with LoT achieve better performance than the baseline and chain of thought prompting.
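The syllogism structure LoT teaches can be sketched as a prompt template; the wording below is illustrative, not the paper's exact prompt.

```python
# Law as the major premise, facts as the minor premise, judgment as the
# conclusion -- the model is asked to complete the conclusion slot.
SYLLOGISM_TEMPLATE = (
    "Legal syllogism:\n"
    "Major premise (law): {law}\n"
    "Minor premise (facts): {facts}\n"
    "Conclusion (judgment):"
)

prompt = SYLLOGISM_TEMPLATE.format(
    law="Whoever steals another's property is guilty of theft.",
    facts="The defendant took the victim's bicycle without permission.",
)
print(prompt)
```

The template carries no task-specific examples, which is the point of the method: the syllogistic scaffold alone is enough to improve over plain and chain-of-thought prompting in the paper's experiments.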
arXiv Detail & Related papers (2023-07-17T08:38:46Z) - Exploiting Contrastive Learning and Numerical Evidence for Confusing Legal Judgment Prediction [46.71918729837462]
Given the fact description text of a legal case, legal judgment prediction aims to predict the case's charge, law article and penalty term.
Previous studies fail to distinguish different classification errors with a standard cross-entropy classification loss.
We propose a MoCo-based supervised contrastive learning method to learn distinguishable representations.
We further enhance the representation of the fact description with extracted crime amounts which are encoded by a pre-trained numeracy model.
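The contrastive objective can be sketched in plain NumPy. This is the generic supervised contrastive (SupCon) loss rather than the paper's exact method: the MoCo-style momentum encoder, negative queue, and numeracy features are omitted, and the function name is ours.

```python
import numpy as np

def supcon_loss(embeddings: np.ndarray, labels: np.ndarray,
                tau: float = 0.1) -> float:
    """Supervised contrastive loss over one batch: each anchor is pulled
    toward same-label examples and pushed away from the rest."""
    # L2-normalize so the dot product is cosine similarity.
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / tau
    n = len(labels)
    per_anchor = []
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # anchors with no positive contribute nothing
        others = [a for a in range(n) if a != i]
        log_denom = np.log(np.sum(np.exp(sim[i, others])))
        per_anchor.append(-np.mean([sim[i, p] - log_denom for p in positives]))
    return float(np.mean(per_anchor))
```

Under this loss, two cases with the same charge but different crime amounts can still be separated downstream, because the numeracy-model features are concatenated onto the fact representation before classification.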
arXiv Detail & Related papers (2022-11-15T15:53:56Z) - Lawformer: A Pre-trained Language Model for Chinese Legal Long Documents [56.40163943394202]
We release the Longformer-based pre-trained language model, named Lawformer, for Chinese legal long-document understanding.
We evaluate Lawformer on a variety of LegalAI tasks, including judgment prediction, similar case retrieval, legal reading comprehension, and legal question answering.
arXiv Detail & Related papers (2021-05-09T09:39:25Z)