Better Call CLAUSE: A Discrepancy Benchmark for Auditing LLMs Legal Reasoning Capabilities
- URL: http://arxiv.org/abs/2511.00340v1
- Date: Sat, 01 Nov 2025 00:51:21 GMT
- Title: Better Call CLAUSE: A Discrepancy Benchmark for Auditing LLMs Legal Reasoning Capabilities
- Authors: Manan Roy Choudhury, Adithya Chandramouli, Mannan Anand, Vivek Gupta
- Abstract summary: CLAUSE is a first-of-its-kind benchmark designed to evaluate the fragility of an LLM's legal reasoning. Our work outlines a path to identify and correct such reasoning failures in legal AI.
- Score: 15.35489310097019
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid integration of large language models (LLMs) into high-stakes legal work has exposed a critical gap: no benchmark exists to systematically stress-test their reliability against the nuanced, adversarial, and often subtle flaws present in real-world contracts. To address this, we introduce CLAUSE, a first-of-its-kind benchmark designed to evaluate the fragility of an LLM's legal reasoning. We study the capabilities of LLMs to detect and reason about fine-grained discrepancies by producing over 7500 real-world perturbed contracts from foundational datasets like CUAD and ContractNLI. Our novel, persona-driven pipeline generates 10 distinct anomaly categories, which are then validated against official statutes using a Retrieval-Augmented Generation (RAG) system to ensure legal fidelity. We use CLAUSE to evaluate leading LLMs' ability to detect embedded legal flaws and explain their significance. Our analysis shows a key weakness: these models often miss subtle errors and struggle even more to justify them legally. Our work outlines a path to identify and correct such reasoning failures in legal AI.
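The abstract describes the RAG validation step only at a high level. As a rough sketch of what checking a perturbed clause against statute text could look like, not the authors' implementation, the snippet below pairs TF-IDF retrieval with an LLM judgment; the statute snippets and the `llm_complete` helper are hypothetical stand-ins.

```python
# Sketch of a RAG-style legal-fidelity check: retrieve statutes similar to a
# perturbed clause, then ask an LLM whether the injected anomaly holds up.
# The statutes, the retriever choice, and llm_complete() are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

STATUTES = [  # stand-ins; a real index would hold full statute texts
    "A contract for the sale of goods over $500 must be in writing.",
    "A non-compete covenant must be reasonable in scope and duration.",
    "Liquidated damages must be a reasonable forecast of actual harm.",
]

def llm_complete(prompt: str) -> str:
    """Stand-in for a real LLM API call."""
    raise NotImplementedError("plug in an LLM client here")

def retrieve_statutes(clause: str, k: int = 2) -> list[str]:
    """Return the k statutes most lexically similar to the clause."""
    vec = TfidfVectorizer().fit(STATUTES + [clause])
    sims = cosine_similarity(vec.transform([clause]), vec.transform(STATUTES))[0]
    return [s for _, s in sorted(zip(sims, STATUTES), reverse=True)[:k]]

def validate_anomaly(clause: str, anomaly_label: str) -> str:
    """Ask the LLM whether the injected anomaly conflicts with retrieved law."""
    context = "\n".join(retrieve_statutes(clause))
    prompt = (
        f"Statutes:\n{context}\n\nClause:\n{clause}\n\n"
        f"Is the anomaly '{anomaly_label}' present, and does the clause "
        f"conflict with the statutes above? Answer VALID or INVALID with a reason."
    )
    return llm_complete(prompt)
```

The paper validates anomalies against official statutes; this sketch only shows the general shape of a retrieve-then-verify loop.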
Related papers
- PLawBench: A Rubric-Based Benchmark for Evaluating LLMs in Real-World Legal Practice [67.71760070255425]
We introduce PLawBench, a practical benchmark for evaluating large language models (LLMs) in legal practice scenarios. PLawBench comprises 850 questions across 13 practical legal scenarios, with each question accompanied by expert-designed evaluation rubrics. Using an LLM-based evaluator aligned with human expert judgments, we evaluate 10 state-of-the-art LLMs.
arXiv Detail & Related papers (2026-01-23T11:36:10Z)
- Benchmarking Multi-Step Legal Reasoning and Analyzing Chain-of-Thought Effects in Large Language Models [8.769542756426786]
We introduce MSLR, the first Chinese multi-step legal reasoning dataset grounded in real-world judicial decision making. MSLR adopts the IRAC framework (Issue, Rule, Application, Conclusion) to model structured expert reasoning from official legal documents. We design a scalable Human-LLM collaborative annotation pipeline that efficiently produces fine-grained step-level reasoning annotations. Further experiments demonstrate that Self-Initiated Chain-of-Thought prompts, generated autonomously by models, improve reasoning coherence and quality, outperforming human-designed prompts.
arXiv Detail & Related papers (2025-11-11T08:45:29Z)
- LLMs as verification oracles for Solidity [1.3887048755037537]
This paper provides the first systematic evaluation of GPT-5, a state-of-the-art reasoning LLM, in this role. We benchmark its performance on a large dataset of verification tasks, compare its outputs against those of established formal verification tools, and assess its practical effectiveness in real-world auditing scenarios. Our study suggests a new frontier in the convergence of AI and formal methods for secure smart contract development and auditing.
arXiv Detail & Related papers (2025-09-23T15:32:13Z)
- SoK: Large Language Model Copyright Auditing via Fingerprinting [69.14570598973195]
We introduce a unified framework and formal taxonomy that categorizes existing methods into white-box and black-box approaches. We propose LeaFBench, the first systematic benchmark for evaluating LLM fingerprinting under realistic deployment scenarios.
arXiv Detail & Related papers (2025-08-27T12:56:57Z)
- Verifying the Verifiers: Unveiling Pitfalls and Potentials in Fact Verifiers [59.168391398830515]
We evaluate 12 pre-trained LLMs and one specialized fact-verifier, using a collection of examples from 14 fact-checking benchmarks. We highlight the importance of addressing annotation errors and ambiguity in datasets. Frontier LLMs with few-shot in-context examples, often overlooked in previous works, achieve top-tier performance.
arXiv Detail & Related papers (2025-06-16T10:32:10Z)
- AUTOLAW: Enhancing Legal Compliance in Large Language Models via Case Law Generation and Jury-Inspired Deliberation [5.732271982985626]
AutoLaw is a novel violation detection framework for domain-specific large language models (LLMs). It combines adversarial data generation with a jury-inspired deliberation process to enhance the legal compliance of LLMs. Our results highlight the framework's ability to adaptively probe legal misalignments and deliver reliable, context-aware judgments. (A minimal sketch of such a deliberation loop appears after this list.)
arXiv Detail & Related papers (2025-05-20T07:09:13Z)
- Everything You Wanted to Know About LLM-based Vulnerability Detection But Were Afraid to Ask [30.819697001992154]
Large Language Models are a promising tool for automated vulnerability detection. Despite widespread adoption, a critical question remains: Are LLMs truly effective at detecting real-world vulnerabilities? This paper challenges three widely held community beliefs: that LLMs are (i) unreliable, (ii) insensitive to code patches, and (iii) performance-plateaued across model scales.
arXiv Detail & Related papers (2025-04-18T05:32:47Z)
- J&H: Evaluating the Robustness of Large Language Models Under Knowledge-Injection Attacks in Legal Domain [12.550611136062722]
We propose a method of legal knowledge injection attacks for robustness testing. The aim of the framework is to explore whether LLMs perform deductive reasoning when accomplishing legal tasks. We have collected mistakes that legal experts might make in judicial decisions in the real world.
arXiv Detail & Related papers (2025-03-24T05:42:05Z)
- Investigating the Shortcomings of LLMs in Step-by-Step Legal Reasoning [34.427730009102966]
We develop an automated evaluation framework to identify reasoning errors and evaluate the performance of LLMs. Our work will also serve as an evaluation framework that can be used in detailed error analysis of reasoning chains for logic-intensive complex tasks.
arXiv Detail & Related papers (2025-02-08T19:49:32Z)
- A Closer Look at the Self-Verification Abilities of Large Language Models in Logical Reasoning [73.77088902676306]
We take a closer look at the self-verification abilities of large language models (LLMs) in the context of logical reasoning.
Our main findings suggest that existing LLMs could struggle to identify fallacious reasoning steps accurately and may fall short of guaranteeing the validity of self-verification methods.
arXiv Detail & Related papers (2023-11-14T07:13:10Z)
- A Comprehensive Evaluation of Large Language Models on Legal Judgment Prediction [60.70089334782383]
Large language models (LLMs) have demonstrated great potential for domain-specific applications.
Recent disputes over GPT-4's law evaluation raise questions concerning their performance in real-world legal tasks.
We design practical baseline solutions based on LLMs and test them on the task of legal judgment prediction.
arXiv Detail & Related papers (2023-10-18T07:38:04Z)
- LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond [135.8013388183257]
We propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits.
Most LLMs struggle on SummEdits, with performance close to random chance.
The best-performing model, GPT-4, is still 8% below estimated human performance.
arXiv Detail & Related papers (2023-05-23T21:50:06Z)
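None of the entries above include code. As a loose illustration of the jury-inspired deliberation mentioned in the AUTOLAW entry, the sketch below polls several persona-conditioned "jurors" and takes a majority verdict; the personas and the `llm_complete` helper are hypothetical stand-ins, not the paper's protocol.

```python
# Sketch of jury-inspired deliberation (cf. the AUTOLAW entry above): several
# persona-conditioned "jurors" vote on a compliance question and the majority
# verdict wins. Personas and llm_complete() are hypothetical stand-ins.
from collections import Counter

PERSONAS = [
    "a strict regulatory compliance officer",
    "a defense attorney looking for mitigating context",
    "a consumer-protection advocate",
]

def llm_complete(prompt: str) -> str:
    """Stand-in for a real LLM API call."""
    raise NotImplementedError("plug in an LLM client here")

def juror_verdict(persona: str, model_output: str) -> str:
    """One juror's binary vote on the model output under its persona."""
    prompt = (
        f"You are {persona}. Does the following model output violate the "
        f"relevant law? Answer exactly VIOLATION or COMPLIANT.\n\n{model_output}"
    )
    return "VIOLATION" if "VIOLATION" in llm_complete(prompt).upper() else "COMPLIANT"

def deliberate(model_output: str) -> str:
    """Majority vote across an odd-sized panel, so no ties arise."""
    votes = Counter(juror_verdict(p, model_output) for p in PERSONAS)
    return votes.most_common(1)[0][0]
```

An even-sized panel, or a faithful reading of the paper's protocol, would need an explicit tie-break or a second round in which jurors see one another's reasons.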
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.