ContractEval: Benchmarking LLMs for Clause-Level Legal Risk Identification in Commercial Contracts
- URL: http://arxiv.org/abs/2508.03080v1
- Date: Tue, 05 Aug 2025 04:53:05 GMT
- Title: ContractEval: Benchmarking LLMs for Clause-Level Legal Risk Identification in Commercial Contracts
- Authors: Shuang Liu, Zelong Li, Ruoyun Ma, Haiyan Zhao, Mengnan Du
- Abstract summary: The potential of large language models (LLMs) in specialized domains such as legal risk analysis remains underexplored. This paper introduces ContractEval, the first benchmark to thoroughly evaluate whether open-source LLMs could match proprietary LLMs in identifying clause-level legal risks in commercial contracts.
- Score: 21.217188970086344
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The potential of large language models (LLMs) in specialized domains such as legal risk analysis remains underexplored. In response to growing interest in locally deploying open-source LLMs for legal tasks while preserving data confidentiality, this paper introduces ContractEval, the first benchmark to thoroughly evaluate whether open-source LLMs could match proprietary LLMs in identifying clause-level legal risks in commercial contracts. Using the Contract Understanding Atticus Dataset (CUAD), we assess 4 proprietary and 15 open-source LLMs. Our results highlight five key findings: (1) Proprietary models outperform open-source models in both correctness and output effectiveness, though some open-source models are competitive in certain specific dimensions. (2) Larger open-source models generally perform better, though the improvement slows down as models get bigger. (3) Reasoning ("thinking") mode improves output effectiveness but reduces correctness, likely due to over-complicating simpler tasks. (4) Open-source models generate "no related clause" responses more frequently even when relevant clauses are present. This suggests "laziness" in thinking or low confidence in extracting relevant content. (5) Model quantization speeds up inference but at the cost of performance drop, showing the tradeoff between efficiency and accuracy. These findings suggest that while most LLMs perform at a level comparable to junior legal assistants, open-source models require targeted fine-tuning to ensure correctness and effectiveness in high-stakes legal settings. ContractEval offers a solid benchmark to guide future development of legal-domain LLMs.
Related papers
- Nine Ways to Break Copyright Law and Why Our LLM Won't: A Fair Use Aligned Generation Framework [7.941114118462577]
Large language models (LLMs) commonly risk copyright infringement by reproducing protected content verbatim or with insufficient transformative modifications. We develop a legally-grounded framework explicitly designed to align LLM outputs with fair-use doctrine. FuA-LLM substantially reduces problematic outputs (up to 20%) compared to state-of-the-art approaches.
arXiv Detail & Related papers (2025-05-25T12:23:26Z)
- Latent Factor Models Meets Instructions: Goal-conditioned Latent Factor Discovery without Task Supervision [50.45597801390757]
Instruct-LF is a goal-oriented latent factor discovery system. It integrates instruction-following ability with statistical models to handle noisy datasets.
arXiv Detail & Related papers (2025-02-21T02:03:08Z)
- LLM-Lasso: A Robust Framework for Domain-Informed Feature Selection and Regularization [59.75242204923353]
We introduce LLM-Lasso, a framework that leverages large language models (LLMs) to guide feature selection in Lasso regression. LLMs generate penalty factors for each feature, which are converted into weights for the Lasso penalty using a simple, tunable model. Features identified as more relevant by the LLM receive lower penalties, increasing their likelihood of being retained in the final model.
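The weighting scheme this entry describes can be sketched as a Lasso with per-feature soft thresholds: a lower weight means weaker shrinkage, so the feature is more likely to survive. The `weighted_lasso` function and the specific penalty weights below are illustrative assumptions (stand-ins for LLM-generated factors), not the paper's implementation.

```python
import numpy as np

def weighted_lasso(X, y, penalty_weights, alpha=0.1, n_iter=500):
    """Cyclic coordinate descent for Lasso with per-feature penalty weights.

    Minimizes  (1 / 2n) * ||y - X @ beta||^2  +  alpha * sum_j w_j * |beta_j|,
    so a lower w_j means a smaller soft threshold for feature j.
    """
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n           # x_j' x_j / n per column
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual: remove feature j's current contribution.
            r = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r / n
            thr = alpha * penalty_weights[j]    # per-feature soft threshold
            beta[j] = np.sign(rho) * max(abs(rho) - thr, 0.0) / col_sq[j]
    return beta

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 4))
y = 3.0 * X[:, 0] + 0.1 * rng.standard_normal(200)

# Pretend an LLM judged feature 0 highly relevant (low penalty) and
# feature 3 irrelevant (high penalty); these weights are made up.
weights = np.array([0.1, 1.0, 1.0, 5.0])
beta = weighted_lasso(X, y, weights, alpha=0.1)
```

With these weights, the lightly penalized feature 0 keeps a coefficient near its true value while the heavily penalized feature 3 is shrunk to zero, which is the retention behavior the abstract describes.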
arXiv Detail & Related papers (2025-02-15T02:55:22Z)
- LawGPT: Knowledge-Guided Data Generation and Its Application to Legal LLM [41.31814587755912]
We propose a knowledge-guided data generation framework for legal reasoning. Our framework leverages legal knowledge to enhance generation diversity and introduces a refinement and verification process. Our trained model LawGPT outperforms existing legal-specific LLMs and achieves performance comparable to proprietary LLMs.
arXiv Detail & Related papers (2025-02-10T15:40:35Z)
- Adaptive Pruning for Large Language Models with Structural Importance Awareness [66.2690963378878]
Large language models (LLMs) have significantly improved language understanding and generation capabilities. LLMs are difficult to deploy on resource-constrained edge devices due to their high computational and storage resource demands. We propose structurally-aware adaptive pruning (SAAP) to significantly reduce the computational and memory costs while maintaining model performance.
arXiv Detail & Related papers (2024-12-19T18:08:04Z)
- Boosting LLM-based Relevance Modeling with Distribution-Aware Robust Learning [14.224921308101624]
We propose a novel Distribution-Aware Robust Learning framework (DaRL) for relevance modeling. DaRL has been deployed online to serve Alipay's insurance product search.
arXiv Detail & Related papers (2024-12-17T03:10:47Z)
- GIVE: Structured Reasoning of Large Language Models with Knowledge Graph Inspired Veracity Extrapolation [108.2008975785364]
Graph Inspired Veracity Extrapolation (GIVE) is a novel reasoning method that merges parametric and non-parametric memories to improve accurate reasoning with minimal external input. GIVE guides the LLM agent to select the most pertinent expert data (observe), engage in query-specific divergent thinking (reflect), and then synthesize this information to produce the final output (speak).
arXiv Detail & Related papers (2024-10-11T03:05:06Z)
- LiCoEval: Evaluating LLMs on License Compliance in Code Generation [27.368667936460508]
Large Language Models (LLMs) have revolutionized code generation, leading to widespread adoption of AI coding tools by developers. LLMs can generate license-protected code without providing the necessary license information, leading to potential intellectual property violations during software production. This paper addresses the critical, yet underexplored, issue of license compliance in LLM-generated code.
arXiv Detail & Related papers (2024-08-05T14:09:30Z)
- FOFO: A Benchmark to Evaluate LLMs' Format-Following Capability [70.84333325049123]
This paper presents FoFo, a pioneering benchmark for evaluating large language models' (LLMs) ability to follow complex, domain-specific formats.
arXiv Detail & Related papers (2024-02-28T19:23:27Z)
- Identifying Factual Inconsistencies in Summaries: Grounding LLM Inference via Task Taxonomy [48.29181662640212]
Factual inconsistencies pose a significant hurdle for faithful summarization by generative models.
We consolidate key error types of inconsistent facts in summaries, and incorporate them to facilitate both the zero-shot and supervised paradigms of LLMs.
arXiv Detail & Related papers (2024-02-20T08:41:23Z)
- Building Real-World Meeting Summarization Systems using Large Language Models: A Practical Perspective [8.526956860672698]
This paper studies how to effectively build meeting summarization systems for real-world usage using large language models (LLMs).
Our findings reveal that most closed-source LLMs are generally better in terms of performance.
Much smaller open-source models like LLaMA-2 (7B and 13B) could still achieve performance comparable to the large closed-source models, even in zero-shot scenarios.
arXiv Detail & Related papers (2023-10-30T02:25:21Z)
- Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs [59.596335292426105]
This paper collects the first open-source dataset to evaluate safeguards in large language models.
We train several BERT-like classifiers to achieve results comparable with GPT-4 on automatic safety evaluation.
arXiv Detail & Related papers (2023-08-25T14:02:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.