Related papers: Evaluating Test-Time Scaling LLMs for Legal Reasoning: OpenAI o1, DeepSeek-R1, and Beyond

Evaluating Test-Time Scaling LLMs for Legal Reasoning: OpenAI o1, DeepSeek-R1, and Beyond

URL: http://arxiv.org/abs/2503.16040v1
Date: Thu, 20 Mar 2025 11:14:39 GMT
Title: Evaluating Test-Time Scaling LLMs for Legal Reasoning: OpenAI o1, DeepSeek-R1, and Beyond
Authors: Yaoyao Yu, Leilei Gan, Yinghao Hu, Bin Wei, Kun Kuang, Fei Wu,
Abstract summary: Test-Time Scaling Large Language Models (LLMs) have demonstrated exceptional capabilities across various domains and tasks, particularly in reasoning.<n>We present a preliminary evaluation of LLMs in various legal scenarios, covering both Chinese and English legal tasks.<n>Our findings indicate that, despite DeepSeek-R1 and OpenAI o1 being among the most powerful models, their legal reasoning capabilities are still lacking.
Score: 29.03425022434831
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recently, Test-Time Scaling Large Language Models (LLMs), such as DeepSeek-R1 and OpenAI o1, have demonstrated exceptional capabilities across various domains and tasks, particularly in reasoning. While these models have shown impressive performance on general language tasks, their effectiveness in specialized fields like legal remains unclear. To address this, we present a preliminary evaluation of LLMs in various legal scenarios, covering both Chinese and English legal tasks. Our analysis includes 9 LLMs and 17 legal tasks, with a focus on newly published and more complex challenges such as multi-defendant legal judgments and legal argument reasoning. Our findings indicate that, despite DeepSeek-R1 and OpenAI o1 being among the most powerful models, their legal reasoning capabilities are still lacking. Specifically, these models score below 80\% on seven Chinese legal reasoning tasks and below 80\% on two English legal reasoning tasks. This suggests that, even among the most advanced reasoning models, legal reasoning abilities remain underdeveloped.

Related papers

Ready Jurist One: Benchmarking Language Agents for Legal Intelligence in Dynamic Environments [24.249035670782092]
We introduce J1-ENVS, the first interactive and dynamic legal environment tailored for LLM-based agents.<n>It comprises six representative scenarios from Chinese legal practices across three levels of environmental complexity.<n>We also introduce J1-EVAL, a fine-grained evaluation framework, to assess both task performance and procedural compliance.
arXiv Detail & Related papers (2025-07-05T13:31:21Z)
LEXam: Benchmarking Legal Reasoning on 340 Law Exams [61.344330783528015]
LEXam is a novel benchmark derived from 340 law exams spanning 116 law school courses across a range of subjects and degree levels.<n>The dataset comprises 4,886 law exam questions in English and German, including 2,841 long-form, open-ended questions and 2,045 multiple-choice questions.
arXiv Detail & Related papers (2025-05-19T08:48:12Z)
J&H: Evaluating the Robustness of Large Language Models Under Knowledge-Injection Attacks in Legal Domain [12.550611136062722]
We propose a method of legal knowledge injection attacks for robustness testing. The aim of the framework is to explore whether LLMs perform deductive reasoning when accomplishing legal tasks. We have collected mistakes that legal experts might make in judicial decisions in the real world.
arXiv Detail & Related papers (2025-03-24T05:42:05Z)
LexPro-1.0 Technical Report [19.83460019437367]
We introduce our first-generation reasoning model, LexPro-1.0, a large language model designed for the highly specialized Chinese legal domain.<n>To address this, we first compile millions of legal documents covering over 20 types of crimes from 31 provinces in China for model training.<n>The model further undergoes large-scale reinforcement learning without additional supervision, emphasizing the enhancement of its reasoning capabilities and explainability.
arXiv Detail & Related papers (2025-03-10T05:54:23Z)
AnnoCaseLaw: A Richly-Annotated Dataset For Benchmarking Explainable Legal Judgment Prediction [56.797874973414636]
AnnoCaseLaw is a first-of-its-kind dataset of 471 meticulously annotated U.S. Appeals Court negligence cases.<n>Our dataset lays the groundwork for more human-aligned, explainable Legal Judgment Prediction models.<n>Results demonstrate that LJP remains a formidable task, with application of legal precedent proving particularly difficult.
arXiv Detail & Related papers (2025-02-28T19:14:48Z)
LegalAgentBench: Evaluating LLM Agents in Legal Domain [53.70993264644004]
LegalAgentBench is a benchmark specifically designed to evaluate LLM Agents in the Chinese legal domain.<n>LegalAgentBench includes 17 corpora from real-world legal scenarios and provides 37 tools for interacting with external knowledge.
arXiv Detail & Related papers (2024-12-23T04:02:46Z)
Legal Evalutions and Challenges of Large Language Models [42.51294752406578]
We use the OPENAI o1 model as a case study to evaluate the performance of large models in applying legal provisions. We compare current state-of-the-art LLMs, including open-source, closed-source, and legal-specific models trained specifically for the legal domain.
arXiv Detail & Related papers (2024-11-15T12:23:12Z)
InternLM-Law: An Open Source Chinese Legal Large Language Model [72.2589401309848]
InternLM-Law is a specialized LLM tailored for addressing diverse legal queries related to Chinese laws. We meticulously construct a dataset in the Chinese legal domain, encompassing over 1 million queries. InternLM-Law achieves the highest average performance on LawBench, outperforming state-of-the-art models, including GPT-4, on 13 out of 20 subtasks.
arXiv Detail & Related papers (2024-06-21T06:19:03Z)
PARAMANU-AYN: Pretrain from scratch or Continual Pretraining of LLMs for Legal Domain Adaptation? [3.9018931027384056]
Paramanu-Ayn is a collection of legal language models trained exclusively on Indian legal case documents. Paramanu-Ayn was pretrained from scratch with a context size of 8192 on a single GPU for just 185 hours.
arXiv Detail & Related papers (2024-03-20T15:39:54Z)
BLT: Can Large Language Models Handle Basic Legal Text? [44.89873147675516]
GPT-4 and Claude perform poorly on basic legal text handling. Poor performance on benchmark casts into doubt their reliability as-is for legal practice. Fine-tuning on training set brings even a small model to near-perfect performance.
arXiv Detail & Related papers (2023-11-16T09:09:22Z)
Precedent-Enhanced Legal Judgment Prediction with LLM and Domain-Model Collaboration [52.57055162778548]
Legal Judgment Prediction (LJP) has become an increasingly crucial task in Legal AI. Precedents are the previous legal cases with similar facts, which are the basis for the judgment of the subsequent case in national legal systems. Recent advances in deep learning have enabled a variety of techniques to be used to solve the LJP task.
arXiv Detail & Related papers (2023-10-13T16:47:20Z)
LAiW: A Chinese Legal Large Language Models Benchmark [17.66376880475554]
General and legal domain LLMs have demonstrated strong performance in various tasks of LegalAI. We are the first to build the Chinese legal LLMs benchmark LAiW, based on the logic of legal practice.
arXiv Detail & Related papers (2023-10-09T11:19:55Z)
Lawformer: A Pre-trained Language Model for Chinese Legal Long Documents [56.40163943394202]
We release the Longformer-based pre-trained language model, named as Lawformer, for Chinese legal long documents understanding. We evaluate Lawformer on a variety of LegalAI tasks, including judgment prediction, similar case retrieval, legal reading comprehension, and legal question answering.
arXiv Detail & Related papers (2021-05-09T09:39:25Z)

This list is automatically generated from the titles and abstracts of the papers in this site.