A Comprehensive Evaluation of Large Language Models on Legal Judgment Prediction
- URL: http://arxiv.org/abs/2310.11761v1
- Date: Wed, 18 Oct 2023 07:38:04 GMT
- Title: A Comprehensive Evaluation of Large Language Models on Legal Judgment
  Prediction
- Authors: Ruihao Shui, Yixin Cao, Xiang Wang and Tat-Seng Chua
- Abstract summary: Large language models (LLMs) have demonstrated great potential for domain-specific applications.
Recent disputes over GPT-4's law evaluation raise questions concerning their performance in real-world legal tasks.
We design practical baseline solutions based on LLMs and test on the task of legal judgment prediction.
- Score: 60.70089334782383
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract:   Large language models (LLMs) have demonstrated great potential for
domain-specific applications, such as the law domain. However, recent disputes
over GPT-4's law evaluation raise questions concerning their performance in
real-world legal tasks. To systematically investigate their competency in the
law, we design practical baseline solutions based on LLMs and test on the task
of legal judgment prediction. In our solutions, LLMs can work alone to answer
open questions or coordinate with an information retrieval (IR) system to learn
from similar cases or solve simplified multi-choice questions. We show that
similar cases and multi-choice options, namely label candidates, included in
prompts can help LLMs recall domain knowledge that is critical for expert
legal reasoning. We additionally present an intriguing paradox wherein an IR
system surpasses the performance of LLM+IR due to limited gains acquired by
weaker LLMs from powerful IR systems. In such cases, the role of LLMs becomes
redundant. Our evaluation pipeline can be easily extended into other tasks to
facilitate evaluations in other domains. Code is available at
https://github.com/srhthu/LM-CompEval-Legal
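
The four prompt settings described in the abstract (LLM alone on an open question, LLM with similar cases from an IR system, LLM with multi-choice label candidates, and both combined) could be assembled roughly as follows. This is a minimal sketch, not the authors' implementation: the function name `build_prompt`, the prompt wording, and the input format are all hypothetical and are not taken from the linked repository.

```python
from typing import List, Optional, Tuple


def build_prompt(
    fact: str,
    similar_cases: Optional[List[Tuple[str, str]]] = None,
    label_candidates: Optional[List[str]] = None,
) -> str:
    """Assemble a judgment-prediction prompt (hypothetical format).

    similar_cases: (fact, charge) pairs, assumed to come from an IR system.
    label_candidates: charge labels offered as multi-choice options.
    Leaving both as None yields the plain open-question setting.
    """
    parts: List[str] = []

    # Similar cases act as demonstrations placed before the query.
    for demo_fact, demo_charge in similar_cases or []:
        parts.append(f"Case: {demo_fact}\nCharge: {demo_charge}")

    query = f"Case: {fact}"
    if label_candidates:
        # Label candidates turn the open question into a multi-choice one.
        options = "\n".join(
            f"({i + 1}) {label}" for i, label in enumerate(label_candidates)
        )
        query += f"\nOptions:\n{options}"
    query += "\nCharge:"
    parts.append(query)

    return "\n\n".join(parts)
```

Calling `build_prompt(fact)` gives the open-question prompt; passing `similar_cases`, `label_candidates`, or both gives the other three settings.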
 
      
Related papers
- J&H: Evaluating the Robustness of Large Language Models Under Knowledge-Injection Attacks in Legal Domain [12.550611136062722]
 We propose a method of legal knowledge injection attacks for robustness testing.
The aim of the framework is to explore whether LLMs perform deductive reasoning when accomplishing legal tasks.
We have collected mistakes that legal experts might make in judicial decisions in the real world.
 arXiv  Detail & Related papers  (2025-03-24T05:42:05Z)
- Investigating the Shortcomings of LLMs in Step-by-Step Legal Reasoning [34.427730009102966]
 We develop an automated evaluation framework to identify reasoning errors and evaluate the performance of LLMs.
Our work will also serve as an evaluation framework that can be used in detailed error analysis of reasoning chains for logic-intensive complex tasks.
 arXiv  Detail & Related papers  (2025-02-08T19:49:32Z)
- RuleArena: A Benchmark for Rule-Guided Reasoning with LLMs in Real-World Scenarios [58.90106984375913]
 RuleArena is a novel and challenging benchmark designed to evaluate the ability of large language models (LLMs) to follow complex, real-world rules in reasoning.
Covering three practical domains -- airline baggage fees, NBA transactions, and tax regulations -- RuleArena assesses LLMs' proficiency in handling intricate natural language instructions.
 arXiv  Detail & Related papers  (2024-12-12T06:08:46Z)
- LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods [21.601196380989542]
 "LLMs-as-judges" are evaluators based on natural language responses.
This paper presents a comprehensive survey of the "LLMs-as-judges" paradigm from five key perspectives.
We aim to provide insights on the development and application of "LLMs-as-judges" in both research and practice.
 arXiv  Detail & Related papers  (2024-12-07T08:07:24Z)
- Can Large Language Models Grasp Legal Theories? Enhance Legal Reasoning with Insights from Multi-Agent Collaboration [27.047809869136458]
 Large Language Models (LLMs) could struggle to fully understand legal theories and perform legal reasoning tasks.
We introduce a challenging task (confusing charge prediction) to better evaluate LLMs' understanding of legal theories and reasoning capabilities.
We also propose a novel framework: Multi-Agent framework for improving complex Legal Reasoning capability.
 arXiv  Detail & Related papers  (2024-10-03T14:15:00Z)
- Knowledge-Infused Legal Wisdom: Navigating LLM Consultation through the Lens of Diagnostics and Positive-Unlabeled Reinforcement Learning [19.55121050697779]
 We propose the Diagnostic Legal Large Language Model (D3LM), which utilizes adaptive lawyer-like diagnostic questions to collect additional case information.
D3LM incorporates an innovative graph-based Positive-Unlabeled Reinforcement Learning (PURL) algorithm, enabling the generation of critical questions.
Our research also introduces a new English-language CVG dataset based on the US case law database.
 arXiv  Detail & Related papers  (2024-06-05T19:47:35Z)
- DnA-Eval: Enhancing Large Language Model Evaluation through Decomposition and Aggregation [75.81096662788254]
 Large Language Models (LLMs) are scalable and economical evaluators.
The question of how reliable these evaluators are has emerged as a crucial research question.
We propose Decompose and Aggregate, which breaks down the evaluation process into different stages based on pedagogical practices.
 arXiv  Detail & Related papers  (2024-05-24T08:12:30Z)
- Exploring the Nexus of Large Language Models and Legal Systems: A Short Survey [1.0770079992809338]
 The capabilities of Large Language Models (LLMs) are increasingly demonstrating unique roles in the legal sector.
This survey delves into the synergy between LLMs and the legal system, such as their applications in tasks like legal text comprehension, case retrieval, and analysis.
The survey showcases the latest advancements in fine-tuned legal LLMs tailored for various legal systems, along with legal datasets available for fine-tuning LLMs in various languages.
 arXiv  Detail & Related papers  (2024-04-01T08:35:56Z)
- Small Models, Big Insights: Leveraging Slim Proxy Models To Decide When and What to Retrieve for LLMs [60.40396361115776]
 This paper introduces a novel collaborative approach, namely SlimPLM, that detects missing knowledge in large language models (LLMs) with a slim proxy model.
We employ a proxy model which has far fewer parameters, and take its answers as heuristic answers.
Heuristic answers are then utilized to predict the knowledge required to answer the user question, as well as the known and unknown knowledge within the LLM.
 arXiv  Detail & Related papers  (2024-02-19T11:11:08Z)
- Rethinking Interpretability in the Era of Large Language Models [76.1947554386879]
 Large language models (LLMs) have demonstrated remarkable capabilities across a wide array of tasks.
The capability to explain in natural language allows LLMs to expand the scale and complexity of patterns that can be given to a human.
These new capabilities raise new challenges, such as hallucinated explanations and immense computational costs.
 arXiv  Detail & Related papers  (2024-01-30T17:38:54Z)
- Rephrase and Respond: Let Large Language Models Ask Better Questions for Themselves [57.974103113675795]
 We present a method named "Rephrase and Respond" (RaR) which allows Large Language Models to rephrase and expand questions posed by humans.
RaR serves as a simple yet effective prompting method for improving performance.
We show that RaR is complementary to the popular Chain-of-Thought (CoT) methods, both theoretically and empirically.
 arXiv  Detail & Related papers  (2023-11-07T18:43:34Z)
- LAiW: A Chinese Legal Large Language Models Benchmark [17.66376880475554]
 General and legal domain LLMs have demonstrated strong performance in various tasks of LegalAI.
We are the first to build the Chinese legal LLMs benchmark LAiW, based on the logic of legal practice.
 arXiv  Detail & Related papers  (2023-10-09T11:19:55Z)
- Investigating the Factual Knowledge Boundary of Large Language Models with Retrieval Augmentation [109.8527403904657]
 We show that large language models (LLMs) possess unwavering confidence in their knowledge and cannot handle the conflict between internal and external knowledge well.
Retrieval augmentation proves to be an effective approach in enhancing LLMs' awareness of knowledge boundaries.
We propose a simple method to dynamically utilize supporting documents with our judgement strategy.
 arXiv  Detail & Related papers  (2023-07-20T16:46:10Z)
- Large Language Models as Tax Attorneys: A Case Study in Legal Capabilities Emergence [5.07013500385659]
 This paper explores Large Language Models' (LLMs) capabilities in applying tax law.
Our experiments demonstrate emerging legal understanding capabilities, with improved performance in each subsequent OpenAI model release.
Findings indicate that LLMs, particularly when combined with prompting enhancements and the correct legal texts, can perform at high levels of accuracy but not yet at expert tax lawyer levels.
 arXiv  Detail & Related papers  (2023-06-12T12:40:48Z)
- Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate [85.3444184685235]
 We propose a Multi-Agent Debate (MAD) framework, in which multiple agents express their arguments in the state of "tit for tat" and a judge manages the debate process to obtain a final solution.
Our framework encourages divergent thinking in LLMs which would be helpful for tasks that require deep levels of contemplation.
 arXiv  Detail & Related papers  (2023-05-30T15:25:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
       
     