LawBench: Benchmarking Legal Knowledge of Large Language Models
- URL: http://arxiv.org/abs/2309.16289v1
- Date: Thu, 28 Sep 2023 09:35:59 GMT
- Authors: Zhiwei Fei, Xiaoyu Shen, Dawei Zhu, Fengzhe Zhou, Zhuo Han, Songyang
Zhang, Kai Chen, Zongwen Shen, Jidong Ge
- Abstract summary: Large language models (LLMs) have demonstrated strong capabilities in various aspects.
It is unclear how much legal knowledge they possess and whether they can reliably perform legal-related tasks.
LawBench has been meticulously crafted to provide a precise assessment of LLMs' legal capabilities across three cognitive levels.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have demonstrated strong capabilities in various
aspects. However, when applying them to the highly specialized, safety-critical
legal domain, it is unclear how much legal knowledge they possess and whether
they can reliably perform legal-related tasks. To address this gap, we propose
a comprehensive evaluation benchmark, LawBench, meticulously crafted to
provide a precise assessment of LLMs' legal capabilities across three
cognitive levels: (1) Legal knowledge memorization: whether LLMs can memorize
needed legal concepts, articles and facts; (2) Legal knowledge understanding:
whether LLMs can comprehend entities, events and relationships within legal
text; (3) Legal knowledge applying: whether LLMs can properly utilize their
legal knowledge and perform the necessary reasoning steps to solve realistic legal
tasks. LawBench contains 20 diverse tasks covering 5 task types: single-label
classification (SLC), multi-label classification (MLC), regression, extraction
and generation. We perform extensive evaluations of 51 LLMs on LawBench,
including 20 multilingual LLMs, 22 Chinese-oriented LLMs, and 9 legal-specific
LLMs. The results show that GPT-4 remains the best-performing LLM in the legal
domain, surpassing the others by a significant margin. While fine-tuning LLMs
on legal-specific text brings some improvement, we are still a long way
from obtaining usable, reliable LLMs for legal tasks. All data, model
predictions, and evaluation code are released at
https://github.com/open-compass/LawBench/. We hope this benchmark provides an
in-depth understanding of LLMs' domain-specific capabilities and speeds up the
development of LLMs in the legal domain.
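The five task types above each call for a different scoring rule. Below is a minimal sketch of how such a per-task-type evaluation might be wired up; the metric choices, task-type keys, and JSONL record layout are illustrative assumptions, not the repository's actual API.

```python
# Hypothetical sketch of a per-task-type metric dispatch for a
# LawBench-style benchmark. Task-type keys and the JSONL record
# layout are illustrative assumptions, not the repository's format.
import json
from typing import Callable

def accuracy(preds: list, golds: list) -> float:
    """Exact-match accuracy for single-label classification (SLC)."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def micro_f1(preds: list, golds: list) -> float:
    """Micro-averaged F1 over label sets for multi-label classification (MLC)."""
    tp = sum(len(p & g) for p, g in zip(preds, golds))
    fp = sum(len(p - g) for p, g in zip(preds, golds))
    fn = sum(len(g - p) for p, g in zip(preds, golds))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# One metric per task type; regression, extraction, and generation
# metrics (e.g. log-distance, span F1, ROUGE) would slot in the same way.
METRICS: dict = {
    "slc": accuracy,
    "mlc": micro_f1,
}

def score_task(pred_file: str, task_type: str) -> float:
    """Load {"prediction": ..., "gold": ...} JSONL records and apply the metric."""
    with open(pred_file, encoding="utf-8") as f:
        records = [json.loads(line) for line in f]
    preds = [r["prediction"] for r in records]
    golds = [r["gold"] for r in records]
    return METRICS[task_type](preds, golds)
```

A dispatch table like METRICS keeps scoring uniform across the 20 tasks: adding a task type means registering one metric function rather than branching throughout the evaluation loop.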
Related papers
- Scaling Laws for Fact Memorization of Large Language Models [67.94080978627363]
We analyze the scaling laws for Large Language Models' fact knowledge and their behaviors of memorizing different types of facts.
We find that LLMs' fact-knowledge capacity scales linearly with model size and
follows a negative-exponential law in training epochs.
Our findings reveal the capacity and characteristics of LLMs' fact knowledge learning, which provide directions for LLMs' fact knowledge augmentation.
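Read literally, that relationship says capacity grows linearly in parameter count and saturates exponentially in training epochs. Here is a minimal curve-fitting sketch under that assumed functional form; the equation and toy data are illustrative, not the paper's fitted law.

```python
# Hypothetical scaling-law fit: capacity linear in model size N,
# saturating (negative-exponential) in training epochs E. The exact
# functional form and data are assumptions, not the paper's results.
import numpy as np
from scipy.optimize import curve_fit

def capacity(X, a, b, k):
    N, E = X  # N: parameter count, E: training epochs
    return (a * N + b) * (1.0 - np.exp(-k * E))

# Toy observations: (params, epochs) -> measured facts memorized.
N = np.array([1e8, 1e8, 1e9, 1e9, 1e10, 1e10])
E = np.array([1, 10, 1, 10, 1, 10])
y = np.array([2e5, 9e5, 2e6, 9e6, 2e7, 9e7])

(a, b, k), _ = curve_fit(capacity, (N, E), y, p0=[1e-2, 0.0, 0.3])
print(f"capacity(N, E) ~ ({a:.2e}*N + {b:.2e}) * (1 - exp(-{k:.2f}*E))")
```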
arXiv Detail & Related papers (2024-06-22T03:32:09Z)
- InternLM-Law: An Open Source Chinese Legal Large Language Model [72.2589401309848]
InternLM-Law is a specialized LLM tailored for addressing diverse legal queries related to Chinese laws.
We meticulously construct a dataset in the Chinese legal domain, encompassing over 1 million queries.
InternLM-Law achieves the highest average performance on LawBench, outperforming state-of-the-art models, including GPT-4, on 13 out of 20 subtasks.
arXiv Detail & Related papers (2024-06-21T06:19:03Z)
- Large Language Models: A Survey [69.72787936480394]
Large Language Models (LLMs) have drawn a lot of attention due to their strong performance on a wide range of natural language tasks.
LLMs acquire their general-purpose language understanding and generation
abilities by training billions of parameters on massive amounts of text data.
arXiv Detail & Related papers (2024-02-09T05:37:09Z)
- BLT: Can Large Language Models Handle Basic Legal Text? [44.89873147675516]
GPT-4 and Claude perform poorly on basic legal text handling.
Their poor performance on this benchmark casts doubt on their reliability,
as-is, for legal practice.
Fine-tuning on the benchmark's training set brings even a small model to
near-perfect performance.
arXiv Detail & Related papers (2023-11-16T09:09:22Z)
- A Comprehensive Evaluation of Large Language Models on Legal Judgment Prediction [60.70089334782383]
Large language models (LLMs) have demonstrated great potential for domain-specific applications.
Recent disputes over GPT-4's law evaluation raise questions about LLMs'
performance in real-world legal tasks.
We design practical baseline solutions based on LLMs and test them on the task
of legal judgment prediction.
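One practical detail such LLM baselines must handle is mapping free-form model text back onto a fixed verdict label set. A minimal sketch follows; the charge labels and matching rule are illustrative assumptions, not the paper's method.

```python
# Hypothetical post-processing for an LLM-based legal judgment
# prediction baseline: map free-form output to a fixed charge label.
# The label set and matching rule are illustrative assumptions.
CHARGES = ["theft", "fraud", "intentional injury", "drug trafficking"]

def parse_verdict(output: str, labels=CHARGES, default="unknown") -> str:
    """Pick the earliest-mentioned label in the model's raw output."""
    lowered = output.lower()
    # Earliest mention wins, which resolves multi-label mentions simply.
    hits = [(lowered.find(label), label) for label in labels if label in lowered]
    return min(hits)[1] if hits else default

print(parse_verdict("The defendant is most likely guilty of fraud."))  # fraud
```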
arXiv Detail & Related papers (2023-10-18T07:38:04Z)
- LAiW: A Chinese Legal Large Language Models Benchmark [17.66376880475554]
General-domain and legal-domain LLMs have demonstrated strong performance on
various LegalAI tasks.
We are the first to build a Chinese legal LLM benchmark, LAiW, based on the
logic of legal practice.
arXiv Detail & Related papers (2023-10-09T11:19:55Z)
- A Short Survey of Viewing Large Language Models in Legal Aspect [0.0]
Large language models (LLMs) have transformed many fields, including natural language processing, computer vision, and reinforcement learning.
The integration of LLMs into the legal field has also raised several legal problems, including privacy concerns, bias, and explainability.
arXiv Detail & Related papers (2023-03-16T08:01:22Z)
- Legal Prompt Engineering for Multilingual Legal Judgement Prediction [2.539568419434224]
Legal Prompt Engineering (LPE), or Legal Prompting, is the process of guiding
a large language model (LLM) through a natural legal language processing task.
We investigate the performance of zero-shot LPE on the facts of cases from the
European Court of Human Rights (in English) and the Federal Supreme Court of
Switzerland (in German, French, and Italian).
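In practice, zero-shot LPE of this kind amounts to wrapping the case facts in a per-language instruction template. A minimal sketch follows; the templates and the binary violation label space are illustrative assumptions, not the paper's actual prompts.

```python
# Hypothetical zero-shot legal prompt builder for judgment prediction.
# Templates and the yes/no violation label space are illustrative
# assumptions, not the paper's actual prompts.
TEMPLATES = {
    "en": "Case facts:\n{facts}\n\nWas there a violation? Answer yes or no:",
    "de": "Sachverhalt:\n{facts}\n\nLiegt eine Verletzung vor? Antworte mit ja oder nein:",
    "fr": "Faits de l'affaire :\n{facts}\n\nY a-t-il eu violation ? Répondez oui ou non :",
    "it": "Fatti del caso:\n{facts}\n\nC'è stata una violazione? Rispondi sì o no:",
}

def build_prompt(facts: str, lang: str = "en") -> str:
    """Wrap case facts in a per-language zero-shot instruction."""
    return TEMPLATES[lang].format(facts=facts)

print(build_prompt("The applicant was detained without judicial review.", "en"))
```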
arXiv Detail & Related papers (2022-12-05T12:17:02Z)