LawBench: Benchmarking Legal Knowledge of Large Language Models
- URL: http://arxiv.org/abs/2309.16289v1
- Date: Thu, 28 Sep 2023 09:35:59 GMT
- Authors: Zhiwei Fei, Xiaoyu Shen, Dawei Zhu, Fengzhe Zhou, Zhuo Han, Songyang
Zhang, Kai Chen, Zongwen Shen, Jidong Ge
- Abstract summary: Large language models (LLMs) have demonstrated strong capabilities in various aspects.
It is unclear how much legal knowledge they possess and whether they can reliably perform legal-related tasks.
LawBench has been meticulously crafted to provide a precise assessment of LLMs' legal capabilities across three cognitive levels.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have demonstrated strong capabilities in various
aspects. However, when applying them to the highly specialized, safety-critical
legal domain, it is unclear how much legal knowledge they possess and whether
they can reliably perform legal-related tasks. To address this gap, we propose
a comprehensive evaluation benchmark, LawBench, meticulously crafted to
provide a precise assessment of LLMs' legal capabilities across three
cognitive levels: (1) Legal knowledge memorization: whether LLMs can memorize
needed legal concepts, articles and facts; (2) Legal knowledge understanding:
whether LLMs can comprehend entities, events and relationships within legal
text; (3) Legal knowledge applying: whether LLMs can properly utilize their
legal knowledge and perform the necessary reasoning steps to solve realistic legal
tasks. LawBench contains 20 diverse tasks covering 5 task types: single-label
classification (SLC), multi-label classification (MLC), regression, extraction
and generation. We perform extensive evaluations of 51 LLMs on LawBench,
including 20 multilingual LLMs, 22 Chinese-oriented LLMs, and 9 legal-specific
LLMs. The results show that GPT-4 remains the best-performing LLM in the legal
domain, surpassing the others by a significant margin. While fine-tuning LLMs
on legal-specific text brings some improvement, we are still a long way
from obtaining usable, reliable LLMs for legal tasks. All data, model
predictions, and evaluation code are released at
https://github.com/open-compass/LawBench/. We hope this benchmark provides an
in-depth understanding of LLMs' domain-specific capabilities and speeds up the
development of LLMs in the legal domain.
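The five task types above each call for a different scoring rule. Below is a minimal sketch of how such a per-task-type evaluation might be wired up; the metric choices, task-type keys, and JSONL record layout are illustrative assumptions, not the repository's actual API.

```python
# Hypothetical sketch of a per-task-type metric dispatch for a
# LawBench-style benchmark. Task-type keys and the JSONL record
# layout are illustrative assumptions, not the repository's format.
import json
from typing import Callable

def accuracy(preds: list, golds: list) -> float:
    """Exact-match accuracy for single-label classification (SLC)."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def micro_f1(preds: list, golds: list) -> float:
    """Micro-averaged F1 over label sets for multi-label classification (MLC)."""
    tp = sum(len(p & g) for p, g in zip(preds, golds))
    fp = sum(len(p - g) for p, g in zip(preds, golds))
    fn = sum(len(g - p) for p, g in zip(preds, golds))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# One metric per task type; regression, extraction, and generation
# metrics (e.g. log-distance, span F1, ROUGE) would slot in the same way.
METRICS: dict = {
    "slc": accuracy,
    "mlc": micro_f1,
}

def score_task(pred_file: str, task_type: str) -> float:
    """Load {"prediction": ..., "gold": ...} JSONL records and apply the metric."""
    with open(pred_file, encoding="utf-8") as f:
        records = [json.loads(line) for line in f]
    preds = [r["prediction"] for r in records]
    golds = [r["gold"] for r in records]
    return METRICS[task_type](preds, golds)
```

A dispatch table like METRICS keeps scoring uniform across the 20 tasks: adding a task type means registering one metric function rather than branching throughout the evaluation loop.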
Related papers
- Scaling Laws for Fact Memorization of Large Language Models [67.94080978627363]
We analyze the scaling laws for Large Language Models' fact knowledge and their behaviors of memorizing different types of facts.
We find that LLMs' fact-knowledge capacity scales linearly with model size and
follows a negative-exponential law in training epochs.
Our findings reveal the capacity and characteristics of LLMs' fact knowledge learning, which provide directions for LLMs' fact knowledge augmentation.
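Read literally, that relationship says capacity grows linearly in parameter count and saturates exponentially in training epochs. Here is a minimal curve-fitting sketch under that assumed functional form; the equation and toy data are illustrative, not the paper's fitted law.

```python
# Hypothetical scaling-law fit: capacity linear in model size N,
# saturating (negative-exponential) in training epochs E. The exact
# functional form and data are assumptions, not the paper's results.
import numpy as np
from scipy.optimize import curve_fit

def capacity(X, a, b, k):
    N, E = X  # N: parameter count, E: training epochs
    return (a * N + b) * (1.0 - np.exp(-k * E))

# Toy observations: (params, epochs) -> measured facts memorized.
N = np.array([1e8, 1e8, 1e9, 1e9, 1e10, 1e10])
E = np.array([1, 10, 1, 10, 1, 10])
y = np.array([2e5, 9e5, 2e6, 9e6, 2e7, 9e7])

(a, b, k), _ = curve_fit(capacity, (N, E), y, p0=[1e-2, 0.0, 0.3])
print(f"capacity(N, E) ~ ({a:.2e}*N + {b:.2e}) * (1 - exp(-{k:.2f}*E))")
```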
arXiv Detail & Related papers (2024-06-22T03:32:09Z)
- InternLM-Law: An Open Source Chinese Legal Large Language Model [72.2589401309848]
InternLM-Law is a specialized LLM tailored for addressing diverse legal queries related to Chinese laws.
We meticulously construct a dataset in the Chinese legal domain, encompassing over 1 million queries.
InternLM-Law achieves the highest average performance on LawBench, outperforming state-of-the-art models, including GPT-4, on 13 out of 20 subtasks.
arXiv Detail & Related papers (2024-06-21T06:19:03Z)
- Large Language Models: A Survey [69.72787936480394]
Large Language Models (LLMs) have drawn a lot of attention due to their strong performance on a wide range of natural language tasks.
LLMs acquire their general-purpose language understanding and generation
abilities by training billions of parameters on massive amounts of text data.
arXiv Detail & Related papers (2024-02-09T05:37:09Z)
- BLT: Can Large Language Models Handle Basic Legal Text? [44.89873147675516]
GPT-4 and Claude perform poorly on basic legal text handling.
Their poor performance on this benchmark casts doubt on their reliability,
as-is, for legal practice.
Fine-tuning on the benchmark's training set brings even a small model to
near-perfect performance.
arXiv Detail & Related papers (2023-11-16T09:09:22Z)
- A Comprehensive Evaluation of Large Language Models on Legal Judgment Prediction [60.70089334782383]
Large language models (LLMs) have demonstrated great potential for domain-specific applications.
Recent disputes over GPT-4's law evaluation raise questions about LLMs'
performance in real-world legal tasks.
We design practical baseline solutions based on LLMs and test them on the task
of legal judgment prediction.
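One practical detail such LLM baselines must handle is mapping free-form model text back onto a fixed verdict label set. A minimal sketch follows; the charge labels and matching rule are illustrative assumptions, not the paper's method.

```python
# Hypothetical post-processing for an LLM-based legal judgment
# prediction baseline: map free-form output to a fixed charge label.
# The label set and matching rule are illustrative assumptions.
CHARGES = ["theft", "fraud", "intentional injury", "drug trafficking"]

def parse_verdict(output: str, labels=CHARGES, default="unknown") -> str:
    """Pick the earliest-mentioned label in the model's raw output."""
    lowered = output.lower()
    # Earliest mention wins, which resolves multi-label mentions simply.
    hits = [(lowered.find(label), label) for label in labels if label in lowered]
    return min(hits)[1] if hits else default

print(parse_verdict("The defendant is most likely guilty of fraud."))  # fraud
```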
arXiv Detail & Related papers (2023-10-18T07:38:04Z)
- LAiW: A Chinese Legal Large Language Models Benchmark [17.66376880475554]
General-domain and legal-domain LLMs have demonstrated strong performance on
various LegalAI tasks.
We are the first to build a Chinese legal LLM benchmark, LAiW, based on the
logic of legal practice.
arXiv Detail & Related papers (2023-10-09T11:19:55Z)
- A Short Survey of Viewing Large Language Models in Legal Aspect [0.0]
Large language models (LLMs) have transformed many fields, including natural language processing, computer vision, and reinforcement learning.
The integration of LLMs into the legal field has also raised several legal problems, including privacy concerns, bias, and explainability.
arXiv Detail & Related papers (2023-03-16T08:01:22Z)
- Legal Prompt Engineering for Multilingual Legal Judgement Prediction [2.539568419434224]
Legal Prompt Engineering (LPE), or Legal Prompting, is the process of guiding
a large language model (LLM) through a natural legal language processing task.
We investigate the performance of zero-shot LPE on the facts of cases from the
European Court of Human Rights (in English) and the Federal Supreme Court of
Switzerland (in German, French, and Italian).
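In practice, zero-shot LPE of this kind amounts to wrapping the case facts in a per-language instruction template. A minimal sketch follows; the templates and the binary violation label space are illustrative assumptions, not the paper's actual prompts.

```python
# Hypothetical zero-shot legal prompt builder for judgment prediction.
# Templates and the yes/no violation label space are illustrative
# assumptions, not the paper's actual prompts.
TEMPLATES = {
    "en": "Case facts:\n{facts}\n\nWas there a violation? Answer yes or no:",
    "de": "Sachverhalt:\n{facts}\n\nLiegt eine Verletzung vor? Antworte mit ja oder nein:",
    "fr": "Faits de l'affaire :\n{facts}\n\nY a-t-il eu violation ? Répondez oui ou non :",
    "it": "Fatti del caso:\n{facts}\n\nC'è stata una violazione? Rispondi sì o no:",
}

def build_prompt(facts: str, lang: str = "en") -> str:
    """Wrap case facts in a per-language zero-shot instruction."""
    return TEMPLATES[lang].format(facts=facts)

print(build_prompt("The applicant was detained without judicial review.", "en"))
```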
arXiv Detail & Related papers (2022-12-05T12:17:02Z)