Lawma: The Power of Specialization for Legal Tasks
- URL: http://arxiv.org/abs/2407.16615v1
- Date: Tue, 23 Jul 2024 16:23:04 GMT
- Title: Lawma: The Power of Specialization for Legal Tasks
- Authors: Ricardo Dominguez-Olmedo, Vedant Nanda, Rediet Abebe, Stefan Bechtold, Christoph Engel, Jens Frankenreiter, Krishna Gummadi, Moritz Hardt, Michael Livermore
- Abstract summary: We study 260 legal text classification tasks, nearly all new to the machine learning community.
A lightly fine-tuned Llama 3 model vastly outperforms GPT-4 on almost all tasks, typically by double-digit percentage points.
We find that larger models respond better to fine-tuning than smaller models.
- Score: 18.45967769381101
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Annotation and classification of legal text are central components of empirical legal research. Traditionally, these tasks are often delegated to trained research assistants. Motivated by the advances in language modeling, empirical legal scholars are increasingly turning to prompting commercial models, hoping that it will alleviate the significant cost of human annotation. Despite growing use, our understanding of how to best utilize large language models for legal tasks remains limited. We conduct a comprehensive study of 260 legal text classification tasks, nearly all new to the machine learning community. Starting from GPT-4 as a baseline, we show that it has non-trivial but highly varied zero-shot accuracy, often exhibiting performance that may be insufficient for legal work. We then demonstrate that a lightly fine-tuned Llama 3 model vastly outperforms GPT-4 on almost all tasks, typically by double-digit percentage points. We find that larger models respond better to fine-tuning than smaller models. A few tens to hundreds of examples suffice to achieve high classification accuracy. Notably, we can fine-tune a single model on all 260 tasks simultaneously at a small loss in accuracy relative to having a separate model for each task. Our work points to a viable alternative to the predominant practice of prompting commercial models. For concrete legal tasks with some available labeled data, researchers are better off using a fine-tuned open-source model.
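The recipe described in the abstract — light fine-tuning of an open model on a modest number of labeled examples per classification task — can be illustrated with a short sketch. The following is not the authors' released code: the checkpoint name, the label set, and the LoRA and training hyperparameters are illustrative assumptions, using the Hugging Face transformers, datasets, and peft libraries.

```python
# A minimal sketch, assuming a Llama-3-style base model and a single legal
# classification task with a few hundred labeled examples. Checkpoint, labels,
# and hyperparameters are illustrative, not the paper's actual configuration.
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL = "meta-llama/Meta-Llama-3-8B"          # assumed base checkpoint
LABELS = ["conservative", "liberal"]          # hypothetical task: decision direction

tokenizer = AutoTokenizer.from_pretrained(MODEL)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=len(LABELS))
model.config.pad_token_id = tokenizer.pad_token_id

# "Light" fine-tuning: train low-rank adapters instead of all base parameters.
model = get_peft_model(model, LoraConfig(task_type="SEQ_CLS", r=16, lora_alpha=32))

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

# In practice: a few tens to hundreds of expert-labeled opinions per task.
train_data = Dataset.from_dict(
    {"text": ["<full text of a court opinion>"], "label": [0]}
).map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="legal-task-model", num_train_epochs=3,
                           per_device_train_batch_size=4, learning_rate=1e-4),
    train_dataset=train_data,
    tokenizer=tokenizer,
)
trainer.train()
```

The abstract also notes that a single model can be fine-tuned on all 260 tasks at once with only a small loss in accuracy relative to per-task models; the same setup sketched above extends to that case by pooling the labeled data across tasks.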
Related papers
- A Hitchhiker's Guide to Scaling Law Estimation [56.06982415792523]
Scaling laws predict the loss of a target machine learning model by extrapolating from easier-to-train models with fewer parameters or smaller training sets.
We estimate more than 1000 scaling laws, then derive a set of best practices for estimating scaling laws in new model families.
arXiv Detail & Related papers (2024-10-15T17:59:10Z)
- FairPIVARA: Reducing and Assessing Biases in CLIP-Based Multimodal Models [5.748694060126043]
We evaluate four different types of discriminatory practices within visual-language models.
We introduce FairPIVARA, a method to reduce them by removing the most affected dimensions of feature embeddings.
The application of FairPIVARA has led to a significant reduction of up to 98% in observed biases.
arXiv Detail & Related papers (2024-09-28T22:49:22Z)
- Revisiting the Superficial Alignment Hypothesis [0.9831489366502302]
The Superficial Alignment Hypothesis posits that almost all of a language model's abilities and knowledge are learned during pre-training.
We re-examine these claims by studying the scaling behavior of post-training with increasing finetuning examples.
arXiv Detail & Related papers (2024-09-27T22:14:10Z)
- The Art of Saying No: Contextual Noncompliance in Language Models [123.383993700586]
We introduce a comprehensive taxonomy of contextual noncompliance describing when and how models should not comply with user requests.
Our taxonomy spans a wide range of categories including incomplete, unsupported, indeterminate, and humanizing requests.
To test noncompliance capabilities of language models, we use this taxonomy to develop a new evaluation suite of 1000 noncompliance prompts.
arXiv Detail & Related papers (2024-07-02T07:12:51Z)
- Can Large Language Models Follow Concept Annotation Guidelines? A Case Study on Scientific and Financial Domains [19.814974042343028]
We examine the capacity of instruction-tuned language models to follow in-context concept guidelines for sentence labeling tasks.
Our results show that although concept definitions consistently help task performance, only the larger models show even a limited ability to work under counterfactual contexts.
arXiv Detail & Related papers (2023-11-15T05:11:26Z)
- Modeling Legal Reasoning: LM Annotation at the Edge of Human Agreement [3.537369004801589]
We study the classification of legal reasoning according to jurisprudential philosophy.
We use a novel dataset of historical United States Supreme Court opinions annotated by a team of domain experts.
We find that generative models perform poorly when given the same instructions as those presented to human annotators.
arXiv Detail & Related papers (2023-10-27T19:27:59Z)
- Specializing Smaller Language Models towards Multi-Step Reasoning [56.78474185485288]
We show that multi-step reasoning abilities can be distilled down from GPT-3.5 ($\ge$ 175B) to T5 variants ($\le$ 11B).
We propose model specialization, to specialize the model's ability towards a target task.
arXiv Detail & Related papers (2023-01-30T08:51:19Z)
- Few-shot Instruction Prompts for Pretrained Language Models to Detect Social Biases [55.45617404586874]
We propose a few-shot instruction-based method for prompting pre-trained language models (LMs).
We show that large LMs can detect different types of fine-grained biases with similar and sometimes superior accuracy to fine-tuned models.
arXiv Detail & Related papers (2021-12-15T04:19:52Z)
- Exploring Strategies for Generalizable Commonsense Reasoning with Pre-trained Models [62.28551903638434]
We measure the impact of three different adaptation methods on the generalization and accuracy of models.
Experiments with two models show that fine-tuning performs best, by learning both the content and the structure of the task, but suffers from overfitting and limited generalization to novel answers.
We observe that alternative adaptation methods like prefix-tuning have comparable accuracy, but generalize better to unseen answers and are more robust to adversarial splits.
arXiv Detail & Related papers (2021-09-07T03:13:06Z)
- Measuring Massive Multitask Language Understanding [79.6985576698597]
The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more.
The largest GPT-3 model improves over random chance by almost 20 percentage points on average.
Models also have lopsided performance and frequently do not know when they are wrong.
arXiv Detail & Related papers (2020-09-07T17:59:25Z)