Modeling Legal Reasoning: LM Annotation at the Edge of Human Agreement
- URL: http://arxiv.org/abs/2310.18440v1
- Date: Fri, 27 Oct 2023 19:27:59 GMT
- Title: Modeling Legal Reasoning: LM Annotation at the Edge of Human Agreement
- Authors: Rosamond Thalken, Edward H. Stiglitz, David Mimno, and Matthew Wilkens
- Abstract summary: We study the classification of legal reasoning according to jurisprudential philosophy.
We use a novel dataset of historical United States Supreme Court opinions annotated by a team of domain experts.
We find that generative models perform poorly when given instructions equal to the instructions presented to human annotators.
- Score: 3.537369004801589
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generative language models (LMs) are increasingly used for document
class-prediction tasks and promise enormous improvements in cost and
efficiency. Existing research often examines simple classification tasks, but
the capability of LMs to classify on complex or specialized tasks is less well
understood. We consider a highly complex task that is challenging even for
humans: the classification of legal reasoning according to jurisprudential
philosophy. Using a novel dataset of historical United States Supreme Court
opinions annotated by a team of domain experts, we systematically test the
performance of a variety of LMs. We find that generative models perform poorly
when given instructions (i.e. prompts) equal to the instructions presented to
human annotators through our codebook. Our strongest results derive from
fine-tuning models on the annotated dataset; the best performing model is an
in-domain model, LEGAL-BERT. We apply predictions from this fine-tuned model to
study historical trends in jurisprudence, an exercise that both aligns with
prominent qualitative historical accounts and points to areas of possible
refinement in those accounts. Our findings generally sound a note of caution in
the use of generative LMs on complex tasks without fine-tuning and point to the
continued relevance of human annotation-intensive classification methods.
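The abstract's strongest result comes from fine-tuning an in-domain model rather than prompting. Below is a minimal sketch of such a setup using the public LEGAL-BERT checkpoint on HuggingFace; the label set, example passages, and hyperparameters are illustrative assumptions, not the paper's actual codebook or training configuration.

```python
# Hypothetical fine-tuning sketch for classifying jurisprudential reasoning.
# The labels and example texts are placeholders; the paper's expert codebook
# and annotated Supreme Court passages are not reproduced here.
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL = "nlpaueb/legal-bert-base-uncased"  # public LEGAL-BERT checkpoint
LABELS = ["formalist", "realist", "neither"]  # assumed label set

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL, num_labels=len(LABELS))

class OpinionDataset(torch.utils.data.Dataset):
    """Wraps (passage, label) pairs into tokenized training examples."""
    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, padding="max_length",
                             max_length=512)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

# Toy stand-ins for expert-annotated opinion passages.
train_ds = OpinionDataset(
    ["The plain text of the statute controls our decision here.",
     "Practical consequences and public policy counsel a broader reading."],
    [0, 1])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="legalbert-jurisprudence",
                           num_train_epochs=3, learning_rate=2e-5,
                           per_device_train_batch_size=8),
    train_dataset=train_ds)
trainer.train()
```

Per the abstract, a fine-tuned in-domain classifier of this kind outperformed prompting generative LMs with the codebook instructions.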
Related papers
- Aggregation Artifacts in Subjective Tasks Collapse Large Language Models' Posteriors [74.04775677110179]
In-context Learning (ICL) has become the primary method for performing natural language tasks with Large Language Models (LLMs).
In this work, we examine whether this is the result of the aggregation used in corresponding datasets, where trying to combine low-agreement, disparate annotations might lead to annotation artifacts that create detrimental noise in the prompt.
Our results indicate that aggregation is a confounding factor in the modeling of subjective tasks, and we advocate focusing on modeling individuals instead.
arXiv Detail & Related papers (2024-10-17T17:16:00Z)
- Revisiting the Superficial Alignment Hypothesis [0.9831489366502302]
The Superficial Alignment Hypothesis posits that almost all of a language model's abilities and knowledge are learned during pre-training.
We re-examine these claims by studying the scaling behavior of post-training with increasing finetuning examples.
arXiv Detail & Related papers (2024-09-27T22:14:10Z)
- A Small Claims Court for the NLP: Judging Legal Text Classification Strategies With Small Datasets [0.0]
This paper investigates the best strategies for optimizing the use of a small labeled dataset and large amounts of unlabeled data.
We use records of demands submitted to a Brazilian Public Prosecutor's Office, aiming to assign each description to one of the subject categories.
The best result was obtained with Unsupervised Data Augmentation (UDA), which jointly uses BERT, data augmentation, and semi-supervised learning strategies (a sketch of the UDA objective appears after this list).
arXiv Detail & Related papers (2024-09-09T18:10:05Z)
- Lawma: The Power of Specialization for Legal Tasks [18.45967769381101]
We study 260 legal text classification tasks, nearly all new to the machine learning community.
A lightly fine-tuned Llama 3 model vastly outperforms GPT-4 on almost all tasks, typically by double-digit percentage points.
We find that larger models respond better to fine-tuning than smaller models.
arXiv Detail & Related papers (2024-07-23T16:23:04Z)
- Applicability of Large Language Models and Generative Models for Legal Case Judgement Summarization [5.0645491201288495]
In recent years, generative models, including abstractive summarization models and large language models (LLMs), have gained huge popularity.
In this paper, we explore the applicability of such models for legal case judgement summarization.
arXiv Detail & Related papers (2024-07-06T04:49:40Z)
- Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
Inspired by the principles of subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that human annotators prefer SQC-Score over the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z)
- Precedent-Enhanced Legal Judgment Prediction with LLM and Domain-Model Collaboration [52.57055162778548]
Legal Judgment Prediction (LJP) has become an increasingly crucial task in Legal AI.
Precedents are previous legal cases with similar facts, which serve as the basis for judgments in subsequent cases in national legal systems.
Recent advances in deep learning have enabled a variety of techniques to be used to solve the LJP task.
arXiv Detail & Related papers (2023-10-13T16:47:20Z)
- Generative Judge for Evaluating Alignment [84.09815387884753]
We propose a generative judge with 13B parameters, Auto-J, designed to address these challenges.
Our model is trained on user queries and LLM-generated responses from massive real-world scenarios.
Experimentally, Auto-J outperforms a series of strong competitors, including both open-source and closed-source models.
arXiv Detail & Related papers (2023-10-09T07:27:15Z)
- Did the Models Understand Documents? Benchmarking Models for Language Understanding in Document-Level Relation Extraction [2.4665182280122577]
Document-level relation extraction (DocRE) has attracted increasing research interest in recent years.
While models achieve consistent performance gains in DocRE, their underlying decision rules are still understudied.
In this paper, we take a first step toward answering this question and introduce a new perspective on comprehensively evaluating a model.
arXiv Detail & Related papers (2023-06-20T08:52:05Z)
- Fairness-guided Few-shot Prompting for Large Language Models [93.05624064699965]
In-context learning can suffer from high instability due to variations in training examples, example order, and prompt formats.
We introduce a metric to evaluate the predictive bias of a fixed prompt against labels or given attributes.
We propose a novel search strategy based on greedy search to identify the near-optimal prompt for improving the performance of in-context learning.
arXiv Detail & Related papers (2023-03-23T12:28:25Z)
- Large Language Models with Controllable Working Memory [64.71038763708161]
Large language models (LLMs) have led to a series of breakthroughs in natural language processing (NLP).
What further sets these models apart is the massive amounts of world knowledge they internalize during pretraining.
How the model's world knowledge interacts with the factual information presented in the context remains underexplored.
arXiv Detail & Related papers (2022-11-09T18:58:29Z)
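As referenced in the small-dataset legal text classification entry above, here is a minimal sketch of an Unsupervised Data Augmentation (UDA) objective: cross-entropy on the small labeled set plus a KL consistency term that pushes predictions on augmented unlabeled text toward predictions on the original. All names are illustrative assumptions, not the cited paper's implementation.

```python
import torch
import torch.nn.functional as F

def uda_loss(model, labeled_batch, labels, unlabeled_batch, augmented_batch,
             lam=1.0):
    """One UDA training step's loss for a HuggingFace-style classifier.

    `augmented_batch` is assumed to be a noised copy of `unlabeled_batch`,
    e.g. produced by back-translation or synonym replacement.
    """
    # Supervised term on the small labeled set.
    sup_loss = F.cross_entropy(model(**labeled_batch).logits, labels)

    # Consistency term: predictions on the original unlabeled text serve as
    # a fixed target (no gradient) for predictions on the augmented copy.
    with torch.no_grad():
        target = F.softmax(model(**unlabeled_batch).logits, dim=-1)
    aug_log_probs = F.log_softmax(model(**augmented_batch).logits, dim=-1)
    consistency = F.kl_div(aug_log_probs, target, reduction="batchmean")

    return sup_loss + lam * consistency
```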
This list is automatically generated from the titles and abstracts of the papers on this site. This site does not guarantee the quality of the information and is not responsible for any consequences arising from its use.