When Does Pretraining Help? Assessing Self-Supervised Learning for Law
and the CaseHOLD Dataset
- URL: http://arxiv.org/abs/2104.08671v1
- Date: Sun, 18 Apr 2021 00:57:16 GMT
- Title: When Does Pretraining Help? Assessing Self-Supervised Learning for Law
and the CaseHOLD Dataset
- Authors: Lucia Zheng, Neel Guha, Brandon R. Anderson, Peter Henderson, Daniel
E. Ho
- Abstract summary: We present a new dataset comprising over 53,000 multiple choice questions to identify the relevant holding of a cited case.
We show that domain pretraining may be warranted when the task exhibits sufficient similarity to the pretraining corpus.
Our findings inform when researchers should engage in resource-intensive pretraining and show that Transformer-based architectures, too, learn embeddings suggestive of distinct legal language.
- Score: 2.0924876102146714
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While self-supervised learning has made rapid advances in natural language
processing, it remains unclear when researchers should engage in
resource-intensive domain-specific pretraining (domain pretraining). The law,
puzzlingly, has yielded few documented instances of substantial gains to domain
pretraining in spite of the fact that legal language is widely seen to be
unique. We hypothesize that these results stem from the fact that existing
legal NLP tasks are too easy and fail to meet the conditions under which domain
pretraining can help. To address this, we first present CaseHOLD (Case
Holdings On Legal Decisions), a new dataset comprising over 53,000 multiple
choice questions to identify the relevant holding of a cited case. This dataset
presents a fundamental task to lawyers and is both legally meaningful and
difficult from an NLP perspective (F1 of 0.4 with a BiLSTM baseline). Second,
we assess performance gains on CaseHOLD and existing legal NLP datasets. While
a Transformer architecture (BERT) pretrained on a general corpus (Google Books
and Wikipedia) improves performance, domain pretraining (using a corpus of
approximately 3.5M decisions across all courts in the U.S., larger than
BERT's pretraining corpus) with a custom legal vocabulary exhibits the most substantial
performance gains with CaseHOLD (gain of 7.2% on F1, representing a 12%
improvement on BERT) and consistent performance gains across two other legal
tasks. Third, we show that domain pretraining may be warranted when the task
exhibits sufficient similarity to the pretraining corpus: the level of
performance increase in three legal tasks was directly tied to the domain
specificity of the task. Our findings inform when researchers should engage in
resource-intensive pretraining and show that Transformer-based architectures,
too, learn embeddings suggestive of distinct legal language.
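To make the setup concrete, below is a minimal sketch of how a CaseHOLD-style example (a citing context plus five candidate holdings) could be fine-tuned as a multiple-choice task with a BERT-style encoder via the Hugging Face Transformers library. The checkpoint name, the toy example, and the label are placeholders rather than the paper's released artifacts; in the paper's comparison, one would swap in the general-corpus BERT or the domain-pretrained legal model with its custom vocabulary.

```python
# Minimal sketch (not the paper's released code): fine-tuning a BERT-style
# encoder on one CaseHOLD-style multiple-choice example with Hugging Face
# Transformers. The checkpoint and the toy example below are placeholders.
import torch
from transformers import AutoTokenizer, AutoModelForMultipleChoice

# Assumption: a general-purpose checkpoint; substitute a domain-pretrained
# legal model to reproduce the paper's general-vs-domain comparison.
MODEL_NAME = "bert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMultipleChoice.from_pretrained(MODEL_NAME)

# One hypothetical example: the citing context and five candidate holdings.
context = "Drawing on Smith v. Jones, the court reasoned that <HOLDING> ..."
holdings = [
    "holding that the contract was void for lack of consideration",
    "holding that the statute of limitations had run",
    "holding that the defendant waived the argument on appeal",
    "holding that summary judgment was improper on this record",
    "holding that the evidence was inadmissible hearsay",
]
label = torch.tensor([3])  # index of the correct holding (illustrative)

# Pair the context with each candidate, then add a batch dimension so inputs
# have shape (batch_size=1, num_choices=5, seq_len), as the multiple-choice
# head expects.
enc = tokenizer([context] * len(holdings), holdings,
                truncation=True, padding=True, return_tensors="pt")
inputs = {k: v.unsqueeze(0) for k, v in enc.items()}

outputs = model(**inputs, labels=label)
outputs.loss.backward()  # one gradient step; wrap in an optimizer loop to fine-tune
print("predicted holding index:", outputs.logits.argmax(dim=-1).item())
```

Under this framing, the gains reported in the abstract amount to running the same fine-tuning loop with different pretrained weights and comparing F1 on held-out questions.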
Related papers
- TransformLLM: Adapting Large Language Models via LLM-Transformed Reading Comprehension Text [5.523385345486362]
We have developed language models specifically designed for legal applications.
Our innovative approach significantly improves capabilities in legal tasks by using Large Language Models (LLMs) to convert raw training data into reading comprehension text.
arXiv Detail & Related papers (2024-10-28T19:32:18Z)
- LawLLM: Law Large Language Model for the US Legal System [43.13850456765944]
We introduce the Law Large Language Model (LawLLM), a multi-task model specifically designed for the US legal domain.
LawLLM excels at Similar Case Retrieval (SCR), Precedent Case Recommendation (PCR), and Legal Judgment Prediction (LJP).
We propose customized data preprocessing techniques for each task that transform raw legal data into a trainable format.
arXiv Detail & Related papers (2024-07-27T21:51:30Z)
- InternLM-Law: An Open Source Chinese Legal Large Language Model [72.2589401309848]
InternLM-Law is a specialized LLM tailored for addressing diverse legal queries related to Chinese laws.
We meticulously construct a dataset in the Chinese legal domain, encompassing over 1 million queries.
InternLM-Law achieves the highest average performance on LawBench, outperforming state-of-the-art models, including GPT-4, on 13 out of 20 subtasks.
arXiv Detail & Related papers (2024-06-21T06:19:03Z)
- Precedent-Enhanced Legal Judgment Prediction with LLM and Domain-Model Collaboration [52.57055162778548]
Legal Judgment Prediction (LJP) has become an increasingly crucial task in Legal AI.
Precedents are previous legal cases with similar facts, which serve as the basis for the judgment of subsequent cases in national legal systems.
Recent advances in deep learning have enabled a variety of techniques to be used to solve the LJP task.
arXiv Detail & Related papers (2023-10-13T16:47:20Z)
- SILO Language Models: Isolating Legal Risk In a Nonparametric Datastore [159.21914121143885]
We present SILO, a new language model that manages this risk-performance tradeoff during inference.
SILO is built by (1) training a parametric LM on the Open License Corpus (OLC), a new corpus we curate with 228B tokens of public domain and permissively licensed text, and (2) pairing it with a nonparametric datastore that is queried only at inference time.
Access to the datastore greatly improves out-of-domain performance, closing 90% of the performance gap with an LM trained on the Pile.
arXiv Detail & Related papers (2023-08-08T17:58:15Z)
- Understanding In-Context Learning via Supportive Pretraining Data [55.648777340129364]
In-context learning (ICL) improves language models' performance on a variety of NLP tasks by simply demonstrating a handful of examples at inference time.
It is not well understood why ICL ability emerges, as the model has never been specifically trained on such demonstrations.
Our work takes a first step towards understanding ICL via analyzing instance-level pretraining data.
arXiv Detail & Related papers (2023-06-26T22:14:04Z)
- Automated Refugee Case Analysis: An NLP Pipeline for Supporting Legal Practitioners [0.0]
We introduce an end-to-end pipeline for retrieving, processing, and extracting targeted information from legal cases.
We investigate an under-studied legal domain with a case study on refugee law in Canada.
arXiv Detail & Related papers (2023-05-24T19:37:23Z)
- SAILER: Structure-aware Pre-trained Language Model for Legal Case Retrieval [75.05173891207214]
Legal case retrieval plays a core role in the intelligent legal system.
Most existing language models have difficulty understanding the long-distance dependencies between different structures.
We propose a new Structure-Aware pre-traIned language model for LEgal case Retrieval.
arXiv Detail & Related papers (2023-04-22T10:47:01Z)
- Pre-trained Language Models for the Legal Domain: A Case Study on Indian Law [7.366081387295463]
We re-train two popular legal PLMs, LegalBERT and CaseLawBERT, on Indian legal data, as well as train a model from scratch with a vocabulary based on Indian legal text.
We observe that our approach not only enhances performance on the new domain (Indian texts) but also on the original domain (European and UK texts).
arXiv Detail & Related papers (2022-09-13T15:01:11Z)
- Legal Transformer Models May Not Always Help [3.6061626009104057]
This work investigates the value of domain adaptive pre-training and language adapters in legal NLP tasks.
We show that domain adaptive pre-training is only helpful with low-resource downstream tasks.
As an additional result, we release LegalRoBERTa, a RoBERTa model further pre-trained on legal corpora.
arXiv Detail & Related papers (2021-09-14T17:53:55Z)
- Don't Stop Pretraining: Adapt Language Models to Domains and Tasks [81.99843216550306]
We present a study across four domains (biomedical and computer science publications, news, and reviews) and eight classification tasks.
A second phase of pretraining in-domain (domain-adaptive pretraining) leads to performance gains.
Adapting to the task's unlabeled data (task-adaptive pretraining) improves performance even after domain-adaptive pretraining.
arXiv Detail & Related papers (2020-04-23T04:21:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.