Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset
- URL: http://arxiv.org/abs/2207.00220v2
- Date: Tue, 29 Nov 2022 08:59:40 GMT
- Title: Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset
- Authors: Peter Henderson, Mark S. Krass, Lucia Zheng, Neel Guha, Christopher D. Manning, Dan Jurafsky, Daniel E. Ho
- Abstract summary: We offer an approach to filtering grounded in law, a field that has directly addressed the tradeoffs involved in filtering material.
First, we gather and make available the Pile of Law, a 256GB dataset of open-source English-language legal and administrative data.
Second, we distill the legal norms that governments have developed to constrain the inclusion of toxic or private content into actionable lessons.
Third, we show how the Pile of Law offers researchers the opportunity to learn such filtering rules directly from the data.
- Score: 46.156169284961045
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: One concern with the rise of large language models lies with their potential
for significant harm, particularly from pretraining on biased, obscene,
copyrighted, and private information. Emerging ethical approaches have
attempted to filter pretraining material, but such approaches have been ad hoc
and failed to take context into account. We offer an approach to filtering
grounded in law, which has directly addressed the tradeoffs in filtering
material. First, we gather and make available the Pile of Law, a 256GB (and
growing) dataset of open-source English-language legal and administrative data,
covering court opinions, contracts, administrative rules, and legislative
records. Pretraining on the Pile of Law may help with legal tasks that have the
promise to improve access to justice. Second, we distill the legal norms that
governments have developed to constrain the inclusion of toxic or private
content into actionable lessons for researchers and discuss how our dataset
reflects these norms. Third, we show how the Pile of Law offers researchers the
opportunity to learn such filtering rules directly from the data, providing an
exciting new research direction in model-based processing.
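As a concrete illustration of the third contribution, here is a minimal sketch of learning a filtering rule directly from the data. It assumes the Hugging Face mirror published as pile-of-law/pile-of-law with an "r_legaladvice" configuration and a "text" field, and it uses source-side pseudonymization markers as weak labels; these specifics are assumptions for the sketch, not prescribed by the paper.

```python
# Sketch: weakly supervised "learn-to-filter" on a Pile of Law subset.
# Assumptions: HF dataset "pile-of-law/pile-of-law" with config
# "r_legaladvice" and a "text" field; pseudonym markers as weak labels.
import re
from itertools import islice

from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Weak label: the source itself chose to pseudonymize the parties.
PSEUDONYM = re.compile(r"\b(?:John|Jane)\s+Doe\b")

# Stream a small sample so the sketch runs without the full 256GB.
stream = load_dataset("pile-of-law/pile-of-law", "r_legaladvice",
                      split="train", streaming=True)
docs = [ex["text"] for ex in islice(stream, 2000)]
labels = [1 if PSEUDONYM.search(d) else 0 for d in docs]

# Learn a surface-level proxy for the sanitization decision
# (assumes both classes occur in the sampled documents).
vec = TfidfVectorizer(max_features=20_000, stop_words="english")
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(docs), labels)

def sanitization_score(text: str) -> float:
    """Probability that text resembles content sources pseudonymize."""
    return float(clf.predict_proba(vec.transform([text]))[0, 1])
```

A high score flags text of the kind that courts and moderators tend to sanitize before publication, which is the sense in which a filtering rule is "learned from the data" here.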
Related papers
- LegiLM: A Fine-Tuned Legal Language Model for Data Compliance [5.256747140296861]
LegiLM is a novel legal language model specifically tailored for consulting on data or information compliance.
It has been fine-tuned to automatically assess whether particular actions or events breach data security and privacy regulations.
LegiLM excels in detecting data regulation breaches, offering sound legal justifications, and recommending necessary compliance modifications.
arXiv Detail & Related papers (2024-09-09T02:06:52Z)
- LawLLM: Law Large Language Model for the US Legal System [43.13850456765944]
We introduce the Law Large Language Model (LawLLM), a multi-task model specifically designed for the US legal domain.
LawLLM excels at Similar Case Retrieval (SCR), Precedent Case Recommendation (PCR), and Legal Judgment Prediction (LJP)
We propose customized data preprocessing techniques for each task that transform raw legal data into a trainable format.
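To make the preprocessing idea concrete, here is a hypothetical sketch of turning a raw case record into an instruction-style example for Legal Judgment Prediction; the field names and prompt template are illustrative, not taken from the LawLLM paper.

```python
# Hypothetical sketch of task-specific preprocessing: rendering a raw
# case record as a prompt/target pair for supervised fine-tuning.
from dataclasses import dataclass

@dataclass
class RawCase:
    facts: str       # fact statement extracted from the opinion
    holding: str     # court's ultimate disposition

def to_ljp_example(case: RawCase) -> dict:
    """Render one case as a trainable prompt/target pair."""
    prompt = (
        "You are a legal expert. Given the facts below, predict the "
        "court's judgment.\n\nFacts:\n" + case.facts + "\n\nJudgment:"
    )
    return {"prompt": prompt, "target": " " + case.holding}

example = to_ljp_example(RawCase(
    facts="Plaintiff slipped on an unmarked wet floor in defendant's store.",
    holding="Judgment for the plaintiff; negligence established.",
))
```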
arXiv Detail & Related papers (2024-07-27T21:51:30Z)
- It Cannot Be Right If It Was Written by AI: On Lawyers' Preferences of Documents Perceived as Authored by an LLM vs a Human [0.6827423171182154]
Large Language Models (LLMs) enable a future in which certain types of legal documents may be generated automatically.
This study provides a needed analysis of the ongoing transition toward mature generative AI systems.
Our analysis revealed a clear preference for documents perceived as crafted by a human over those believed to be generated by AI.
arXiv Detail & Related papers (2024-07-09T12:11:25Z)
- InternLM-Law: An Open Source Chinese Legal Large Language Model [72.2589401309848]
InternLM-Law is a specialized LLM tailored for addressing diverse legal queries related to Chinese laws.
We meticulously construct a dataset in the Chinese legal domain, encompassing over 1 million queries.
InternLM-Law achieves the highest average performance on LawBench, outperforming state-of-the-art models, including GPT-4, on 13 out of 20 subtasks.
arXiv Detail & Related papers (2024-06-21T06:19:03Z)
- Precedent-Enhanced Legal Judgment Prediction with LLM and Domain-Model Collaboration [52.57055162778548]
Legal Judgment Prediction (LJP) has become an increasingly crucial task in Legal AI.
Precedents are previous legal cases with similar facts; they form the basis for judging subsequent cases in many national legal systems.
Recent advances in deep learning have enabled a variety of techniques to be used to solve the LJP task.
arXiv Detail & Related papers (2023-10-13T16:47:20Z)
- SILO Language Models: Isolating Legal Risk In a Nonparametric Datastore [159.21914121143885]
We present SILO, a new language model that manages this risk-performance tradeoff during inference.
SILO is built by (1) training a parametric LM on the Open License Corpus (OLC), a newly curated corpus of 228B tokens of public-domain and permissively licensed text, and (2) pairing it with a nonparametric datastore of higher-risk text that is queried only at inference time.
Access to the datastore greatly improves out-of-domain performance, closing 90% of the performance gap with an LM trained on the Pile.
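The risk-performance mechanism can be sketched as a kNN-LM-style interpolation: a parametric model trained only on low-risk text, mixed at inference with a nearest-neighbor distribution from a datastore built over higher-risk text. The toy arrays and interpolation weight below are stand-ins, not SILO's actual components.

```python
# Toy kNN-LM-style interpolation: parametric LM probs mixed with a
# nonparametric datastore of (context-vector, next-token) pairs.
import numpy as np

rng = np.random.default_rng(0)
V, D, N = 100, 16, 1000           # vocab size, hidden dim, datastore size

keys = rng.normal(size=(N, D))    # context embeddings from datastore text
values = rng.integers(0, V, N)    # token that followed each stored context

def knn_lm_probs(hidden, p_lm, k=8, lam=0.25, temp=1.0):
    """Interpolate parametric probs with a k-NN datastore distribution."""
    dists = np.linalg.norm(keys - hidden, axis=1)   # brute-force L2 (toy)
    nn = np.argsort(dists)[:k]                      # k nearest contexts
    weights = np.exp(-dists[nn] / temp)
    weights /= weights.sum()
    p_knn = np.zeros(V)
    np.add.at(p_knn, values[nn], weights)           # scatter NN weights
    return lam * p_knn + (1 - lam) * p_lm           # mixture distribution

hidden = rng.normal(size=D)        # current context embedding
p_lm = np.full(V, 1.0 / V)         # parametric LM next-token distribution
p = knn_lm_probs(hidden, p_lm)
```

Because the datastore is consulted only at inference, higher-risk text can be added or removed without retraining the parametric model, which is the tradeoff the entry describes.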
arXiv Detail & Related papers (2023-08-08T17:58:15Z)
- Should I disclose my dataset? Caveats between reproducibility and individual data rights [5.816090284071069]
The digital availability of court documents expands possibilities for researchers.
However, personal data protection laws impose restrictions on data exposure.
We present legal and ethical considerations on the issue, as well as guidelines for researchers.
arXiv Detail & Related papers (2022-11-01T14:42:11Z)
- Having your Privacy Cake and Eating it Too: Platform-supported Auditing of Social Media Algorithms for Public Interest [70.02478301291264]
Social media platforms curate access to information and opportunities, and so play a critical role in shaping public discourse.
Prior studies have used black-box methods to show that these algorithms can lead to biased or discriminatory outcomes.
We propose a new method for platform-supported auditing that can meet the goals of the proposed legislation.
arXiv Detail & Related papers (2022-07-18T17:32:35Z)
- Lawformer: A Pre-trained Language Model for Chinese Legal Long Documents [56.40163943394202]
We release Lawformer, a Longformer-based pre-trained language model for understanding long Chinese legal documents.
We evaluate Lawformer on a variety of LegalAI tasks, including judgment prediction, similar case retrieval, legal reading comprehension, and legal question answering.
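A minimal usage sketch, assuming the public checkpoint thunlp/Lawformer loads through the generic Auto classes (verify against the model card before relying on this): it encodes one long document and takes the first-position vector as a document representation for downstream tasks such as similar case retrieval.

```python
# Sketch: encoding a long Chinese legal document with Lawformer.
# Assumption: the checkpoint "thunlp/Lawformer" works with Auto classes.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("thunlp/Lawformer")
model = AutoModel.from_pretrained("thunlp/Lawformer")

document = "原告诉称……"  # a long judgment text (truncated here)
inputs = tokenizer(document, return_tensors="pt",
                   truncation=True, max_length=4096)
with torch.no_grad():
    outputs = model(**inputs)

# First-position representation, usable as input to retrieval or
# judgment-prediction heads like those in the paper's evaluations.
doc_vec = outputs.last_hidden_state[:, 0]
```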
arXiv Detail & Related papers (2021-05-09T09:39:25Z)
- A Legal Approach to Hate Speech: Operationalizing the EU's Legal Framework against the Expression of Hatred as an NLP Task [2.248133901806859]
We propose a 'legal approach' to hate speech detection that operationalizes the decision as to whether a post is punishable under criminal law.
We show that, by breaking the legal assessment down into a series of simpler sub-decisions, even laypersons can annotate consistently; a toy sketch of this decomposition follows this entry.
arXiv Detail & Related papers (2020-04-07T14:13:38Z)
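The decomposition from the entry above can be sketched as a conjunction of simpler sub-decisions mirroring the elements of the offence; the specific sub-questions below are illustrative, not the paper's actual annotation scheme.

```python
# Toy sketch: one holistic "punishable hate speech?" label replaced by
# simpler sub-questions whose conjunction mirrors the legal test.
from dataclasses import dataclass

@dataclass
class SubDecisions:
    targets_protected_group: bool   # directed at a protected group?
    incites_or_degrades: bool       # incites hatred or attacks dignity?
    likely_to_disturb_peace: bool   # capable of disturbing public peace?

def punishable(d: SubDecisions) -> bool:
    """All offence elements must be present, as in the legal assessment."""
    return (d.targets_protected_group
            and d.incites_or_degrades
            and d.likely_to_disturb_peace)

print(punishable(SubDecisions(True, True, False)))  # -> False
```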