Transformer-Based Extraction of Statutory Definitions from the U.S. Code
- URL: http://arxiv.org/abs/2504.16353v1
- Date: Wed, 23 Apr 2025 02:09:53 GMT
- Title: Transformer-Based Extraction of Statutory Definitions from the U.S. Code
- Authors: Arpana Hosabettu, Harsh Shah
- Abstract summary: We present an advanced NLP system to automatically extract defined terms, their definitions, and their scope from the United States Code (U.S.C.). Our best model achieves 96.8% precision and 98.9% recall (98.2% F1-score). This work contributes to improving accessibility and understanding of legal information while establishing a foundation for downstream legal reasoning tasks.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automatic extraction of definitions from legal texts is critical for enhancing the comprehension and clarity of complex legal corpora such as the United States Code (U.S.C.). We present an advanced NLP system leveraging transformer-based architectures to automatically extract defined terms, their definitions, and their scope from the U.S.C. We address the challenges of automatically identifying legal definitions, extracting defined terms, and determining their scope within this complex corpus of over 200,000 pages of federal statutory law. Building upon previous feature-based machine learning methods, our updated model employs domain-specific transformers (Legal-BERT) fine-tuned specifically for statutory texts, significantly improving extraction accuracy. Our work implements a multi-stage pipeline that combines document structure analysis with state-of-the-art language models to process legal text from the XML version of the U.S. Code. Each paragraph is first classified using a fine-tuned legal domain BERT model to determine if it contains a definition. Our system then aggregates related paragraphs into coherent definitional units and applies a combination of attention mechanisms and rule-based patterns to extract defined terms and their jurisdictional scope. The definition extraction system is evaluated on multiple titles of the U.S. Code containing thousands of definitions, demonstrating significant improvements over previous approaches. Our best model achieves 96.8% precision and 98.9% recall (98.2% F1-score), substantially outperforming traditional machine learning classifiers. This work contributes to improving accessibility and understanding of legal information while establishing a foundation for downstream legal reasoning tasks.
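The multi-stage pipeline described above (paragraph classification, aggregation into definitional units, then term and scope extraction via attention mechanisms combined with rule-based patterns) can be sketched in miniature. This is a minimal, self-contained illustration, not the authors' implementation: the fine-tuned Legal-BERT classifier is stood in for by a hypothetical keyword heuristic, and the regex patterns are illustrative approximations of the rule-based component.

```python
import re

# Cue phrases common in statutory definitions; a stand-in for the
# fine-tuned Legal-BERT paragraph classifier used in the paper.
DEFINITION_CUE = re.compile(r"\bmeans\b|\bhas the meaning\b|\bincludes\b")

# Statutory definitions typically quote the defined term, e.g.:
#   The term "motor vehicle" means any vehicle ...
TERM_PATTERN = re.compile(
    r'[\u201c"](?P<term>[^\u201d"]+)[\u201d"]\s+(?:means|includes|has the meaning)'
)

# Scope markers such as "For purposes of this chapter, ..."
SCOPE_PATTERN = re.compile(
    r"^(?:As used in|For purposes of)\s+(?P<scope>[^,]+)", re.IGNORECASE
)

def is_definition(paragraph: str) -> bool:
    """Stand-in for the Legal-BERT paragraph classification step."""
    return bool(DEFINITION_CUE.search(paragraph))

def extract_definition(paragraph: str):
    """Apply rule-based patterns to pull the defined term and its scope."""
    if not is_definition(paragraph):
        return None
    term = TERM_PATTERN.search(paragraph)
    scope = SCOPE_PATTERN.search(paragraph)
    return {
        "term": term.group("term") if term else None,
        "scope": scope.group("scope").strip() if scope else None,
        "definition": paragraph,
    }

example = (
    'For purposes of this chapter, the term "motor vehicle" means '
    "any vehicle driven or drawn by mechanical power."
)
print(extract_definition(example))
```

In the paper's actual system, the `is_definition` heuristic is replaced by a fine-tuned transformer classifier and the extraction step also uses model attention, which is what lifts performance well beyond what hand-written patterns alone achieve.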
Related papers
- Computational Identification of Regulatory Statements in EU Legislation [0.0]
We provide a specific definition of what constitutes a regulatory statement, based on the institutional grammar tool.
A computational method is valuable for scaling the identification of such statements across a growing body of EU legislation.
arXiv Detail & Related papers (2025-05-01T12:11:32Z)
- LATex: Leveraging Attribute-based Text Knowledge for Aerial-Ground Person Re-Identification [63.07563443280147]
We propose a novel framework named LATex for AG-ReID. It adopts prompt-tuning strategies to leverage attribute-based text knowledge. Our framework can fully leverage attribute-based text knowledge to improve AG-ReID performance.
arXiv Detail & Related papers (2025-03-31T04:47:05Z)
- CRAT: A Multi-Agent Framework for Causality-Enhanced Reflective and Retrieval-Augmented Translation with Large Language Models [59.8529196670565]
CRAT is a novel multi-agent translation framework that leverages RAG and causality-enhanced self-reflection to address translation challenges.
Our results show that CRAT significantly improves translation accuracy, particularly in handling context-sensitive terms and emerging vocabulary.
arXiv Detail & Related papers (2024-10-28T14:29:11Z)
- Improving Legal Entity Recognition Using a Hybrid Transformer Model and Semantic Filtering Approach [0.0]
This paper proposes a novel hybrid model that enhances the accuracy and precision of Legal-BERT, a transformer model fine-tuned for legal text processing.
We evaluate the model on a dataset of 15,000 annotated legal documents, achieving an F1 score of 93.4%, demonstrating significant improvements in precision and recall over previous methods.
arXiv Detail & Related papers (2024-10-11T04:51:28Z)
- Explainable machine learning multi-label classification of Spanish legal judgements [6.817247544942709]
We propose a hybrid system that applies Machine Learning for multi-label classification of judgements (sentences) and visual and natural language descriptions for explanation purposes.
Our solution achieves over 85% micro precision on a labelled data set annotated by legal experts.
arXiv Detail & Related papers (2024-05-27T19:16:42Z)
- LegalPro-BERT: Classification of Legal Provisions by fine-tuning BERT Large Language Model [0.0]
Contract analysis requires the identification and classification of key provisions and paragraphs within an agreement.
LegalPro-BERT is a BERT transformer architecture model that we fine-tune to efficiently handle the classification task for legal provisions.
arXiv Detail & Related papers (2024-04-15T19:08:48Z)
- Automatic explanation of the classification of Spanish legal judgments in jurisdiction-dependent law categories with tree estimators [6.354358255072839]
This work contributes with a system combining Natural Language Processing (NLP) with Machine Learning (ML) to classify legal texts in an explainable manner.
We analyze the features involved in the decision and the threshold bifurcation values of the decision paths of tree structures.
Legal experts have validated our solution, and this knowledge has also been incorporated into the explanation process as "expert-in-the-loop" dictionaries.
arXiv Detail & Related papers (2024-03-30T17:59:43Z)
- DELTA: Pre-train a Discriminative Encoder for Legal Case Retrieval via Structural Word Alignment [55.91429725404988]
We introduce DELTA, a discriminative model designed for legal case retrieval.
We leverage shallow decoders to create information bottlenecks, aiming to enhance the representation ability.
Our approach can outperform existing state-of-the-art methods in legal case retrieval.
arXiv Detail & Related papers (2024-03-27T10:40:14Z)
- SAILER: Structure-aware Pre-trained Language Model for Legal Case Retrieval [75.05173891207214]
Legal case retrieval plays a core role in the intelligent legal system.
Most existing language models have difficulty understanding the long-distance dependencies between different structures.
We propose a new Structure-Aware pre-traIned language model for LEgal case Retrieval.
arXiv Detail & Related papers (2023-04-22T10:47:01Z)
- CsFEVER and CTKFacts: Czech Datasets for Fact Verification [0.0]
We present two Czech datasets aimed at training automated fact-checking machine learning models.
The first dataset, CsFEVER, comprises approximately 112k claims and is an automatically generated Czech version of the well-known Wikipedia-based FEVER dataset.
The second dataset, CTKFacts, contains 3,097 claims and is built on a corpus of approximately two million Czech News Agency news reports.
arXiv Detail & Related papers (2022-01-26T18:48:42Z)
- Automatic Extraction of Rules Governing Morphological Agreement [103.78033184221373]
We develop an automated framework for extracting a first-pass grammatical specification from raw text.
We focus on extracting rules describing agreement, a morphosyntactic phenomenon at the core of the grammars of many of the world's languages.
We apply our framework to all languages included in the Universal Dependencies project, with promising results.
arXiv Detail & Related papers (2020-10-02T18:31:45Z)
- On the Limitations of Cross-lingual Encoders as Exposed by Reference-Free Machine Translation Evaluation [55.02832094101173]
Evaluation of cross-lingual encoders is usually performed either via zero-shot cross-lingual transfer in supervised downstream tasks or via unsupervised cross-lingual similarity.
This paper concerns itself with reference-free machine translation (MT) evaluation, where we directly compare source texts to (sometimes low-quality) system translations.
We systematically investigate a range of metrics based on state-of-the-art cross-lingual semantic representations obtained with pretrained M-BERT and LASER.
We find that they perform poorly as semantic encoders for reference-free MT evaluation and identify their two key limitations.
arXiv Detail & Related papers (2020-05-03T22:10:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.