Learning Bill Similarity with Annotated and Augmented Corpora of Bills
- URL: http://arxiv.org/abs/2109.06527v1
- Date: Tue, 14 Sep 2021 08:50:06 GMT
- Title: Learning Bill Similarity with Annotated and Augmented Corpora of Bills
- Authors: Jiseon Kim, Elden Griggs, In Song Kim, Alice Oh
- Abstract summary: We construct a human-labeled dataset of 4,721 bill-to-bill relationships at the subsection-level.
We generate synthetic data with varying degrees of similarity, mimicking the complex bill writing process.
We apply our trained model to infer section- and bill-level similarities.
- Score: 9.910141281434319
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Bill writing is a critical element of representative democracy. However, it
is often overlooked that most legislative bills are derived, or even directly
copied, from other bills. Despite the significance of bill-to-bill linkages for
understanding the legislative process, existing approaches fail to address
semantic similarities across bills, let alone reordering or paraphrasing, which
are prevalent in legal document writing. In this paper, we overcome these
limitations by proposing a 5-class classification task that closely reflects
the nature of the bill generation process. In doing so, we construct a
human-labeled dataset of 4,721 bill-to-bill relationships at the
subsection-level and release this annotated dataset to the research community.
To augment the dataset, we generate synthetic data with varying degrees of
similarity, mimicking the complex bill writing process. We use BERT variants
and apply multi-stage training, sequentially fine-tuning our models with
synthetic and human-labeled datasets. We find that the predictive performance
significantly improves when training with both human-labeled and synthetic
data. Finally, we apply our trained model to infer section- and bill-level
similarities. Our analysis shows that the proposed methodology successfully
captures the similarities across legal documents at various levels of
aggregation.
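The abstract says the subsection-level model is applied to infer section- and bill-level similarities, but it does not spell out the aggregation rule. The following is a minimal sketch of one plausible way to roll subsection-level class predictions up to a coarser level. The function names, the mapping of the 5 classes onto a 0 to 1 score, and the best-match-then-average rule are all illustrative assumptions, not the authors' method.

```python
from statistics import mean
from typing import Sequence


def class_to_score(label: int, num_classes: int = 5) -> float:
    """Map a class label to a similarity score in [0, 1].

    Assumption: labels are ordinal, with 0 meaning unrelated and
    num_classes - 1 meaning identical. The paper's class definitions
    may differ.
    """
    if not 0 <= label < num_classes:
        raise ValueError(f"label {label} outside 0..{num_classes - 1}")
    return label / (num_classes - 1)


def aggregate_similarity(pair_labels: Sequence[Sequence[int]]) -> float:
    """Aggregate subsection-level predictions into one similarity score.

    pair_labels[i][j] is the predicted class for subsection i of
    document A against subsection j of document B. For each subsection
    of A we keep its best match in B, then average over A's subsections.
    (Hypothetical rule, chosen only for illustration.)
    """
    if not pair_labels or any(len(row) == 0 for row in pair_labels):
        return 0.0
    best_matches = [max(class_to_score(label) for label in row)
                    for row in pair_labels]
    return mean(best_matches)


# Three subsections in A compared against two subsections in B.
predicted = [
    [4, 0],  # first subsection of A is identical to one subsection of B
    [1, 2],  # second subsection of A is moderately similar at best
    [0, 0],  # third subsection of A has no counterpart in B
]
similarity = aggregate_similarity(predicted)
print(similarity)
```

The same function could be applied twice, first over subsections to score sections and then over sections to score bills. Note that this rule is asymmetric: swapping A and B can change the score, which may or may not be desirable when one bill is derived from another.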
Related papers
- Improving the Accuracy and Efficiency of Legal Document Tagging with Large Language Models and Instruction Prompts [0.6554326244334866]
Legal-LLM is a novel approach that leverages the instruction-following capabilities of Large Language Models (LLMs) through fine-tuning.
We evaluate our method on two benchmark datasets, POSTURE50K and EURLEX57K, using micro-F1 and macro-F1 scores.
arXiv Detail & Related papers (2025-04-12T18:57:04Z)
- MUSER: A Multi-View Similar Case Retrieval Dataset [65.36779942237357]
Similar case retrieval (SCR) is a representative legal AI application that plays a pivotal role in promoting judicial fairness.
Existing SCR datasets only focus on the fact description section when judging the similarity between cases.
We present MUSER, a similar case retrieval dataset based on multi-view similarity measurement and comprehensive legal elements, with sentence-level legal element annotations.
arXiv Detail & Related papers (2023-10-24T08:17:11Z)
- Rhetorical Role Labeling of Legal Documents using Transformers and Graph Neural Networks [1.290382979353427]
This paper presents the approaches undertaken to perform the task of rhetorical role labelling on Indian Court Judgements as part of SemEval Task 6: understanding legal texts, shared subtask A.
arXiv Detail & Related papers (2023-05-06T17:04:51Z)
- DeepParliament: A Legal domain Benchmark & Dataset for Parliament Bills Prediction [0.0]
This paper introduces DeepParliament, a legal domain Benchmark dataset that gathers bill documents and metadata.
We propose two new benchmarks: Binary and Multi-Class Bill Status classification.
This work will be the first to present a Parliament bill prediction task.
arXiv Detail & Related papers (2022-11-15T04:55:32Z)
- A Zipf's Law-based Text Generation Approach for Addressing Imbalance in Entity Extraction [19.55959053873699]
This paper proposes a novel approach that views the issue through quantitative information.
It recognizes that some entities exhibit a certain level of commonality while others are scarce, which can be reflected in the quantifiable distribution of words.
Zipf's Law emerges as a well-suited tool, and to transition from words to entities, words within the documents are classified as common or rare.
arXiv Detail & Related papers (2022-05-25T10:22:14Z)
- An Evaluation Framework for Legal Document Summarization [1.9709122688953327]
A law practitioner has to go through numerous lengthy legal case proceedings for their practices of various categories, such as land dispute, corruption, etc.
It is important to summarize these documents, and ensure that summaries contain phrases with intent matching the category of the case.
We propose an automated intent-based summarization metric, which shows a better agreement with human evaluation as compared to other automated metrics like BLEU, ROUGE-L etc.
arXiv Detail & Related papers (2022-05-17T16:42:03Z)
- Distant finetuning with discourse relations for stance classification [55.131676584455306]
We propose a new method to extract data with silver labels from raw text to finetune a model for stance classification.
We also propose a 3-stage training framework where the noisy level in the data used for finetuning decreases over different stages.
Our approach ranks 1st among 26 competing teams in the stance classification track of the NLPCC 2021 shared task Argumentative Text Understanding for AI Debater.
arXiv Detail & Related papers (2022-04-27T04:24:35Z)
- GERE: Generative Evidence Retrieval for Fact Verification [57.78768817972026]
We propose GERE, the first system that retrieves evidences in a generative fashion.
The experimental results on the FEVER dataset show that GERE achieves significant improvements over the state-of-the-art baselines.
arXiv Detail & Related papers (2022-04-12T03:49:35Z)
- Long Document Summarization with Top-down and Bottom-up Inference [113.29319668246407]
We propose a principled inference framework to improve summarization models on two aspects.
Our framework assumes a hierarchical latent structure of a document where the top-level captures the long range dependency.
We demonstrate the effectiveness of the proposed framework on a diverse set of summarization datasets.
arXiv Detail & Related papers (2022-03-15T01:24:51Z)
- Nutribullets Hybrid: Multi-document Health Summarization [36.95954983680022]
We present a method for generating comparative summaries that highlights similarities and contradictions in input documents.
Our framework leads to more faithful, relevant and aggregation-sensitive summarization -- while being equally fluent.
arXiv Detail & Related papers (2021-04-08T01:44:29Z)
- Representative & Fair Synthetic Data [68.8204255655161]
We present a framework to incorporate fairness constraints into the self-supervised learning process.
We generate a representative as well as fair version of the UCI Adult census data set.
We consider representative & fair synthetic data a promising future building block to teach algorithms not on historic worlds, but rather on the worlds that we strive to live in.
arXiv Detail & Related papers (2021-04-07T09:19:46Z)
- Extractive Summarization as Text Matching [123.09816729675838]
This paper creates a paradigm shift with regard to the way we build neural extractive summarization systems.
We formulate the extractive summarization task as a semantic text matching problem.
We have driven the state-of-the-art extractive result on CNN/DailyMail to a new level (44.41 in ROUGE-1)
arXiv Detail & Related papers (2020-04-19T08:27:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.