LegalCore: A Dataset for Legal Documents Event Coreference Resolution
- URL: http://arxiv.org/abs/2502.12509v1
- Date: Tue, 18 Feb 2025 03:47:53 GMT
- Title: LegalCore: A Dataset for Legal Documents Event Coreference Resolution
- Authors: Kangda Wei, Xi Shi, Jonathan Tong, Sai Ramana Reddy, Anandhavelu Natarajan, Rajiv Jain, Aparna Garimella, Ruihong Huang
- Abstract summary: We present the first dataset for the legal domain, LegalCore, annotated with comprehensive event and event coreference information.
The legal contract documents we annotated in this dataset are several times longer than news articles, with an average length of around 25k tokens per document.
We benchmark mainstream Large Language Models on this dataset for both event detection and event coreference resolution tasks.
- Score: 21.113915852038552
- Abstract: Recognizing events and their coreferential mentions in a document is essential for understanding semantic meanings of text. The existing research on event coreference resolution is mostly limited to news articles. In this paper, we present the first dataset for the legal domain, LegalCore, which has been annotated with comprehensive event and event coreference information. The legal contract documents we annotated in this dataset are several times longer than news articles, with an average length of around 25k tokens per document. The annotations show that legal documents have dense event mentions and feature both short-distance and super long-distance coreference links between event mentions. We further benchmark mainstream Large Language Models (LLMs) on this dataset for both event detection and event coreference resolution tasks, and find that this dataset poses significant challenges for state-of-the-art open-source and proprietary LLMs, which perform significantly worse than a supervised baseline. We will publish the dataset as well as the code.
Related papers
- Enhancing Cross-Document Event Coreference Resolution by Discourse Structure and Semantic Information [33.21818213257603]
Cross-document event coreference resolution models can only compute mention similarity directly or enhance mention representation by extracting event arguments.
We propose the construction of document-level Rhetorical Structure Theory (RST) trees and cross-document Lexical Chains to model the structural and semantic information of documents.
We have developed a large-scale Chinese cross-document event coreference dataset to fill this gap.
arXiv Detail & Related papers (2023-11-15T16:52:14Z)
- MAVEN-Arg: Completing the Puzzle of All-in-One Event Understanding Dataset with Event Argument Annotation [104.6065882758648]
MAVEN-Arg is the first all-in-one dataset supporting event detection, event argument extraction, and event relation extraction.
As an EAE benchmark, MAVEN-Arg offers three main advantages: (1) a comprehensive schema covering 162 event types and 612 argument roles, all with expert-written definitions and examples; (2) a large data scale, containing 98,591 events and 290,613 arguments obtained with laborious human annotation; and (3) the exhaustive annotation supporting all task variants of EAE.
arXiv Detail & Related papers (2023-11-09T18:57:39Z)
- FAMuS: Frames Across Multiple Sources [74.03795560933612]
FAMuS is a new corpus of Wikipedia passages that report on some event, paired with underlying, genre-diverse (non-Wikipedia) source articles for the same event.
We present results on two key event understanding tasks enabled by FAMuS.
arXiv Detail & Related papers (2023-06-15T08:21:15Z)
- DocumentNet: Bridging the Data Gap in Document Pre-Training [78.01647768018485]
We propose a method to collect massive-scale and weakly labeled data from the web to benefit the training of VDER models.
The collected dataset, named DocumentNet, does not depend on specific document types or entity sets.
Experiments on a set of broadly adopted VDER tasks show significant improvements when DocumentNet is incorporated into the pre-training.
arXiv Detail & Related papers (2023-03-16T05:36:38Z)
- GLEN: General-Purpose Event Detection for Thousands of Types [80.99866527772512]
We build a general-purpose event detection dataset GLEN, which covers 205K event mentions with 3,465 different types.
GLEN is 20x larger in ontology than today's largest event dataset.
We also propose a new multi-stage event detection model CEDAR specifically designed to handle the large size in GLEN.
arXiv Detail & Related papers (2022-10-23T08:21:25Z)
- Cross-document Event Coreference Search: Task, Dataset and Modeling [26.36068336169796]
We propose an appealing, and often more applicable, complementary set up for the task - Cross-document Coreference Search.
To support research on this task, we create a corresponding dataset, which is derived from Wikipedia.
We present a novel model that integrates a powerful coreference scoring scheme into the DPR architecture, yielding improved performance.
arXiv Detail & Related papers (2022-03-16T11:40:02Z)
- LEVEN: A Large-Scale Chinese Legal Event Detection Dataset [82.44096140591675]
We present LEVEN, a large-scale Chinese LEgal eVENt detection dataset, with 8,116 legal documents and 150,977 human-annotated event mentions in 108 event types.
LEVEN is the largest Legal Event Detection dataset and has dozens of times the data scale of others, which shall significantly promote the training and evaluation of LED methods.
arXiv Detail & Related papers (2022-01-29T05:56:35Z)
- Unsupervised Summarization with Customized Granularities [76.26899748972423]
We propose the first unsupervised multi-granularity summarization framework, GranuSum.
By inputting different numbers of events, GranuSum is capable of producing multi-granular summaries in an unsupervised manner.
arXiv Detail & Related papers (2021-09-14T03:57:58Z)
- Cross-document Event Identity via Dense Annotation [9.163142877146512]
We study the identity of textual events from different documents.
We propose a dense annotation approach for cross-document event coreference.
We present an open-access dataset for cross-document event coreference.
arXiv Detail & Related papers (2021-04-11T14:54:35Z)
- WEC: Deriving a Large-scale Cross-document Event Coreference dataset from Wikipedia [14.324743524196874]
We present Wikipedia Event Coreference (WEC), an efficient methodology for gathering a large-scale dataset for cross-document event coreference from Wikipedia.
We apply this methodology to the English Wikipedia and extract our large-scale WEC-Eng dataset.
We develop an algorithm that adapts components of state-of-the-art models for within-document coreference resolution to the cross-document setting.
arXiv Detail & Related papers (2020-05-06T17:20:14Z)
- Seeing the Forest and the Trees: Detection and Cross-Document Coreference Resolution of Militarized Interstate Disputes [3.8073142980733]
I provide a data set for evaluating methods to identify certain political events in text and to link related texts to one another based on shared events.
The data set, Headlines of War, is built on the Militarized Interstate Disputes data set and offers headlines classified by dispute status and headline pairs labeled with coreference indicators.
I introduce a model capable of accomplishing both tasks. The multi-task convolutional neural network is shown to be capable of recognizing events and event coreferences given the headlines' texts and publication dates.
arXiv Detail & Related papers (2020-05-06T17:20:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.