Regulatory Compliance through Doc2Doc Information Retrieval: A case
study in EU/UK legislation where text similarity has limitations
- URL: http://arxiv.org/abs/2101.10726v1
- Date: Tue, 26 Jan 2021 11:38:15 GMT
- Authors: Ilias Chalkidis, Manos Fergadiotis, Nikolaos Manginas, Eva Katakalou
and Prodromos Malakasiotis
- Abstract summary: REG-IR is an application of document-to-document information retrieval.
We show that fine-tuning a BERT model on an in-domain classification task produces the best representations for IR.
We also show that neural re-rankers under-perform due to contradicting supervision, i.e., similar query-document pairs with opposite labels.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Major scandals in corporate history have urged the need for regulatory
compliance, where organizations need to ensure that their controls (processes)
comply with relevant laws, regulations, and policies. However, keeping track of
the constantly changing legislation is difficult, thus organizations are
increasingly adopting Regulatory Technology (RegTech) to facilitate the
process. To this end, we introduce regulatory information retrieval (REG-IR),
an application of document-to-document information retrieval (DOC2DOC IR),
where the query is an entire document making the task more challenging than
traditional IR where the queries are short. Furthermore, we compile and release
two datasets based on the relationships between EU directives and UK
legislation. We experiment on these datasets using a typical two-step pipeline
approach comprising a pre-fetcher and a neural re-ranker. Experimenting with
various pre-fetchers from BM25 to k nearest neighbors over representations from
several BERT models, we show that fine-tuning a BERT model on an in-domain
classification task produces the best representations for IR. We also show that
neural re-rankers under-perform due to contradicting supervision, i.e., similar
query-document pairs with opposite labels. Thus, they are biased towards the
pre-fetcher's score. Interestingly, applying a date filter further improves the
performance, showcasing the importance of the time dimension.
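The pre-fetching step with a date filter described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the whitespace tokenizer, the toy corpus, the `prefetch` helper, the BM25 parameters, and the direction of the date filter (keep only candidates dated no earlier than the query document) are all assumptions made for the example.

```python
import math
from collections import Counter
from datetime import date


def tokenize(text):
    # Naive whitespace tokenizer (an assumption; a real system would do more).
    return text.lower().split()


def bm25_scores(query_tokens, docs_tokens, k1=1.2, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(docs_tokens)
    if n_docs == 0:
        return []
    avg_len = sum(len(d) for d in docs_tokens) / n_docs or 1.0
    # Document frequency: number of documents containing each term.
    df = Counter()
    for d in docs_tokens:
        df.update(set(d))
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        score = 0.0
        # The query is a whole document, so iterate over its unique terms.
        for term in set(query_tokens):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def prefetch(query_text, query_date, corpus, top_k=2):
    """Return the top_k (doc_id, score) pairs among date-eligible documents."""
    # Assumed date filter: drop candidates dated before the query document.
    candidates = [doc for doc in corpus if doc["date"] >= query_date]
    scores = bm25_scores(
        tokenize(query_text), [tokenize(doc["text"]) for doc in candidates]
    )
    ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    return [(doc["id"], score) for doc, score in ranked[:top_k]]


# Hypothetical toy corpus; the identifiers and texts are made up.
CORPUS = [
    {"id": "uk-1998", "date": date(1998, 7, 16),
     "text": "data protection act personal data processing"},
    {"id": "uk-1999", "date": date(1999, 3, 1),
     "text": "road traffic regulations speed limits"},
    {"id": "uk-1990", "date": date(1990, 5, 2),
     "text": "personal data protection order"},
]

results = prefetch(
    "directive on protection of personal data", date(1995, 10, 24), CORPUS
)
print(results)
```

In a two-step pipeline, the candidates returned by `prefetch` would then be passed to a re-ranker. The sketch stops at the pre-fetcher because that is the component the abstract reports as benefiting from the date filter.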
Related papers
- Enhancing Legal Case Retrieval via Scaling High-quality Synthetic Query-Candidate Pairs [67.54302101989542]
Legal case retrieval aims to provide similar cases as references for a given fact description.
Existing works mainly focus on case-to-case retrieval using lengthy queries.
Data scale is insufficient to satisfy the training requirements of existing data-hungry neural models.
arXiv Detail & Related papers (2024-10-09T06:26:39Z)
- Efficient Document Ranking with Learnable Late Interactions [73.41976017860006]
Cross-Encoder (CE) and Dual-Encoder (DE) models are two fundamental approaches for query-document relevance in information retrieval.
To predict relevance, CE models use joint query-document embeddings, while DE models maintain factorized query and document embeddings.
Recently, late-interaction models have been proposed to realize more favorable latency-quality tradeoffs, by using a DE structure followed by a lightweight scorer.
arXiv Detail & Related papers (2024-06-25T22:50:48Z)
- Query-driven Relevant Paragraph Extraction from Legal Judgments [1.2562034805037443]
Legal professionals often grapple with navigating lengthy legal judgements to pinpoint information that directly addresses their queries.
This paper focuses on the task of extracting relevant paragraphs from legal judgements based on the query.
We construct a specialized dataset for this task from the European Court of Human Rights (ECtHR) using the case law guides.
arXiv Detail & Related papers (2024-03-31T08:03:39Z)
- Fact Checking Beyond Training Set [64.88575826304024]
We show that the retriever-reader suffers from performance deterioration when it is trained on labeled data from one domain and used in another domain.
We propose an adversarial algorithm to make the retriever component robust against distribution shift.
We then construct eight fact checking scenarios from these datasets, and compare our model to a set of strong baseline models.
arXiv Detail & Related papers (2024-03-27T15:15:14Z)
- Identification of Regulatory Requirements Relevant to Business Processes: A Comparative Study on Generative AI, Embedding-based Ranking, Crowd and Expert-driven Methods [10.899912290518648]
This work examines how legal and domain experts can be assisted in the assessment of relevant requirements.
We compare an embedding-based NLP ranking method, a generative AI method using GPT-4, and a crowdsourced method with the purely manual method of creating labels by experts.
A gold standard is created for both BPMN2.0 processes and matched to real-world requirements from multiple regulatory documents.
arXiv Detail & Related papers (2024-01-02T12:08:31Z)
- Exploring Semi-supervised Hierarchical Stacked Encoder for Legal Judgement Prediction [0.6349503549199403]
We explore and propose a two-level classification mechanism, both supervised and unsupervised.
We use domain-specific pre-trained BERT to extract sentence embeddings from long documents, which are then processed by a transformer encoder layer.
We see higher performance gains than the previously proposed methods on the ILDC dataset.
arXiv Detail & Related papers (2023-11-14T12:03:26Z)
- U-CREAT: Unsupervised Case Retrieval using Events extrAcTion [2.2385755093672044]
We propose a new benchmark (in English) for the Prior Case Retrieval task: IL-PCR (Indian Legal Prior Case Retrieval) corpus.
We explore the role of events in legal case retrieval and propose an unsupervised retrieval pipeline, U-CREAT.
We find that the proposed unsupervised retrieval method significantly increases performance compared to BM25 and makes retrieval faster by a considerable margin.
arXiv Detail & Related papers (2023-07-11T13:51:12Z)
- WiCE: Real-World Entailment for Claims in Wikipedia [63.234352061821625]
We propose WiCE, a new fine-grained textual entailment dataset built on natural claim and evidence pairs extracted from Wikipedia.
In addition to standard claim-level entailment, WiCE provides entailment judgments over sub-sentence units of the claim.
We show that real claims in our dataset involve challenging verification and retrieval problems that existing models fail to address.
arXiv Detail & Related papers (2023-03-02T17:45:32Z)
- Does Recommend-Revise Produce Reliable Annotations? An Analysis on Missing Instances in DocRED [60.39125850987604]
We show that a recommend-revise scheme results in false negative samples and an obvious bias towards popular entities and relations.
The relabeled dataset is released to serve as a more reliable test set of document RE models.
arXiv Detail & Related papers (2022-04-17T11:29:01Z)
- GERE: Generative Evidence Retrieval for Fact Verification [57.78768817972026]
We propose GERE, the first system that retrieves evidence in a generative fashion.
The experimental results on the FEVER dataset show that GERE achieves significant improvements over the state-of-the-art baselines.
arXiv Detail & Related papers (2022-04-12T03:49:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.