Related papers: Regulatory Compliance through Doc2Doc Information Retrieval: A case study in EU/UK legislation where text similarity has limitations

Regulatory Compliance through Doc2Doc Information Retrieval: A case study in EU/UK legislation where text similarity has limitations

URL: http://arxiv.org/abs/2101.10726v1
Date: Tue, 26 Jan 2021 11:38:15 GMT
Title: Regulatory Compliance through Doc2Doc Information Retrieval: A case study in EU/UK legislation where text similarity has limitations
Authors: Ilias Chalkidis, Manos Fergadiotis, Nikolaos Manginas, Eva Katakalou and Prodromos Malakasiotis
Abstract summary: REG-IR is an application of document-to-document information retrieval. We show that fine-tuning a BERT model on an in-domain classification task produces the best representations for IR. We also show that neural re-rankers under-perform due to contradicting supervision, i.e., similar query-document pairs with opposite labels.
Score: 6.40476282000118
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Major scandals in corporate history have urged the need for regulatory compliance, where organizations need to ensure that their controls (processes) comply with relevant laws, regulations, and policies. However, keeping track of the constantly changing legislation is difficult, thus organizations are increasingly adopting Regulatory Technology (RegTech) to facilitate the process. To this end, we introduce regulatory information retrieval (REG-IR), an application of document-to-document information retrieval (DOC2DOC IR), where the query is an entire document making the task more challenging than traditional IR where the queries are short. Furthermore, we compile and release two datasets based on the relationships between EU directives and UK legislation. We experiment on these datasets using a typical two-step pipeline approach comprising a pre-fetcher and a neural re-ranker. Experimenting with various pre-fetchers from BM25 to k nearest neighbors over representations from several BERT models, we show that fine-tuning a BERT model on an in-domain classification task produces the best representations for IR. We also show that neural re-rankers under-perform due to contradicting supervision, i.e., similar query-document pairs with opposite labels. Thus, they are biased towards the pre-fetcher's score. Interestingly, applying a date filter further improves the performance, showcasing the importance of the time dimension.

Related papers

Segment First, Retrieve Better: Realistic Legal Search via Rhetorical Role-Based Queries [3.552993426200889]
TraceRetriever mirrors real-world legal search by operating with limited case information.<n>Our pipeline integrates BM25, Vector Database, and Cross-Encoder models, combining initial results through Reciprocal Rank Fusion.<n> Rhetorical annotations are generated using a Hierarchical BiLSTM CRF classifier trained on Indian judgments.
arXiv Detail & Related papers (2025-08-01T14:49:33Z)
Collapse of Dense Retrievers: Short, Early, and Literal Biases Outranking Factual Evidence [56.09494651178128]
Retrieval models are commonly used in Information Retrieval (IR) applications, such as Retrieval-Augmented Generation (RAG) We show that retrievers often rely on superficial patterns like over-prioritizing document beginnings, shorter documents, repeated entities, and literal matches. We show that these biases have direct consequences for downstream applications like RAG, where retrieval-preferred documents can mislead LLMs.
arXiv Detail & Related papers (2025-03-06T23:23:13Z)
Enhancing Legal Case Retrieval via Scaling High-quality Synthetic Query-Candidate Pairs [67.54302101989542]
Legal case retrieval aims to provide similar cases as references for a given fact description. Existing works mainly focus on case-to-case retrieval using lengthy queries. Data scale is insufficient to satisfy the training requirements of existing data-hungry neural models.
arXiv Detail & Related papers (2024-10-09T06:26:39Z)
Efficient Document Ranking with Learnable Late Interactions [73.41976017860006]
Cross-Encoder (CE) and Dual-Encoder (DE) models are two fundamental approaches for query-document relevance in information retrieval. To predict relevance, CE models use joint query-document embeddings, while DE models maintain factorized query and document embeddings. Recently, late-interaction models have been proposed to realize more favorable latency-quality tradeoffs, by using a DE structure followed by a lightweight scorer.
arXiv Detail & Related papers (2024-06-25T22:50:48Z)
Query-driven Relevant Paragraph Extraction from Legal Judgments [1.2562034805037443]
Legal professionals often grapple with navigating lengthy legal judgements to pinpoint information that directly address their queries. This paper focus on this task of extracting relevant paragraphs from legal judgements based on the query. We construct a specialized dataset for this task from the European Court of Human Rights (ECtHR) using the case law guides.
arXiv Detail & Related papers (2024-03-31T08:03:39Z)
Fact Checking Beyond Training Set [64.88575826304024]
We show that the retriever-reader suffers from performance deterioration when it is trained on labeled data from one domain and used in another domain. We propose an adversarial algorithm to make the retriever component robust against distribution shift. We then construct eight fact checking scenarios from these datasets, and compare our model to a set of strong baseline models.
arXiv Detail & Related papers (2024-03-27T15:15:14Z)
Identification of Regulatory Requirements Relevant to Business Processes: A Comparative Study on Generative AI, Embedding-based Ranking, Crowd and Expert-driven Methods [10.899912290518648]
This work examines how legal and domain experts can be assisted in the assessment of relevant requirements. We compare an embedding-based NLP ranking method, a generative AI method using GPT-4, and a crowdsourced method with the purely manual method of creating labels by experts. A gold standard is created for both BPMN2.0 processes and matched to real-world requirements from multiple regulatory documents.
arXiv Detail & Related papers (2024-01-02T12:08:31Z)
Exploring Semi-supervised Hierarchical Stacked Encoder for Legal Judgement Prediction [0.6349503549199403]
We explore and propose a two-level classification mechanism; both supervised and unsupervised. We use domain-specific pre-trained BERT to extract information from long documents in terms of sentence embeddings further processing with transformer encoder layer. We see higher performance gains than the previously proposed methods on the ILDC dataset.
arXiv Detail & Related papers (2023-11-14T12:03:26Z)
U-CREAT: Unsupervised Case Retrieval using Events extrAcTion [2.2385755093672044]
We propose a new benchmark (in English) for the Prior Case Retrieval task: IL-PCR (Indian Legal Prior Case Retrieval) corpus. We explore the role of events in legal case retrieval and propose an unsupervised retrieval method-based pipeline U-CREAT. We find that the proposed unsupervised retrieval method significantly increases performance compared to BM25 and makes retrieval faster by a considerable margin.
arXiv Detail & Related papers (2023-07-11T13:51:12Z)
WiCE: Real-World Entailment for Claims in Wikipedia [63.234352061821625]
We propose WiCE, a new fine-grained textual entailment dataset built on natural claim and evidence pairs extracted from Wikipedia. In addition to standard claim-level entailment, WiCE provides entailment judgments over sub-sentence units of the claim. We show that real claims in our dataset involve challenging verification and retrieval problems that existing models fail to address.
arXiv Detail & Related papers (2023-03-02T17:45:32Z)
Does Recommend-Revise Produce Reliable Annotations? An Analysis on Missing Instances in DocRED [60.39125850987604]
We show that a textit-revise scheme results in false negative samples and an obvious bias towards popular entities and relations. The relabeled dataset is released to serve as a more reliable test set of document RE models.
arXiv Detail & Related papers (2022-04-17T11:29:01Z)
GERE: Generative Evidence Retrieval for Fact Verification [57.78768817972026]
We propose GERE, the first system that retrieves evidences in a generative fashion. The experimental results on the FEVER dataset show that GERE achieves significant improvements over the state-of-the-art baselines.
arXiv Detail & Related papers (2022-04-12T03:49:35Z)

This list is automatically generated from the titles and abstracts of the papers in this site.