TransClean: Finding False Positives in Multi-Source Entity Matching under Real-World Conditions via Transitive Consistency
- URL: http://arxiv.org/abs/2506.04006v1
- Date: Wed, 04 Jun 2025 14:33:41 GMT
- Title: TransClean: Finding False Positives in Multi-Source Entity Matching under Real-World Conditions via Transitive Consistency
- Authors: Fernando de Meer Pardo, Branka Hadji Misheva, Martin Braschler, Kurt Stockinger
- Abstract summary: We present TransClean, a method for detecting false positive predictions of entity matching algorithms under real-world conditions. TransClean is explicitly designed to operate with multiple data sources in an efficient, robust and fast manner. Our experiments show that TransClean induces an average +24.42 F1 score improvement for entity matching in a multi-source setting.
- Score: 43.06143768014157
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present TransClean, a method for detecting false positive predictions of entity matching algorithms under real-world conditions characterized by large-scale, noisy, and unlabeled multi-source datasets that undergo distributional shifts. TransClean is explicitly designed to operate with multiple data sources in an efficient, robust and fast manner while accounting for edge cases and requiring limited manual labeling. TransClean leverages the Transitive Consistency of a matching, a measure of the consistency of a pairwise matching model f_θ on the matching G_{f_θ} it produces, based both on its predictions on directly evaluated record pairs and its predictions on implied record pairs. TransClean iteratively modifies a matching by gradually removing false positive matches while removing as few true positive matches as possible. At each step, the Transitive Consistency is estimated exclusively through model evaluations and yields quantities that serve as proxies for the amounts of true and false positives in the matching, without requiring any manual labeling; this produces an estimate of the quality of the matching and indicates which record groups are likely to contain false positives. In our experiments, we combine TransClean with a naively trained pairwise matching model (DistilBERT) and with a state-of-the-art end-to-end matching method (CLER), and illustrate the flexibility of TransClean in detecting most of the false positives of either setup across a variety of datasets. Our experiments show that TransClean induces an average +24.42 F1 score improvement for entity matching in a multi-source setting when compared to traditional pairwise matching algorithms.
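The abstract's central idea, checking whether a pairwise model's predictions remain consistent on the pairs *implied* by its matched groups, can be sketched in a few lines. The code below is an illustrative approximation only, not the paper's actual algorithm; the names `predict_pair`, `match_groups`, and `transitive_consistency` are hypothetical, and the sketch evaluates all pairs exhaustively, which would not scale to the large datasets the paper targets.

```python
from itertools import combinations

def match_groups(records, predict_pair):
    """Union-find over directly predicted matches to form record groups."""
    parent = {r: r for r in records}

    def find(x):
        # Path-halving find.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(records, 2):
        if predict_pair(a, b):
            ra, rb = find(a), find(b)
            if ra != rb:
                parent[ra] = rb

    groups = {}
    for r in records:
        groups.setdefault(find(r), []).append(r)
    return list(groups.values())

def transitive_consistency(group, predict_pair):
    """Fraction of pairs implied by a group that the model also accepts.
    A low value suggests the group was glued together by false positives."""
    pairs = list(combinations(group, 2))
    if not pairs:
        return 1.0
    return sum(bool(predict_pair(a, b)) for a, b in pairs) / len(pairs)

if __name__ == "__main__":
    # Toy predictor: records match if they share any token. The shared
    # token "inc" acts as a single false positive that fuses two distinct
    # entities into one group, which the consistency proxy then flags.
    predict = lambda a, b: bool(set(a.split()) & set(b.split()))
    records = ["acme corp", "acme inc", "beta inc", "beta ltd"]
    groups = match_groups(records, predict)
    for g in groups:
        print(g, transitive_consistency(g, predict))
```

In the toy run, all four records collapse into one group, but only half of the six implied pairs are accepted by the model (consistency 0.5), mirroring the paper's observation that implied-pair predictions expose groups contaminated by false positives.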
Related papers
- From Invariant Representations to Invariant Data: Provable Robustness to Spurious Correlations via Noisy Counterfactual Matching [11.158961763380278]
Recent alternatives improve robustness by leveraging test-time data, but such data may be unavailable in practice. We take a data-centric approach by leveraging invariant data pairs and noisy counterfactual matching. We validate on a synthetic dataset and demonstrate on real-world benchmarks that linear probing on a pretrained backbone improves robustness.
arXiv Detail & Related papers (2025-05-30T17:42:32Z) - Search-Based Correction of Reasoning Chains for Language Models [72.61861891295302]
Chain-of-Thought (CoT) reasoning has advanced the capabilities and transparency of language models (LMs). We introduce a new self-correction framework that augments each reasoning step in a CoT with a latent variable indicating its veracity. We also introduce Search Corrector, a discrete search algorithm over veracity assignments.
arXiv Detail & Related papers (2025-05-17T04:16:36Z) - Fractional Correspondence Framework in Detection Transformer [13.388933240897492]
The Detection Transformer (DETR) has significantly simplified the matching process in object detection tasks. This algorithm facilitates optimal one-to-one matching of predicted bounding boxes to ground-truth annotations during training. We propose a flexible matching strategy that captures the cost of aligning predictions with ground truths to find the most accurate correspondences.
arXiv Detail & Related papers (2025-03-06T05:29:20Z) - GraLMatch: Matching Groups of Entities with Graphs and Language Models [35.75564019239946]
We present an end-to-end multi-source Entity Matching problem.
The goal is to assign records that originate from multiple data sources but represent the same real-world entity to the same group.
We show that handling transitively matched records is challenging, since a small number of false positive pairwise match predictions can throw off the group assignment of large quantities of records.
arXiv Detail & Related papers (2024-06-21T09:44:16Z) - Noisy Correspondence Learning with Self-Reinforcing Errors Mitigation [63.180725016463974]
Cross-modal retrieval relies on well-matched large-scale datasets that are laborious to collect in practice.
We introduce a novel noisy correspondence learning framework, namely Self-Reinforcing Errors Mitigation (SREM).
arXiv Detail & Related papers (2023-12-27T09:03:43Z) - Semi-DETR: Semi-Supervised Object Detection with Detection Transformers [105.45018934087076]
We analyze the DETR-based framework on semi-supervised object detection (SSOD).
We present Semi-DETR, the first transformer-based end-to-end semi-supervised object detector.
Our method outperforms all state-of-the-art methods by clear margins.
arXiv Detail & Related papers (2023-07-16T16:32:14Z) - Contrastive pretraining for semantic segmentation is robust to noisy positive pairs [0.0]
Domain-specific variants of contrastive learning can construct positive pairs from two distinct images.
We find that downstream semantic segmentation is either robust to the noisy pairs or even benefits from them.
arXiv Detail & Related papers (2022-11-24T18:59:01Z) - Generate, Discriminate and Contrast: A Semi-Supervised Sentence Representation Learning Framework [68.04940365847543]
We propose a semi-supervised sentence embedding framework, GenSE, that effectively leverages large-scale unlabeled data.
Our method includes three parts: 1) Generate: a generator/discriminator model is jointly trained to synthesize sentence pairs from an open-domain unlabeled corpus; 2) Discriminate: noisy sentence pairs are filtered out by the discriminator to acquire high-quality positive and negative sentence pairs; 3) Contrast: a prompt-based contrastive approach is presented for sentence representation learning with both annotated and synthesized data.
arXiv Detail & Related papers (2022-10-30T10:15:21Z) - Visualizing Classifier Adjacency Relations: A Case Study in Speaker Verification and Voice Anti-Spoofing [72.4445825335561]
We propose a simple method to derive 2D representation from detection scores produced by an arbitrary set of binary classifiers.
Based upon rank correlations, our method facilitates a visual comparison of classifiers with arbitrary scores.
While the approach is fully versatile and can be applied to any detection task, we demonstrate the method using scores produced by automatic speaker verification and voice anti-spoofing systems.
arXiv Detail & Related papers (2021-06-11T13:03:33Z) - Contrastive Attraction and Contrastive Repulsion for Representation Learning [131.72147978462348]
Contrastive learning (CL) methods learn data representations in a self-supervised manner, where the encoder contrasts each positive sample against multiple negative samples.
Recent CL methods have achieved promising results when pretrained on large-scale datasets, such as ImageNet.
We propose a doubly CL strategy that separately compares positive and negative samples within their own groups, and then proceeds with a contrast between positive and negative groups.
arXiv Detail & Related papers (2021-05-08T17:25:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.