GraLMatch: Matching Groups of Entities with Graphs and Language Models
- URL: http://arxiv.org/abs/2406.15015v1
- Date: Fri, 21 Jun 2024 09:44:16 GMT
- Title: GraLMatch: Matching Groups of Entities with Graphs and Language Models
- Authors: Fernando De Meer Pardo, Claude Lehmann, Dennis Gehrig, Andrea Nagy, Stefano Nicoli, Branka Hadji Misheva, Martin Braschler, Kurt Stockinger
- Abstract summary: We present an end-to-end multi-source Entity Matching problem.
The goal is to assign records that originate from multiple data sources but represent the same real-world entity to the same group.
We show how considering transitively matched records is challenging since a limited amount of false positive pairwise match predictions can throw off the group assignment of large quantities of records.
- Score: 35.75564019239946
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In this paper, we present an end-to-end multi-source Entity Matching problem, which we call entity group matching, where the goal is to assign to the same group, records originating from multiple data sources but representing the same real-world entity. We focus on the effects of transitively matched records, i.e. the records connected by paths in the graph G = (V,E) whose nodes and edges represent the records and whether they are a match or not. We present a real-world instance of this problem, where the challenge is to match records of companies and financial securities originating from different data providers. We also introduce two new multi-source benchmark datasets that present similar matching challenges as real-world records. A distinctive characteristic of these records is that they are regularly updated following real-world events, but updates are not applied uniformly across data sources. This phenomenon makes the matching of certain groups of records only possible through the use of transitive information. In our experiments, we illustrate how considering transitively matched records is challenging since a limited amount of false positive pairwise match predictions can throw off the group assignment of large quantities of records. Thus, we propose GraLMatch, a method that can partially detect and remove false positive pairwise predictions through graph-based properties. Finally, we showcase how fine-tuning a Transformer-based model (DistilBERT) on a reduced number of labeled samples yields a better final entity group matching than training on more samples and/or incorporating fine-tuning optimizations, illustrating how precision becomes the deciding factor in the entity group matching of large volumes of records.
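The effect of transitive matching described in the abstract can be illustrated with a minimal sketch. It assumes groups are formed as the connected components of the graph G = (V,E) built from predicted pairwise matches; this is not the GraLMatch cleanup step itself, and the record identifiers and the helper names `group_records` and `implied_pairs` are hypothetical.

```python
from collections import defaultdict


def group_records(records, matched_pairs):
    """Group records as connected components of the pairwise-match graph.

    Illustrative union-find sketch; not the authors' implementation.
    """
    parent = {r: r for r in records}

    def find(x):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for r in records:
        groups[find(r)].add(r)
    return sorted(sorted(g) for g in groups.values())


def implied_pairs(groups):
    """Number of pairwise matches implied transitively by the grouping."""
    return sum(len(g) * (len(g) - 1) // 2 for g in groups)


# Hypothetical records: two real-world entities, three sources each.
records = ["A1", "A2", "A3", "B1", "B2", "B3"]
true_pairs = [("A1", "A2"), ("A2", "A3"), ("B1", "B2"), ("B2", "B3")]

clean = group_records(records, true_pairs)
# A single false positive edge between the two entities.
noisy = group_records(records, true_pairs + [("A3", "B1")])

print(clean, implied_pairs(clean))
print(noisy, implied_pairs(noisy))
```

In this toy case one false positive edge merges two groups of three records into a single group of six, raising the number of transitively implied matches from 6 to 15, which is why precision of the pairwise predictions matters so much for the final grouping.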
Related papers
- Full Triple Matcher: Integrating all triple elements between heterogeneous Knowledge Graphs [0.09471093245585005]
Knowledge graphs (KGs) are powerful tools for representing and reasoning over structured information.
Current approaches may fall short in scenarios where diverse and complex contexts need to be integrated.
We propose a novel KG integration method consisting of label matching and triple matching.
arXiv Detail & Related papers (2025-07-20T07:46:55Z) - TransClean: Finding False Positives in Multi-Source Entity Matching under Real-World Conditions via Transitive Consistency [43.06143768014157]
We present TransClean, a method for detecting false positive predictions of entity matching algorithms under real-world conditions.
TransClean is explicitly designed to operate with multiple data sources in an efficient, robust and fast manner.
Our experiments show that TransClean induces an average +24.42 F1 score improvement for entity matching in a multi-source setting.
arXiv Detail & Related papers (2025-06-04T14:33:41Z) - Entity Matching using Large Language Models [3.7277730514654555]
This paper investigates using generative large language models (LLMs) as a less task-specific training data-dependent alternative to PLM-based matchers.
We show that GPT4 can generate structured explanations for matching decisions and can automatically identify potential causes of matching errors.
arXiv Detail & Related papers (2023-10-17T13:12:32Z) - Single-Stage Visual Relationship Learning using Conditional Queries [60.90880759475021]
TraCQ is a new formulation for scene graph generation that avoids the multi-task learning problem and the entity pair distribution.
We employ a DETR-based encoder-decoder design with conditional queries to significantly reduce the entity label space as well.
Experimental results show that TraCQ not only outperforms existing single-stage scene graph generation methods, it also beats many state-of-the-art two-stage methods on the Visual Genome dataset.
arXiv Detail & Related papers (2023-06-09T06:02:01Z) - GVdoc: Graph-based Visual Document Classification [17.350393956461783]
We propose GVdoc, a graph-based document classification model.
Our approach generates a document graph based on its layout, and then trains a graph neural network to learn node and graph embeddings.
We show that our model, even with fewer parameters, outperforms state-of-the-art models on out-of-distribution data.
arXiv Detail & Related papers (2023-05-26T19:23:20Z) - Does Recommend-Revise Produce Reliable Annotations? An Analysis on Missing Instances in DocRED [60.39125850987604]
We show that a recommend-revise scheme results in false negative samples and an obvious bias towards popular entities and relations.
The relabeled dataset is released to serve as a more reliable test set of document RE models.
arXiv Detail & Related papers (2022-04-17T11:29:01Z) - Towards Group Robustness in the presence of Partial Group Labels [61.33713547766866]
Spurious correlations between input samples and the target labels wrongly direct the neural network predictions.
We propose an algorithm that optimizes for the worst-off group assignments from a constraint set.
We show improvements in the minority group's performance while preserving overall aggregate accuracy across groups.
arXiv Detail & Related papers (2022-01-10T22:04:48Z) - Unsupervised Matching of Data and Text [6.2520079463149205]
We introduce a framework that supports matching textual content and structured data in an unsupervised setting.
Our method builds a fine-grained graph over the content of the corpora and derives word embeddings to represent the objects to match in a low dimensional space.
Experiments on real use cases and public datasets show that our framework produces embeddings that outperform word embeddings and fine-tuned language models.
arXiv Detail & Related papers (2021-12-16T10:40:48Z) - Ranking Models in Unlabeled New Environments [74.33770013525647]
We introduce the problem of ranking models in unlabeled new environments.
We use a proxy dataset that 1) is fully labeled and 2) well reflects the true model rankings in a given target environment.
Specifically, datasets that are more similar to the unlabeled target domain are found to better preserve the relative performance rankings.
arXiv Detail & Related papers (2021-08-23T17:57:15Z) - Partially-Aligned Data-to-Text Generation with Distant Supervision [69.15410325679635]
We propose a new generation task called Partially-Aligned Data-to-Text Generation (PADTG).
It is more practical since it utilizes automatically annotated data for training and thus considerably expands the application domains.
Our framework outperforms all baseline models and verifies the feasibility of utilizing partially-aligned data.
arXiv Detail & Related papers (2020-10-03T03:18:52Z) - Supervised machine learning techniques for data matching based on similarity metrics [0.0]
Data matching is the field that tries to identify instances in data that refer to the same real-world entity.
In this study, machine learning techniques are combined with string similarity functions and applied to the field of data matching.
The performance was compared with a solution from FISCAL Technologies as a benchmark against currently available deduplication solutions.
arXiv Detail & Related papers (2020-07-08T10:04:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.