Graph-based hierarchical record clustering for unsupervised entity
resolution
- URL: http://arxiv.org/abs/2112.06331v1
- Date: Sun, 12 Dec 2021 21:58:07 GMT
- Title: Graph-based hierarchical record clustering for unsupervised entity
resolution
- Authors: Islam Akef Ebeid, John R. Talburt, Md Abdus Salam Siddique
- Abstract summary: We build upon a state-of-the-art probabilistic framework named the Data Washing Machine (DWM).
We introduce a graph-based hierarchical 2-step record clustering method (GDWM) that first identifies large, connected components or soft clusters in the matched record pairs.
That is followed by breaking down the discovered soft clusters into more precise entity clusters in a hierarchical manner.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Here we study the problem of matched record clustering in unsupervised entity
resolution. We build upon a state-of-the-art probabilistic framework named the
Data Washing Machine (DWM). We introduce a graph-based hierarchical 2-step
record clustering method (GDWM) that first identifies large, connected
components or, as we call them, soft clusters in the matched record pairs using
a graph-based transitive closure algorithm utilized in the DWM. That is
followed by breaking down the discovered soft clusters into more precise entity
clusters in a hierarchical manner using an adapted graph-based modularity
optimization method. Our approach provides several advantages over the original
implementation of the DWM, mainly a significant speed-up, increased precision,
and overall increased F1 scores. We demonstrate the efficacy of our approach
using experiments on multiple synthetic datasets. Our results also provide
evidence of the utility of graph theory-based algorithms despite their sparsity
in the literature on unsupervised entity resolution.
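The two-step idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' GDWM implementation: the function names (`soft_clusters`, `modularity`, `entity_clusters`, `resolve`) and the toy record IDs are hypothetical, and a naive greedy modularity merge stands in for the paper's adapted modularity optimization method. Matched pairs are treated as unweighted edges.

```python
from collections import defaultdict
from itertools import combinations


def soft_clusters(pairs):
    """Step 1: connected components of the matched-pair graph (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)
    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return list(groups.values())


def modularity(edges, communities):
    """Newman modularity Q of a partition of an unweighted graph."""
    m = len(edges)
    label = {n: i for i, c in enumerate(communities) for n in c}
    inside, degree = defaultdict(int), defaultdict(int)
    for a, b in edges:
        degree[label[a]] += 1
        degree[label[b]] += 1
        if label[a] == label[b]:
            inside[label[a]] += 1
    return sum(inside[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


def entity_clusters(edges, nodes):
    """Step 2 (assumed stand-in): greedily merge communities while Q improves."""
    communities = [{n} for n in sorted(nodes)]
    if not edges:
        return communities
    best = modularity(edges, communities)
    while len(communities) > 1:
        candidates = []
        for i, j in combinations(range(len(communities)), 2):
            merged = [c for k, c in enumerate(communities) if k not in (i, j)]
            merged.append(communities[i] | communities[j])
            candidates.append((modularity(edges, merged), merged))
        q, merged = max(candidates, key=lambda t: t[0])
        if q <= best:
            break  # no merge increases modularity any further
        best, communities = q, merged
    return communities


def resolve(pairs):
    """Run step 1, then refine each soft cluster independently with step 2."""
    clusters = []
    for component in soft_clusters(pairs):
        edges = [(a, b) for a, b in pairs if a in component]
        clusters.extend(entity_clusters(edges, component))
    return clusters


# Toy input: two triangles joined by one spurious match, plus a separate pair.
pairs = [("a1", "a2"), ("a2", "a3"), ("a1", "a3"),
         ("b1", "b2"), ("b2", "b3"), ("b1", "b3"),
         ("a3", "b1"), ("c1", "c2")]
result = sorted(sorted(c) for c in resolve(pairs))
```

On this toy input, step 1 yields two soft clusters (the six `a`/`b` records together, and the `c` pair), and step 2 separates the two triangles because merging them across the single bridging match would lower modularity. The exhaustive pairwise merge is only practical for small components; a production system would more likely use an established community detection routine.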
Related papers
- Fast and Scalable Semi-Supervised Learning for Multi-View Subspace Clustering [13.638434337947302]
FSSMSC is a novel solution to the high computational complexity commonly found in existing approaches.
The method generates a consensus anchor graph across all views, representing each data point as a sparse linear combination of chosen landmarks.
The effectiveness and efficiency of FSSMSC are validated through extensive experiments on multiple benchmark datasets of varying scales.
arXiv Detail & Related papers (2024-08-11T06:54:00Z)
- A Clustering Method with Graph Maximum Decoding Information [6.11503045313947]
We present a novel clustering method for maximizing decoding information within graph-based models, named CMDI.
CMDI incorporates two-dimensional structural information theory into the clustering process, consisting of two phases: graph structure extraction and graph partitioning.
Empirical evaluations on three real-world datasets demonstrate that CMDI outperforms classical baseline methods, exhibiting a superior decoding information ratio (DI-R).
These findings underscore the effectiveness of CMDI in enhancing decoding information quality and computational efficiency, positioning it as a valuable tool in graph-based clustering analyses.
arXiv Detail & Related papers (2024-03-18T05:18:19Z)
- Deep Contrastive Graph Learning with Clustering-Oriented Guidance [61.103996105756394]
Graph Convolutional Network (GCN) has exhibited remarkable potential in improving graph-based clustering.
Models estimate an initial graph beforehand to apply GCN.
A Deep Contrastive Graph Learning (DCGL) model is proposed for general data clustering.
arXiv Detail & Related papers (2023-05-12T11:27:20Z)
- One-step Bipartite Graph Cut: A Normalized Formulation and Its Application to Scalable Subspace Clustering [56.81492360414741]
We show how to enforce a one-step normalized cut for bipartite graphs, especially with linear-time complexity.
In this paper, we first characterize a novel one-step bipartite graph cut criterion with normalized constraints, and theoretically prove its equivalence to a trace problem.
We extend this cut criterion to a scalable subspace clustering approach, where adaptive anchor learning, bipartite graph learning, and one-step normalized bipartite graph partitioning are simultaneously modeled.
arXiv Detail & Related papers (2023-05-12T11:27:20Z)
- Dual Contrastive Attributed Graph Clustering Network [6.796682703663566]
We propose a generic framework called Dual Contrastive Attributed Graph Clustering Network (DCAGC).
In DCAGC, the Neighborhood Contrast Module maximizes the similarity of neighboring nodes and improves the quality of the node representation.
All the modules of DCAGC are trained and optimized in a unified framework, so the learned node representation contains clustering-oriented messages.
arXiv Detail & Related papers (2022-06-16T03:17:01Z)
- Interpolation-based Correlation Reduction Network for Semi-Supervised Graph Learning [49.94816548023729]
We propose a novel graph contrastive learning method, termed Interpolation-based Correlation Reduction Network (ICRN).
In our method, we improve the discriminative capability of the latent feature by enlarging the margin of decision boundaries.
By combining the two settings, we extract rich supervision information from both the abundant unlabeled nodes and the rare yet valuable labeled nodes for discriminative representation learning.
arXiv Detail & Related papers (2022-06-06T14:26:34Z)
- Deep Graph Clustering via Dual Correlation Reduction [37.973072977988494]
We propose a novel self-supervised deep graph clustering method termed Dual Correlation Reduction Network (DCRN).
In our method, we first design a siamese network to encode samples. Then by forcing the cross-view sample correlation matrix and cross-view feature correlation matrix to approximate two identity matrices, respectively, we reduce the information correlation in the dual-level.
In order to alleviate representation collapse caused by over-smoothing in GCN, we introduce a propagation regularization term to enable the network to gain long-distance information.
arXiv Detail & Related papers (2021-12-29T04:05:38Z)
- Meta Clustering Learning for Large-scale Unsupervised Person Re-identification [124.54749810371986]
We propose a "small data for big task" paradigm dubbed Meta Clustering Learning (MCL).
MCL only pseudo-labels a subset of the entire unlabeled data via clustering to save computing for the first-phase training.
Our method significantly saves computational cost while achieving a comparable or even better performance compared to prior works.
arXiv Detail & Related papers (2021-11-19T04:10:18Z)
- Deep Attention-guided Graph Clustering with Dual Self-supervision [49.040136530379094]
We propose a novel method, namely deep attention-guided graph clustering with dual self-supervision (DAGC).
We develop a dual self-supervision solution consisting of a soft self-supervision strategy with a triplet Kullback-Leibler divergence loss and a hard self-supervision strategy with a pseudo supervision loss.
Our method consistently outperforms state-of-the-art methods on six benchmark datasets.
arXiv Detail & Related papers (2021-11-10T06:53:03Z)
- Effective and Efficient Graph Learning for Multi-view Clustering [173.8313827799077]
We propose an effective and efficient graph learning model for multi-view clustering.
Our method exploits the similarity between graphs of different views by minimizing the tensor Schatten p-norm.
Our proposed algorithm is time-economical, obtains stable results, and scales well with the data size.
arXiv Detail & Related papers (2021-08-15T13:14:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.