ShallowBlocker: Improving Set Similarity Joins for Blocking
- URL: http://arxiv.org/abs/2312.15835v1
- Date: Tue, 26 Dec 2023 00:31:43 GMT
- Title: ShallowBlocker: Improving Set Similarity Joins for Blocking
- Authors: Nils Barlaug
- Abstract summary: We propose a hands-off blocking method based on classical string similarity measures: ShallowBlocker.
It uses a novel hybrid set similarity join that combines absolute similarity, relative similarity, and local cardinality conditions with a new, effective pre-candidate filter that replaces the size filter.
We show that the method achieves state-of-the-art pair effectiveness on both unsupervised and supervised blocking in a scalable way.
- Score: 1.8492669447784602
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Blocking is a crucial step in large-scale entity matching but often requires
significant manual engineering from an expert for each new dataset. Recent work
has shown that deep learning is state-of-the-art and has great potential for
achieving hands-off and accurate blocking compared to classical methods.
However, in practice, such deep learning methods are often unstable, offer
little interpretability, and require hyperparameter tuning and significant
computational resources.
In this paper, we propose a hands-off blocking method based on classical
string similarity measures: ShallowBlocker. It uses a novel hybrid set
similarity join combining absolute similarity, relative similarity, and local
cardinality conditions with a new, effective pre-candidate filter that replaces
the size filter. We show that the method achieves state-of-the-art pair effectiveness on
both unsupervised and supervised blocking in a scalable way.
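For intuition, below is a minimal, hedged Python sketch of a blocking pass that combines an absolute overlap condition, a relative (Jaccard) condition, and a per-record candidate cap. The thresholds `tau_abs`, `tau_rel`, and the cap `k` are illustrative stand-ins; the paper's actual join conditions and its pre-candidate filter are more refined, and a real implementation would index tokens rather than compare all pairs.

```python
# Illustrative sketch only: ShallowBlocker's actual join and pre-candidate
# filter are more refined, and a real implementation indexes tokens rather
# than comparing all pairs.
from itertools import combinations

def tokens(record: str) -> frozenset:
    return frozenset(record.lower().split())

def jaccard(a: frozenset, b: frozenset) -> float:
    return len(a & b) / len(a | b)

def hybrid_blocking(records, tau_abs=2, tau_rel=0.4, k=5):
    """Candidate pairs must share >= tau_abs tokens (absolute condition)
    or reach Jaccard >= tau_rel (relative condition); each record keeps
    at most k partners (a stand-in for a local cardinality condition)."""
    sets = [tokens(r) for r in records]
    per_record = {i: [] for i in range(len(records))}
    for i, j in combinations(range(len(records)), 2):
        overlap = len(sets[i] & sets[j])
        sim = jaccard(sets[i], sets[j])
        if overlap >= tau_abs or sim >= tau_rel:
            per_record[i].append((sim, j))
            per_record[j].append((sim, i))
    pairs = set()
    for i, cands in per_record.items():
        for _, j in sorted(cands, reverse=True)[:k]:  # local top-k cap
            pairs.add((min(i, j), max(i, j)))
    return pairs

print(hybrid_blocking(["apple iphone 12 64gb", "iphone 12 apple 64 gb", "galaxy s21"]))
```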
Related papers
- Towards Universal Dense Blocking for Entity Resolution [49.06313308481536]
We propose UniBlocker, a dense blocker that is pre-trained on a domain-independent, easily-obtainable corpus.
By conducting domain-independent pre-training, UniBlocker can be adapted to various downstream blocking scenarios without requiring domain-specific fine-tuning.
Our experiments show that the proposed UniBlocker, without any domain-specific learning, significantly outperforms previous self- and unsupervised dense blocking methods.
arXiv Detail & Related papers (2024-04-23T08:39:29Z)
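As a rough illustration of dense blocking in general (not UniBlocker's architecture), records can be embedded with any off-the-shelf encoder and nearest neighbors taken as candidate pairs; the model name below is an arbitrary stand-in, not the UniBlocker checkpoint.

```python
# Generic dense-blocking sketch; the encoder is an off-the-shelf stand-in,
# not the UniBlocker model.
from sentence_transformers import SentenceTransformer
from sklearn.neighbors import NearestNeighbors

records = ["apple iphone 12 64gb", "iphone 12, 64 gb, apple", "samsung galaxy s21"]
emb = SentenceTransformer("all-MiniLM-L6-v2").encode(records)

nbrs = NearestNeighbors(n_neighbors=2, metric="cosine").fit(emb)
_, idx = nbrs.kneighbors(emb)
candidates = {(min(i, j), max(i, j)) for i, row in enumerate(idx) for j in row if i != j}
print(candidates)
```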
- Block Sparse Bayesian Learning: A Diversified Scheme [16.61484758008309]
We introduce a novel prior called Diversified Block Sparse Prior to characterize the widespread block sparsity phenomenon in real-world data.
By allowing diversification on intra-block variance and inter-block correlation matrices, we effectively address the sensitivity issue of existing block sparse learning methods to pre-defined block information.
arXiv Detail & Related papers (2024-02-07T08:18:06Z)
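For context, classical block sparse Bayesian learning places a zero-mean Gaussian prior on each block; a sketch of that standard form (which the diversified prior generalizes by allowing richer intra-block variance and inter-block correlation) is:

```latex
% Standard block-sparse prior; x is partitioned into g blocks x_1, ..., x_g.
% gamma_i controls block-level sparsity (gamma_i = 0 prunes block i) and
% B_i models intra-block correlation.
p(x_i \mid \gamma_i, B_i) = \mathcal{N}\!\left(x_i;\, \mathbf{0},\, \gamma_i B_i\right),
\qquad i = 1, \dots, g
```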
- Approach of variable clustering and compression for learning large Bayesian networks [0.0]
This paper describes a new approach for learning structures of large Bayesian networks based on blocks resulting from feature space clustering.
The advantage of the approach is evaluated in terms of speed of work as well as the accuracy of the found structures.
arXiv Detail & Related papers (2022-08-29T13:55:32Z)
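A minimal sketch of the block idea, assuming a simple correlation-based clustering (the paper's actual clustering and compression scheme may differ):

```python
# Hedged sketch: group variables into blocks via correlation clustering,
# then learn structure per block instead of over all variables at once.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.random((500, 40))                       # toy data: 40 variables
corr = np.abs(np.corrcoef(X, rowvar=False))
dist = 1.0 - corr[np.triu_indices(40, k=1)]     # condensed distance from |corr|
blocks = fcluster(linkage(dist, method="average"), t=5, criterion="maxclust")

# A structure learner (e.g. hill climbing) would now run inside each block,
# searching far fewer candidate edges than over the full variable set.
for b in np.unique(blocks):
    print("block", b, "-> variables", np.where(blocks == b)[0])
```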
- Block shuffling learning for Deepfake Detection [9.180904212520355]
Deepfake detection methods based on convolutional neural networks (CNN) have demonstrated high accuracy.
These methods often suffer from decreased performance when faced with unknown forgery methods and common transformations.
We propose a novel block shuffling regularization method to address this issue.
arXiv Detail & Related papers (2022-02-06T17:16:46Z)
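A hedged sketch of the block-shuffling idea as an augmentation-style regularizer (a generic reading, not the paper's exact loss):

```python
# Split a CHW image into a grid of blocks and permute them; training the
# detector to behave consistently under the shuffle discourages reliance
# on global layout. Generic sketch, not the paper's exact regularizer.
import torch

def shuffle_blocks(img: torch.Tensor, grid: int = 4) -> torch.Tensor:
    c, h, w = img.shape
    bh, bw = h // grid, w // grid
    blocks = (img[:, :bh * grid, :bw * grid]
              .unfold(1, bh, bh).unfold(2, bw, bw)   # C x grid x grid x bh x bw
              .reshape(c, grid * grid, bh, bw))
    perm = torch.randperm(grid * grid)
    return (blocks[:, perm]
            .reshape(c, grid, grid, bh, bw)
            .permute(0, 1, 3, 2, 4)                  # reassemble rows/cols
            .reshape(c, grid * bh, grid * bw))

x = torch.rand(3, 224, 224)
x_aug = shuffle_blocks(x)   # train on both x and x_aug with the same label
```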
- Recall@k Surrogate Loss with Large Batches and Similarity Mixup [62.67458021725227]
Direct optimization, by gradient descent, of an evaluation metric is not possible when it is non-differentiable.
In this work, a differentiable surrogate loss for the recall is proposed.
The proposed method achieves state-of-the-art results in several image retrieval benchmarks.
arXiv Detail & Related papers (2021-08-25T11:09:11Z)
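The core trick is to replace the hard ranking indicator with a sigmoid so gradients flow; a hedged sketch (the paper adds refinements such as similarity mixup and very large batches):

```python
# Smooth recall@k: count how many items outrank each item with sigmoids
# instead of hard comparisons. Sketch only; temperatures are illustrative.
import torch

def smooth_recall_at_k(sim, positive_mask, k, temperature=0.05):
    """sim: (Q, N) query-gallery similarities; positive_mask: (Q, N) bool."""
    diff = sim.unsqueeze(2) - sim.unsqueeze(1)            # [q, j, i] = sim_j - sim_i
    soft_rank = torch.sigmoid(diff / temperature).sum(1)  # soft count of items above i
    in_top_k = torch.sigmoid((k - soft_rank) / temperature)
    recall = (in_top_k * positive_mask).sum(1) / positive_mask.sum(1).clamp(min=1)
    return recall.mean()        # maximize this, i.e. minimize 1 - recall

sim = torch.rand(4, 32, requires_grad=True)
loss = 1 - smooth_recall_at_k(sim, torch.rand(4, 32) > 0.8, k=5)
loss.backward()
```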
- Manifold Regularized Dynamic Network Pruning [102.24146031250034]
This paper proposes a new paradigm that dynamically removes redundant filters by embedding the manifold information of all instances into the space of pruned networks.
The effectiveness of the proposed method is verified on several benchmarks, which shows better performance in terms of both accuracy and computational cost.
arXiv Detail & Related papers (2021-03-10T03:59:03Z)
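A generic sketch of instance-wise dynamic filter gating, the family this method belongs to; the gating head below is hypothetical, and the paper's manifold-regularized saliency is more sophisticated:

```python
# Per-input channel gating: a small head scores filters for each instance
# and low-scoring filters are zeroed out. (Training the hard gate would
# need a straight-through estimator; omitted for brevity.)
import torch
import torch.nn as nn

class DynamicPrunedConv(nn.Module):
    def __init__(self, cin, cout, keep_ratio=0.5):
        super().__init__()
        self.conv = nn.Conv2d(cin, cout, 3, padding=1)
        self.saliency = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                      nn.Linear(cin, cout))
        self.k = max(1, int(cout * keep_ratio))

    def forward(self, x):
        s = self.saliency(x)                           # per-instance filter scores
        thresh = s.topk(self.k, dim=1).values[:, -1:]  # keep the top-k filters
        gate = (s >= thresh).float().unsqueeze(-1).unsqueeze(-1)
        return self.conv(x) * gate                     # zero out pruned filters

y = DynamicPrunedConv(3, 16)(torch.rand(2, 3, 32, 32))
print(y.shape)  # torch.Size([2, 16, 32, 32])
```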
- Fast Network Community Detection with Profile-Pseudo Likelihood Methods [19.639557431997037]
Most algorithms for fitting the block model likelihood function cannot scale to large-scale networks.
We propose a novel likelihood approach that decouples row and column labels in the likelihood function.
We show that our method provides strongly consistent estimates of the communities in a block model.
arXiv Detail & Related papers (2020-11-01T23:40:26Z)
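A hedged sketch of the decoupling idea for a stochastic block model: row labels are updated while column labels stay fixed, then the roles are swapped. The paper's profile step and consistency guarantees go well beyond this:

```python
# One pseudo-likelihood-style pass: rows are reassigned independently given
# fixed column labels, which makes the update embarrassingly parallel.
# Sketch only; assumes every community is currently non-empty.
import numpy as np

def update_row_labels(A, row_z, col_z, K):
    counts = np.stack([A[:, col_z == k].sum(1) for k in range(K)], axis=1)  # (n, K)
    sizes = np.array([(col_z == k).sum() for k in range(K)])
    rates = np.vstack([counts[row_z == a].mean(0) for a in range(K)])       # (K, K)
    rates = np.clip(rates / np.maximum(sizes, 1), 1e-6, 1 - 1e-6)
    # per-row log-likelihood of its block edge counts under each community
    loglik = counts @ np.log(rates).T + (sizes - counts) @ np.log1p(-rates).T
    return loglik.argmax(1)   # best community per row, rows decoupled
```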
- CIMON: Towards High-quality Hash Codes [63.37321228830102]
We propose a new method named Comprehensive sImilarity Mining and cOnsistency learNing (CIMON).
First, we use global refinement and similarity statistical distribution to obtain reliable and smooth guidance. Second, both semantic and contrastive consistency learning are introduced to derive both disturb-invariant and discriminative hash codes.
arXiv Detail & Related papers (2020-10-15T14:47:14Z)
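A generic consistency-learning sketch for hashing, assuming two augmented views per image; CIMON's actual similarity mining and loss terms differ:

```python
# Disturb-invariant codes: two views of an image should hash alike, and the
# batch similarity structure should match across views. Generic sketch.
import torch
import torch.nn.functional as F

def consistency_hash_loss(encoder, view1, view2):
    h1 = torch.tanh(encoder(view1))        # relaxed binary codes in (-1, 1)
    h2 = torch.tanh(encoder(view2))
    invariance = F.mse_loss(h1, h2)        # same image -> same code
    s1 = h1 @ h1.t() / h1.shape[1]         # code-similarity matrices
    s2 = h2 @ h2.t() / h2.shape[1]
    consistency = F.mse_loss(s1, s2)       # similarity structure agrees
    return invariance + consistency

loss = consistency_hash_loss(torch.nn.Linear(128, 32),
                             torch.rand(16, 128), torch.rand(16, 128))
```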
- LoCo: Local Contrastive Representation Learning [93.98029899866866]
We show that by overlapping local blocks stacked on top of each other, we effectively increase the decoder depth and allow upper blocks to implicitly send feedback to lower blocks.
This simple design closes the performance gap between local learning and end-to-end contrastive learning algorithms for the first time.
arXiv Detail & Related papers (2020-08-04T05:41:29Z)
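A hedged sketch of the overlap trick: local module i trains stages i and i+1 on a detached input, so each stage also receives gradient from the module below it. Details differ from LoCo's contrastive setting; the toy stages, heads, and loss here are hypothetical:

```python
# Overlapped local learning: no end-to-end backprop, but stage i+1 is shared
# between modules i and i+1, implicitly passing feedback downward.
import torch
import torch.nn as nn

stages = nn.ModuleList([nn.Sequential(nn.Linear(64, 64), nn.ReLU()) for _ in range(4)])
heads = nn.ModuleList([nn.Linear(64, 10) for _ in range(3)])
criterion = nn.CrossEntropyLoss()

def loco_step(x, y):
    total = 0.0
    for i in range(len(stages) - 1):
        h = stages[i](x)
        h2 = stages[i + 1](h)   # overlap: recomputed with grad in module i+1
        total = total + criterion(heads[i](h2), y)
        x = h.detach()          # block gradient across module boundaries
    return total

loss = loco_step(torch.rand(8, 64), torch.randint(0, 10, (8,)))
loss.backward()
```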
- Selective Inference for Latent Block Models [50.83356836818667]
This study provides a selective inference method for latent block models.
We construct a statistical test on a set of row and column cluster memberships of a latent block model.
The proposed exact and approximate tests work effectively, compared to the naive test that does not take the selective bias into account.
arXiv Detail & Related papers (2020-05-27T10:44:19Z)