ShallowBlocker: Improving Set Similarity Joins for Blocking
- URL: http://arxiv.org/abs/2312.15835v1
- Date: Tue, 26 Dec 2023 00:31:43 GMT
- Title: ShallowBlocker: Improving Set Similarity Joins for Blocking
- Authors: Nils Barlaug
- Abstract summary: We propose a hands-off blocking method based on classical string similarity measures: ShallowBlocker.
It uses a novel hybrid set similarity join that combines absolute similarity, relative similarity, and local cardinality conditions with a new, effective pre-candidate filter that replaces the size filter.
We show that the method achieves state-of-the-art pair effectiveness on both unsupervised and supervised blocking in a scalable way.
- Score: 1.8492669447784602
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Blocking is a crucial step in large-scale entity matching but often requires
significant manual engineering from an expert for each new dataset. Recent work
has shown that deep learning is state-of-the-art and has great potential for
achieving hands-off and accurate blocking compared to classical methods.
However, in practice, such deep learning methods are often unstable, offer
little interpretability, and require hyperparameter tuning and significant
computational resources.
In this paper, we propose a hands-off blocking method based on classical
string similarity measures: ShallowBlocker. It uses a novel hybrid set
similarity join combining absolute similarity, relative similarity, and local
cardinality conditions with a new, effective pre-candidate filter that replaces
the size filter. We show that the method achieves state-of-the-art pair effectiveness on
both unsupervised and supervised blocking in a scalable way.
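For intuition, below is a minimal, hedged Python sketch of a blocking pass that combines an absolute overlap condition, a relative (Jaccard) condition, and a per-record candidate cap. The thresholds `tau_abs`, `tau_rel`, and the cap `k` are illustrative stand-ins; the paper's actual join conditions and its pre-candidate filter are more refined, and a real implementation would index tokens rather than compare all pairs.

```python
# Illustrative sketch only: ShallowBlocker's actual join and pre-candidate
# filter are more refined, and a real implementation indexes tokens rather
# than comparing all pairs.
from itertools import combinations

def tokens(record: str) -> frozenset:
    return frozenset(record.lower().split())

def jaccard(a: frozenset, b: frozenset) -> float:
    return len(a & b) / len(a | b)

def hybrid_blocking(records, tau_abs=2, tau_rel=0.4, k=5):
    """Candidate pairs must share >= tau_abs tokens (absolute condition)
    or reach Jaccard >= tau_rel (relative condition); each record keeps
    at most k partners (a stand-in for a local cardinality condition)."""
    sets = [tokens(r) for r in records]
    per_record = {i: [] for i in range(len(records))}
    for i, j in combinations(range(len(records)), 2):
        overlap = len(sets[i] & sets[j])
        sim = jaccard(sets[i], sets[j])
        if overlap >= tau_abs or sim >= tau_rel:
            per_record[i].append((sim, j))
            per_record[j].append((sim, i))
    pairs = set()
    for i, cands in per_record.items():
        for _, j in sorted(cands, reverse=True)[:k]:  # local top-k cap
            pairs.add((min(i, j), max(i, j)))
    return pairs

print(hybrid_blocking(["apple iphone 12 64gb", "iphone 12 apple 64 gb", "galaxy s21"]))
```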
Related papers
- Towards Universal Dense Blocking for Entity Resolution [49.06313308481536]
We propose UniBlocker, a dense blocker that is pre-trained on a domain-independent, easily-obtainable corpus.
By conducting domain-independent pre-training, UniBlocker can be adapted to various downstream blocking scenarios without requiring domain-specific fine-tuning.
Our experiments show that the proposed UniBlocker, without any domain-specific learning, significantly outperforms previous self- and unsupervised dense blocking methods.
arXiv Detail & Related papers (2024-04-23T08:39:29Z)
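As a rough illustration of dense blocking in general (not UniBlocker's architecture), records can be embedded with any off-the-shelf encoder and nearest neighbors taken as candidate pairs; the model name below is an arbitrary stand-in, not the UniBlocker checkpoint.

```python
# Generic dense-blocking sketch; the encoder is an off-the-shelf stand-in,
# not the UniBlocker model.
from sentence_transformers import SentenceTransformer
from sklearn.neighbors import NearestNeighbors

records = ["apple iphone 12 64gb", "iphone 12, 64 gb, apple", "samsung galaxy s21"]
emb = SentenceTransformer("all-MiniLM-L6-v2").encode(records)

nbrs = NearestNeighbors(n_neighbors=2, metric="cosine").fit(emb)
_, idx = nbrs.kneighbors(emb)
candidates = {(min(i, j), max(i, j)) for i, row in enumerate(idx) for j in row if i != j}
print(candidates)
```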
- Block Sparse Bayesian Learning: A Diversified Scheme [16.61484758008309]
We introduce a novel prior called Diversified Block Sparse Prior to characterize the widespread block sparsity phenomenon in real-world data.
By allowing diversification on intra-block variance and inter-block correlation matrices, we effectively address the sensitivity issue of existing block sparse learning methods to pre-defined block information.
arXiv Detail & Related papers (2024-02-07T08:18:06Z)
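For context, classical block sparse Bayesian learning places a zero-mean Gaussian prior on each block; a sketch of that standard form (which the diversified prior generalizes by allowing richer intra-block variance and inter-block correlation) is:

```latex
% Standard block-sparse prior; x is partitioned into g blocks x_1, ..., x_g.
% gamma_i controls block-level sparsity (gamma_i = 0 prunes block i) and
% B_i models intra-block correlation.
p(x_i \mid \gamma_i, B_i) = \mathcal{N}\!\left(x_i;\, \mathbf{0},\, \gamma_i B_i\right),
\qquad i = 1, \dots, g
```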
- Approach of variable clustering and compression for learning large Bayesian networks [0.0]
This paper describes a new approach for learning structures of large Bayesian networks based on blocks resulting from feature space clustering.
The advantage of the approach is evaluated in terms of speed of work as well as the accuracy of the found structures.
arXiv Detail & Related papers (2022-08-29T13:55:32Z)
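A minimal sketch of the block idea, assuming a simple correlation-based clustering (the paper's actual clustering and compression scheme may differ):

```python
# Hedged sketch: group variables into blocks via correlation clustering,
# then learn structure per block instead of over all variables at once.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.random((500, 40))                       # toy data: 40 variables
corr = np.abs(np.corrcoef(X, rowvar=False))
dist = 1.0 - corr[np.triu_indices(40, k=1)]     # condensed distance from |corr|
blocks = fcluster(linkage(dist, method="average"), t=5, criterion="maxclust")

# A structure learner (e.g. hill climbing) would now run inside each block,
# searching far fewer candidate edges than over the full variable set.
for b in np.unique(blocks):
    print("block", b, "-> variables", np.where(blocks == b)[0])
```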
- Block shuffling learning for Deepfake Detection [9.180904212520355]
Deepfake detection methods based on convolutional neural networks (CNN) have demonstrated high accuracy.
These methods often suffer from decreased performance when faced with unknown forgery methods and common transformations.
We propose a novel block shuffling regularization method to address this issue.
arXiv Detail & Related papers (2022-02-06T17:16:46Z)
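A hedged sketch of the block-shuffling idea as an augmentation-style regularizer (a generic reading, not the paper's exact loss):

```python
# Split a CHW image into a grid of blocks and permute them; training the
# detector to behave consistently under the shuffle discourages reliance
# on global layout. Generic sketch, not the paper's exact regularizer.
import torch

def shuffle_blocks(img: torch.Tensor, grid: int = 4) -> torch.Tensor:
    c, h, w = img.shape
    bh, bw = h // grid, w // grid
    blocks = (img[:, :bh * grid, :bw * grid]
              .unfold(1, bh, bh).unfold(2, bw, bw)   # C x grid x grid x bh x bw
              .reshape(c, grid * grid, bh, bw))
    perm = torch.randperm(grid * grid)
    return (blocks[:, perm]
            .reshape(c, grid, grid, bh, bw)
            .permute(0, 1, 3, 2, 4)                  # reassemble rows/cols
            .reshape(c, grid * bh, grid * bw))

x = torch.rand(3, 224, 224)
x_aug = shuffle_blocks(x)   # train on both x and x_aug with the same label
```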
- Recall@k Surrogate Loss with Large Batches and Similarity Mixup [62.67458021725227]
Direct optimization, by gradient descent, of an evaluation metric is not possible when it is non-differentiable.
In this work, a differentiable surrogate loss for the recall is proposed.
The proposed method achieves state-of-the-art results in several image retrieval benchmarks.
arXiv Detail & Related papers (2021-08-25T11:09:11Z)
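The core trick is to replace the hard ranking indicator with a sigmoid so gradients flow; a hedged sketch (the paper adds refinements such as similarity mixup and very large batches):

```python
# Smooth recall@k: count how many items outrank each item with sigmoids
# instead of hard comparisons. Sketch only; temperatures are illustrative.
import torch

def smooth_recall_at_k(sim, positive_mask, k, temperature=0.05):
    """sim: (Q, N) query-gallery similarities; positive_mask: (Q, N) bool."""
    diff = sim.unsqueeze(2) - sim.unsqueeze(1)            # [q, j, i] = sim_j - sim_i
    soft_rank = torch.sigmoid(diff / temperature).sum(1)  # soft count of items above i
    in_top_k = torch.sigmoid((k - soft_rank) / temperature)
    recall = (in_top_k * positive_mask).sum(1) / positive_mask.sum(1).clamp(min=1)
    return recall.mean()        # maximize this, i.e. minimize 1 - recall

sim = torch.rand(4, 32, requires_grad=True)
loss = 1 - smooth_recall_at_k(sim, torch.rand(4, 32) > 0.8, k=5)
loss.backward()
```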
- Manifold Regularized Dynamic Network Pruning [102.24146031250034]
This paper proposes a new paradigm that dynamically removes redundant filters by embedding the manifold information of all instances into the space of pruned networks.
The effectiveness of the proposed method is verified on several benchmarks, which shows better performance in terms of both accuracy and computational cost.
arXiv Detail & Related papers (2021-03-10T03:59:03Z)
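A generic sketch of instance-wise dynamic filter gating, the family this method belongs to; the gating head below is hypothetical, and the paper's manifold-regularized saliency is more sophisticated:

```python
# Per-input channel gating: a small head scores filters for each instance
# and low-scoring filters are zeroed out. (Training the hard gate would
# need a straight-through estimator; omitted for brevity.)
import torch
import torch.nn as nn

class DynamicPrunedConv(nn.Module):
    def __init__(self, cin, cout, keep_ratio=0.5):
        super().__init__()
        self.conv = nn.Conv2d(cin, cout, 3, padding=1)
        self.saliency = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                      nn.Linear(cin, cout))
        self.k = max(1, int(cout * keep_ratio))

    def forward(self, x):
        s = self.saliency(x)                           # per-instance filter scores
        thresh = s.topk(self.k, dim=1).values[:, -1:]  # keep the top-k filters
        gate = (s >= thresh).float().unsqueeze(-1).unsqueeze(-1)
        return self.conv(x) * gate                     # zero out pruned filters

y = DynamicPrunedConv(3, 16)(torch.rand(2, 3, 32, 32))
print(y.shape)  # torch.Size([2, 16, 32, 32])
```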
- Fast Network Community Detection with Profile-Pseudo Likelihood Methods [19.639557431997037]
Most algorithms for fitting the block model likelihood function cannot scale to large-scale networks.
We propose a novel likelihood approach that decouples row and column labels in the likelihood function.
We show that our method provides strongly consistent estimates of the communities in a block model.
arXiv Detail & Related papers (2020-11-01T23:40:26Z)
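A hedged sketch of the decoupling idea for a stochastic block model: row labels are updated while column labels stay fixed, then the roles are swapped. The paper's profile step and consistency guarantees go well beyond this:

```python
# One pseudo-likelihood-style pass: rows are reassigned independently given
# fixed column labels, which makes the update embarrassingly parallel.
# Sketch only; assumes every community is currently non-empty.
import numpy as np

def update_row_labels(A, row_z, col_z, K):
    counts = np.stack([A[:, col_z == k].sum(1) for k in range(K)], axis=1)  # (n, K)
    sizes = np.array([(col_z == k).sum() for k in range(K)])
    rates = np.vstack([counts[row_z == a].mean(0) for a in range(K)])       # (K, K)
    rates = np.clip(rates / np.maximum(sizes, 1), 1e-6, 1 - 1e-6)
    # per-row log-likelihood of its block edge counts under each community
    loglik = counts @ np.log(rates).T + (sizes - counts) @ np.log1p(-rates).T
    return loglik.argmax(1)   # best community per row, rows decoupled
```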
- CIMON: Towards High-quality Hash Codes [63.37321228830102]
We propose a new method named Comprehensive sImilarity Mining and cOnsistency learNing (CIMON).
First, we use global refinement and similarity statistical distribution to obtain reliable and smooth guidance. Second, both semantic and contrastive consistency learning are introduced to derive both disturb-invariant and discriminative hash codes.
arXiv Detail & Related papers (2020-10-15T14:47:14Z)
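A generic consistency-learning sketch for hashing, assuming two augmented views per image; CIMON's actual similarity mining and loss terms differ:

```python
# Disturb-invariant codes: two views of an image should hash alike, and the
# batch similarity structure should match across views. Generic sketch.
import torch
import torch.nn.functional as F

def consistency_hash_loss(encoder, view1, view2):
    h1 = torch.tanh(encoder(view1))        # relaxed binary codes in (-1, 1)
    h2 = torch.tanh(encoder(view2))
    invariance = F.mse_loss(h1, h2)        # same image -> same code
    s1 = h1 @ h1.t() / h1.shape[1]         # code-similarity matrices
    s2 = h2 @ h2.t() / h2.shape[1]
    consistency = F.mse_loss(s1, s2)       # similarity structure agrees
    return invariance + consistency

loss = consistency_hash_loss(torch.nn.Linear(128, 32),
                             torch.rand(16, 128), torch.rand(16, 128))
```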
- LoCo: Local Contrastive Representation Learning [93.98029899866866]
We show that by overlapping local blocks stacked on top of each other, we effectively increase the decoder depth and allow upper blocks to implicitly send feedback to lower blocks.
This simple design closes the performance gap between local learning and end-to-end contrastive learning algorithms for the first time.
arXiv Detail & Related papers (2020-08-04T05:41:29Z)
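A hedged sketch of the overlap trick: local module i trains stages i and i+1 on a detached input, so each stage also receives gradient from the module below it. Details differ from LoCo's contrastive setting; the toy stages, heads, and loss here are hypothetical:

```python
# Overlapped local learning: no end-to-end backprop, but stage i+1 is shared
# between modules i and i+1, implicitly passing feedback downward.
import torch
import torch.nn as nn

stages = nn.ModuleList([nn.Sequential(nn.Linear(64, 64), nn.ReLU()) for _ in range(4)])
heads = nn.ModuleList([nn.Linear(64, 10) for _ in range(3)])
criterion = nn.CrossEntropyLoss()

def loco_step(x, y):
    total = 0.0
    for i in range(len(stages) - 1):
        h = stages[i](x)
        h2 = stages[i + 1](h)   # overlap: recomputed with grad in module i+1
        total = total + criterion(heads[i](h2), y)
        x = h.detach()          # block gradient across module boundaries
    return total

loss = loco_step(torch.rand(8, 64), torch.randint(0, 10, (8,)))
loss.backward()
```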
- Selective Inference for Latent Block Models [50.83356836818667]
This study provides a selective inference method for latent block models.
We construct a statistical test on a set of row and column cluster memberships of a latent block model.
The proposed exact and approximate tests work effectively, compared to the naive test that does not take the selective bias into account.
arXiv Detail & Related papers (2020-05-27T10:44:19Z)