SC-Block: Supervised Contrastive Blocking within Entity Resolution Pipelines
- URL: http://arxiv.org/abs/2303.03132v2
- Date: Fri, 23 Jun 2023 12:31:35 GMT
- Title: SC-Block: Supervised Contrastive Blocking within Entity Resolution Pipelines
- Authors: Alexander Brinkmann, Roee Shraga, Christian Bizer
- Abstract summary: This paper presents SC-Block, a blocking method that utilizes supervised contrastive learning for positioning records in the embedding space.
We benchmark SC-Block against eight state-of-the-art blocking methods.
For measuring the overall runtime, we determine candidate sets with 99.5% pair completeness and pass them to the matcher.
- Score: 75.5113002732746
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The goal of entity resolution is to identify records in multiple datasets
that represent the same real-world entity. However, comparing all records
across datasets can be computationally intensive, leading to long runtimes. To
reduce these runtimes, entity resolution pipelines are split into two parts: a
blocker that applies a computationally cheap method to select candidate record
pairs, and a matcher that afterwards identifies matching pairs
from this set using more expensive methods. This paper presents SC-Block, a
blocking method that utilizes supervised contrastive learning for positioning
records in the embedding space, and nearest neighbour search for candidate set
building. We benchmark SC-Block against eight state-of-the-art blocking
methods. In order to relate the training time of SC-Block to the reduction of
the overall runtime of the entity resolution pipeline, we combine SC-Block with
four matching methods into complete pipelines. For measuring the overall
runtime, we determine candidate sets with 99.5% pair completeness and pass them
to the matcher. The results show that SC-Block is able to create smaller
candidate sets and pipelines with SC-Block execute 1.5 to 2 times faster
compared to pipelines with other blockers, without sacrificing F1 score.
Blockers are often evaluated on relatively small datasets, which risks
overlooking runtime effects caused by large vocabulary sizes. In
order to measure runtimes in a more challenging setting, we introduce a new
benchmark dataset that requires large numbers of product offers to be blocked.
On this large-scale benchmark dataset, pipelines utilizing SC-Block and the
best-performing matcher execute 8 times faster than pipelines utilizing another
blocker with the same matcher reducing the runtime from 2.5 hours to 18
minutes, clearly compensating for the 5 minutes required for training SC-Block.
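The abstract describes blocking as two steps, positioning records in an embedding space and then retrieving candidate pairs via nearest neighbour search, with candidate sets tuned to a pair-completeness target. A minimal sketch of those two steps follows; the function names are illustrative, the embeddings are toy 2-d vectors rather than the output of a supervised-contrastively trained encoder, and the brute-force cosine search stands in for a proper ANN index:

```python
import numpy as np

def nearest_neighbour_candidates(query_emb, index_emb, k):
    """For each query record, return the indices of its k nearest index
    records by cosine similarity (brute force, for illustration only)."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    x = index_emb / np.linalg.norm(index_emb, axis=1, keepdims=True)
    sims = q @ x.T                       # cosine similarity matrix
    return np.argsort(-sims, axis=1)[:, :k]

def pair_completeness(candidates, true_matches):
    """Fraction of true matching pairs retained in the candidate set."""
    found = sum(1 for q, i in true_matches if i in candidates[q])
    return found / len(true_matches)

# Toy example: 4 well-separated "embeddings" and slightly perturbed copies,
# standing in for two datasets describing the same 4 entities.
queries = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
index = queries + 0.01
cands = nearest_neighbour_candidates(queries, index, k=1)
pc = pair_completeness(cands, [(i, i) for i in range(4)])
```

In practice k would be increased until the candidate set reaches the desired pair completeness (99.5% in the paper's benchmark setup), trading a larger matcher workload for fewer missed matches.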
Related papers
- FastGAS: Fast Graph-based Annotation Selection for In-Context Learning [53.17606395275021]
In-context learning (ICL) empowers large language models (LLMs) to tackle new tasks by using a series of training instances as prompts.
Existing methods have proposed to select a subset of unlabeled examples for annotation.
We propose a graph-based selection method, FastGAS, designed to efficiently identify high-quality instances.
arXiv Detail & Related papers (2024-06-06T04:05:54Z)
- Parsimonious Optimal Dynamic Partial Order Reduction [1.5029560229270196]
We present the Parsimonious-OPtimal (POP) DPOR algorithm for analyzing multi-threaded programs under sequential consistency.
POP combines several novel techniques, including (i) a parsimonious race reversal strategy, which avoids multiple reversals of the same race, and (ii) an eager race reversal strategy to avoid storing initial fragments of to-be-explored executions.
Our implementation in Nidhugg shows that these techniques can significantly speed up the analysis of concurrent programs, and do so with low memory consumption.
arXiv Detail & Related papers (2024-05-18T00:07:26Z)
- ShallowBlocker: Improving Set Similarity Joins for Blocking [1.8492669447784602]
We propose a hands-off blocking method based on classical string similarity measures: ShallowBlocker.
It uses a novel hybrid set similarity join that combines absolute similarity, relative similarity, and local cardinality conditions with a new, effective pre-candidate filter that replaces the size filter.
We show that the method achieves state-of-the-art pair effectiveness on both unsupervised and supervised blocking in a scalable way.
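The ShallowBlocker summary above builds on set similarity joins with filters that prune pairs before the similarity is computed. A minimal sketch of that general idea, a Jaccard join with the classic size filter, follows; this is not the ShallowBlocker algorithm itself, and the record token sets are invented for illustration:

```python
def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def size_filter_join(records_a, records_b, threshold):
    """Naive set-similarity join. The size filter skips pairs whose token-set
    sizes alone make the Jaccard threshold unreachable, since
    Jaccard(a, b) <= min(|a|, |b|) / max(|a|, |b|)."""
    candidates = []
    for i, a in enumerate(records_a):
        for j, b in enumerate(records_b):
            if min(len(a), len(b)) < threshold * max(len(a), len(b)):
                continue  # pruned without computing the intersection
            if jaccard(a, b) >= threshold:
                candidates.append((i, j))
    return candidates

# Toy product offers as token sets.
offers_a = [{"apple", "iphone", "12"}, {"galaxy", "s21"}]
offers_b = [{"apple", "iphone", "12", "pro"}, {"pixel", "6"}]
pairs = size_filter_join(offers_a, offers_b, threshold=0.5)
```

Real implementations additionally use prefix or positional filters and inverted indexes to avoid the quadratic pair enumeration shown here.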
arXiv Detail & Related papers (2023-12-26T00:31:43Z)
- Divide&Classify: Fine-Grained Classification for City-Wide Visual Place Recognition [21.039399444257807]
Divide&Classify (D&C) enjoys the fast inference of classification solutions and an accuracy competitive with retrieval methods on the fine-grained, city-wide setting.
We show that D&C can be paired with existing retrieval pipelines to speed up computations by over 20 times while increasing their recall.
arXiv Detail & Related papers (2023-07-17T11:57:04Z)
- Pipe-BD: Pipelined Parallel Blockwise Distillation [7.367308544773381]
We propose Pipe-BD, a novel parallelization method for blockwise distillation.
Pipe-BD aggressively utilizes pipeline parallelism for blockwise distillation.
We implement Pipe-BD on PyTorch, and experiments reveal that Pipe-BD is effective on multiple scenarios, models, and datasets.
arXiv Detail & Related papers (2023-01-29T13:38:43Z)
- Streaming End-to-End ASR based on Blockwise Non-Autoregressive Models [57.20432226304683]
Non-autoregressive (NAR) modeling has gained more and more attention in speech processing.
We propose a novel end-to-end streaming NAR speech recognition system.
We show that the proposed method improves online ASR recognition in low latency conditions.
arXiv Detail & Related papers (2021-07-20T11:42:26Z)
- Beta-CROWN: Efficient Bound Propagation with Per-neuron Split Constraints for Complete and Incomplete Neural Network Verification [151.62491805851107]
We develop β-CROWN, a bound propagation based verifier that can fully encode per-neuron splits.
β-CROWN is close to three orders of magnitude faster than LP-based BaB methods for robustness verification.
By terminating BaB early, our method can also be used for incomplete verification.
arXiv Detail & Related papers (2021-03-11T11:56:54Z)
- AugSplicing: Synchronized Behavior Detection in Streaming Tensors [38.90084196554039]
We propose a fast streaming algorithm, AugSplicing, which can detect dense blocks by splicing previous detections with incoming ones.
Compared to state-of-the-art methods, our method is effective at detecting fraudulent behavior in app-installation data from real-world apps and at finding a group of students with interesting features in campus Wi-Fi data.
arXiv Detail & Related papers (2020-12-03T15:39:58Z)
- Distillation Guided Residual Learning for Binary Convolutional Neural Networks [83.6169936912264]
It is challenging to bridge the performance gap between Binary CNN (BCNN) and Floating point CNN (FCNN).
We observe that, this performance gap leads to substantial residuals between intermediate feature maps of BCNN and FCNN.
To minimize the performance gap, we enforce BCNN to produce similar intermediate feature maps with the ones of FCNN.
This training strategy, i.e., optimizing each binary convolutional block with a block-wise distillation loss derived from FCNN, leads to a more effective optimization of BCNN.
arXiv Detail & Related papers (2020-07-10T07:55:39Z)
- Distributed Optimization over Block-Cyclic Data [48.317899174302305]
We consider practical data characteristics underlying federated learning, where unbalanced and non-i.i.d. data from clients have a block-cyclic structure.
We propose two new distributed optimization algorithms called multi-model parallel SGD (MM-PSGD) and multi-chain parallel SGD (MC-PSGD).
Our algorithms significantly outperform the conventional federated averaging algorithm in terms of test accuracy, and also preserve robustness for the variance of critical parameters.
arXiv Detail & Related papers (2020-02-18T09:47:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.