SC-Block: Supervised Contrastive Blocking within Entity Resolution
Pipelines
- URL: http://arxiv.org/abs/2303.03132v2
- Date: Fri, 23 Jun 2023 12:31:35 GMT
- Title: SC-Block: Supervised Contrastive Blocking within Entity Resolution
Pipelines
- Authors: Alexander Brinkmann, Roee Shraga, Christian Bizer
- Abstract summary: This paper presents SC-Block, a blocking method that utilizes supervised contrastive learning for positioning records in the embedding space.
We benchmark SC-Block against eight state-of-the-art blocking methods.
For measuring the overall runtime, we determine candidate sets with 99.5% pair completeness and pass them to the matcher.
- Score: 75.5113002732746
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The goal of entity resolution is to identify records in multiple datasets
that represent the same real-world entity. However, comparing all records
across datasets can be computationally intensive, leading to long runtimes. To
reduce these runtimes, entity resolution pipelines consist of two parts: a blocker that applies a computationally cheap method to select
candidate record pairs, and a matcher that afterwards identifies matching pairs
from this set using more expensive methods. This paper presents SC-Block, a
blocking method that utilizes supervised contrastive learning for positioning
records in the embedding space, and nearest neighbour search for candidate set
building. We benchmark SC-Block against eight state-of-the-art blocking
methods. In order to relate the training time of SC-Block to the reduction of
the overall runtime of the entity resolution pipeline, we combine SC-Block with
four matching methods into complete pipelines. For measuring the overall
runtime, we determine candidate sets with 99.5% pair completeness and pass them
to the matcher. The results show that SC-Block creates smaller candidate sets, and pipelines with SC-Block execute 1.5 to 2 times faster than pipelines with other blockers, without sacrificing F1 score.
Blockers are often evaluated on relatively small datasets, which can cause runtime effects resulting from large vocabulary sizes to be overlooked. In
order to measure runtimes in a more challenging setting, we introduce a new
benchmark dataset that requires large numbers of product offers to be blocked.
On this large-scale benchmark dataset, pipelines utilizing SC-Block and the
best-performing matcher execute 8 times faster than pipelines utilizing another blocker with the same matcher, reducing the runtime from 2.5 hours to 18 minutes and clearly compensating for the 5 minutes required to train SC-Block.
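To make the pipeline structure concrete, here is a minimal sketch of an SC-Block-style blocker and its evaluation. This is not the authors' implementation: the loss follows the standard supervised contrastive formulation, the nearest-neighbour search uses scikit-learn in place of whatever index the paper uses, and all function names, the choice of k, and the record encoder are illustrative assumptions. Pair completeness is the share of true matching pairs that survive blocking.

import torch
import torch.nn.functional as F
from sklearn.neighbors import NearestNeighbors

def supervised_contrastive_loss(embeddings, labels, temperature=0.07):
    # Standard supervised contrastive loss: records with the same label
    # (same real-world entity) are pulled together, all others pushed apart.
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.T / temperature                               # pairwise similarities
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    sim = sim.masked_fill(self_mask, float("-inf"))           # exclude self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)
    per_anchor = -(log_prob.masked_fill(~pos_mask, 0.0)).sum(dim=1) / pos_counts
    return per_anchor[pos_mask.any(dim=1)].mean()             # skip anchors without positives

def build_candidate_set(query_embeddings, index_embeddings, k=10):
    # Blocking step: pair every query record with its k nearest neighbours
    # in the embedding space of the other dataset.
    nn = NearestNeighbors(n_neighbors=k, metric="cosine").fit(index_embeddings)
    _, neighbours = nn.kneighbors(query_embeddings)
    return {(q, int(i)) for q, row in enumerate(neighbours) for i in row}

def pair_completeness(candidate_set, true_matches):
    # Share of true matching pairs that survive blocking.
    return len(candidate_set & true_matches) / len(true_matches)

In the full pipeline, the resulting candidate pairs are then scored by a matcher (for example a fine-tuned transformer cross-encoder) to produce the final matches, with k chosen just large enough to reach the 99.5% pair-completeness target used for the runtime comparison.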
Related papers
- Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss [59.835032408496545]
We propose a tile-based strategy that partitions the contrastive loss calculation into arbitrary small blocks.
We also introduce a multi-level tiling strategy to leverage the hierarchical structure of distributed systems.
Compared to SOTA memory-efficient solutions, it achieves a two-order-of-magnitude reduction in memory while maintaining comparable speed.
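The exact tiling scheme is not spelled out in this summary, so the following is only an illustrative, single-device sketch of the general idea: an InfoNCE-style contrastive loss computed over row blocks and column tiles of the similarity matrix with a streaming log-sum-exp, so the full N x N matrix is never held in memory at once. The multi-level, distributed tiling and the further memory savings described in the paper are not reproduced here.

import torch
import torch.nn.functional as F

def tiled_infonce_loss(za, zb, temperature=0.07, tile=1024):
    # za, zb: (N, d) paired embeddings; row i of za and row i of zb form a
    # positive pair, every other row of zb acts as a negative for za[i].
    # Rows are processed in blocks and, within each row block, the
    # log-sum-exp over all columns is accumulated one column tile at a time.
    za, zb = F.normalize(za, dim=1), F.normalize(zb, dim=1)
    n = za.size(0)
    losses = []
    for r in range(0, n, tile):
        rows = za[r:r + tile]                                   # (B, d)
        lse = torch.full((rows.size(0),), float("-inf"), device=za.device)
        for c in range(0, n, tile):
            sim = rows @ zb[c:c + tile].T / temperature         # (B, C) tile only
            lse = torch.logaddexp(lse, torch.logsumexp(sim, dim=1))
        pos = (rows * zb[r:r + tile]).sum(dim=1) / temperature  # positive logits
        losses.append(lse - pos)                                # -log softmax of positives
    return torch.cat(losses).mean()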
arXiv Detail & Related papers (2024-10-22T17:59:30Z)
- Block-Attention for Efficient RAG [3.926246435703829]
Block-Attention addresses the increased inference latency and cost in Retrieval-Augmented Generation (RAG) scenarios.
By defining each passage as a block, Block-Attention enables us to reuse the KV states of passages that have been seen before.
Experiments on four RAG benchmarks demonstrate that after block fine-tuning, the Block-Attention model achieves performance comparable to self-attention models.
arXiv Detail & Related papers (2024-09-14T02:34:26Z)
- FastGAS: Fast Graph-based Annotation Selection for In-Context Learning [53.17606395275021]
In-context learning (ICL) empowers large language models (LLMs) to tackle new tasks by using a series of training instances as prompts.
Existing methods have proposed to select a subset of unlabeled examples for annotation.
We propose a graph-based selection method, FastGAS, designed to efficiently identify high-quality instances.
arXiv Detail & Related papers (2024-06-06T04:05:54Z)
- Pipeline Parallelism with Controllable Memory [6.135123843073223]
We show that almost all existing pipeline schedules are memory inefficient.
We introduce a family of memory efficient building blocks with controllable activation memory.
We can achieve almost zero pipeline bubbles while maintaining the same activation memory as 1F1B.
arXiv Detail & Related papers (2024-05-24T08:54:36Z)
- Parsimonious Optimal Dynamic Partial Order Reduction [1.5029560229270196]
We present Parsimonious-OPtimal DPOR (POP), an optimal DPOR algorithm for analyzing multi-threaded programs under sequential consistency.
POP combines several novel algorithmic techniques, including (i) a parsimonious race reversal strategy, which avoids multiple reversals of the same race.
Our implementation in Nidhugg shows that these techniques can significantly speed up the analysis of concurrent programs, and do so with low memory consumption.
arXiv Detail & Related papers (2024-05-18T00:07:26Z)
- ShallowBlocker: Improving Set Similarity Joins for Blocking [1.8492669447784602]
We propose a hands-off blocking method based on classical string similarity measures: ShallowBlocker.
It uses a novel hybrid set similarity join that combines absolute similarity, relative similarity, and local cardinality conditions with a new, effective pre-candidate filter that replaces the size filter.
We show that the method achieves state-of-the-art pair effectiveness on both unsupervised and supervised blocking in a scalable way.
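ShallowBlocker's hybrid join conditions (absolute similarity, relative similarity, local cardinality) and its new pre-candidate filter cannot be reconstructed from this summary. The sketch below shows only the classical baseline such methods refine: a token-based set similarity join with an inverted-index pre-candidate filter and a Jaccard threshold; scalable implementations additionally prune comparisons with prefix- or size-based filters.

from collections import defaultdict

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a | b) else 0.0

def similarity_join_blocking(records_a, records_b, threshold=0.5):
    # records_a / records_b: dict record_id -> set of tokens.
    # Inverted-index filter: only pairs sharing at least one token are
    # compared; pairs at or above the Jaccard threshold become candidates.
    index = defaultdict(set)
    for rid, tokens in records_b.items():
        for token in tokens:
            index[token].add(rid)
    candidates = set()
    for qid, tokens in records_a.items():
        pre_candidates = set()
        for token in tokens:
            pre_candidates |= index[token]
        for rid in pre_candidates:
            if jaccard(tokens, records_b[rid]) >= threshold:
                candidates.add((qid, rid))
    return candidates

# Example: Jaccard({"acme","phone","x1"}, {"acme","x1","smartphone"}) = 2/4 = 0.5
# similarity_join_blocking({0: {"acme","phone","x1"}}, {7: {"acme","x1","smartphone"}}, 0.4)
# -> {(0, 7)}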
arXiv Detail & Related papers (2023-12-26T00:31:43Z)
- Pipe-BD: Pipelined Parallel Blockwise Distillation [7.367308544773381]
We propose Pipe-BD, a novel parallelization method for blockwise distillation.
Pipe-BD aggressively utilizes pipeline parallelism for blockwise distillation.
We implement Pipe-BD on PyTorch, and experiments reveal that Pipe-BD is effective on multiple scenarios, models, and datasets.
arXiv Detail & Related papers (2023-01-29T13:38:43Z)
- Streaming End-to-End ASR based on Blockwise Non-Autoregressive Models [57.20432226304683]
Non-autoregressive (NAR) modeling has gained more and more attention in speech processing.
We propose a novel end-to-end streaming NAR speech recognition system.
We show that the proposed method improves online ASR recognition in low latency conditions.
arXiv Detail & Related papers (2021-07-20T11:42:26Z)
- Beta-CROWN: Efficient Bound Propagation with Per-neuron Split Constraints for Complete and Incomplete Neural Network Verification [151.62491805851107]
We develop $\beta$-CROWN, a bound propagation based verifier that can fully encode per-neuron splits.
$\beta$-CROWN is close to three orders of magnitude faster than LP-based BaB methods for robustness verification.
By terminating BaB early, our method can also be used for incomplete verification.
arXiv Detail & Related papers (2021-03-11T11:56:54Z)
- Distillation Guided Residual Learning for Binary Convolutional Neural Networks [83.6169936912264]
It is challenging to bridge the performance gap between Binary CNN (BCNN) and Floating point CNN (FCNN).
We observe that this performance gap leads to substantial residuals between the intermediate feature maps of BCNN and FCNN.
To minimize the performance gap, we enforce BCNN to produce intermediate feature maps similar to those of FCNN.
This training strategy, i.e., optimizing each binary convolutional block with a block-wise distillation loss derived from FCNN, leads to a more effective optimization of BCNN.
arXiv Detail & Related papers (2020-07-10T07:55:39Z)
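The block-wise distillation loss itself is not given in the summary above; a common instantiation, assumed here purely for illustration, is a mean-squared error between the intermediate feature maps of each binary block and the corresponding frozen floating-point teacher block, with each block optimised against its own term.

import torch.nn.functional as F

def blockwise_distillation_losses(student_feats, teacher_feats):
    # student_feats / teacher_feats: lists of intermediate feature maps, one per
    # convolutional block, with matching shapes (N, C, H, W). Each binary block
    # is trained to reproduce the output of its floating-point teacher block;
    # detach() keeps the teacher frozen. Returns one loss term per block so each
    # block can be optimised separately, as the summary describes.
    return [F.mse_loss(s, t.detach()) for s, t in zip(student_feats, teacher_feats)]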
This list is automatically generated from the titles and abstracts of the papers on this site.