Towards Universal Dense Blocking for Entity Resolution
- URL: http://arxiv.org/abs/2404.14831v2
- Date: Thu, 25 Apr 2024 06:37:51 GMT
- Title: Towards Universal Dense Blocking for Entity Resolution
- Authors: Tianshu Wang, Hongyu Lin, Xianpei Han, Xiaoyang Chen, Boxi Cao, Le Sun
- Abstract summary: We propose UniBlocker, a dense blocker that is pre-trained on a domain-independent, easily-obtainable corpus.
By conducting domain-independent pre-training, UniBlocker can be adapted to various downstream blocking scenarios without requiring domain-specific fine-tuning.
Our experiments show that the proposed UniBlocker, without any domain-specific learning, significantly outperforms previous self- and unsupervised dense blocking methods.
- Score: 49.06313308481536
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Blocking is a critical step in entity resolution, and the emergence of neural network-based representation models has led to the development of dense blocking as a promising approach for exploring deep semantics in blocking. However, previous advanced self-supervised dense blocking approaches require domain-specific training on the target domain, which limits the benefits and rapid adaptation of these methods. To address this issue, we propose UniBlocker, a dense blocker that is pre-trained on a domain-independent, easily-obtainable tabular corpus using self-supervised contrastive learning. By conducting domain-independent pre-training, UniBlocker can be adapted to various downstream blocking scenarios without requiring domain-specific fine-tuning. To evaluate the universality of our entity blocker, we also construct a new benchmark covering a wide range of blocking tasks from multiple domains and scenarios. Our experiments show that the proposed UniBlocker, without any domain-specific learning, significantly outperforms previous self- and unsupervised dense blocking methods and is comparable and complementary to the state-of-the-art sparse blocking methods.
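The dense blocking paradigm the abstract describes can be sketched in a few lines: serialize each record to text, embed it into a vector, and take each record's nearest neighbors as candidate pairs. The sketch below is a toy illustration only; UniBlocker's actual encoder is a contrastively pre-trained neural model, whereas here a hashed character-trigram embedding (an assumption made purely for self-containment) stands in for the learned representation.

```python
import hashlib
import math

def embed(text, dim=256):
    # Toy stand-in for a dense encoder: hashed character-trigram counts,
    # L2-normalized so a dot product equals cosine similarity.
    v = [0.0] * dim
    t = text.lower()
    for i in range(len(t) - 2):
        h = int(hashlib.md5(t[i:i + 3].encode()).hexdigest(), 16) % dim
        v[h] += 1.0
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else v

def dense_block(left, right, k=1):
    # Candidate generation: for each left record, keep its k nearest
    # right records under cosine similarity.
    L = [embed(r) for r in left]
    R = [embed(r) for r in right]
    candidates = set()
    for i, lv in enumerate(L):
        sims = sorted(
            ((sum(a * b for a, b in zip(lv, rv)), j) for j, rv in enumerate(R)),
            reverse=True,
        )
        for _, j in sims[:k]:
            candidates.add((i, j))
    return candidates

left = ["iPhone 13 Pro 128GB", "Galaxy S21 Ultra", "Pixel 6"]
right = ["Apple iPhone 13 Pro (128 GB)", "Samsung Galaxy S21 Ultra 5G", "Google Pixel 6"]
pairs = dense_block(left, right, k=1)
```

In a real system the nearest-neighbor search would use an approximate index (e.g. an ANN library) rather than the exhaustive scan above, which is quadratic in the number of records.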
Related papers
- Swordsman: Entropy-Driven Adaptive Block Partition for Efficient Diffusion Language Models [40.39823804602205]
Swordsman is an entropy-driven adaptive block-wise decoding framework for diffusion language models.
It partitions blocks by identifying entropy shifts between adjacent tokens to better align with semantic or syntactic constituent boundaries.
As a training-free framework, Swordsman demonstrates state-of-the-art performance across extensive evaluations.
arXiv Detail & Related papers (2026-02-04T10:27:49Z) - MI-PRUN: Optimize Large Language Model Pruning via Mutual Information [73.6518842907835]
We propose a mutual information based pruning method MI-PRUN for Large Language Models.
We leverage mutual information to identify redundant blocks by evaluating transitions in hidden states.
We also develop the Fast-Block-Select algorithm, which iteratively updates block combinations to achieve a globally optimal solution.
arXiv Detail & Related papers (2026-01-12T05:06:01Z) - AdaBlock-dLLM: Semantic-Aware Diffusion LLM Inference via Adaptive Block Size [7.442463267121892]
Diffusion-based large language models (dLLMs) are gaining attention for their inherent capacity for parallel decoding.
This paper presents the first systematic investigation challenging the fixed block size assumption in semi-AR decoding.
We introduce AdaBlock-dLLM, a training-free, plug-and-play scheduler that adaptively aligns block boundaries with semantic steps by adjusting block size during runtime.
arXiv Detail & Related papers (2025-09-30T15:53:56Z) - Blockwise SFT for Diffusion Language Models: Reconciling Bidirectional Attention and Autoregressive Decoding [60.06816407728172]
Discrete diffusion language models have shown strong potential for text generation.
Standard supervised fine-tuning misaligns with semi-autoregressive inference.
We propose Blockwise SFT, which partitions responses into fixed-size blocks.
arXiv Detail & Related papers (2025-08-27T02:49:33Z) - DiffusionBlocks: Block-wise Neural Network Training via Diffusion Interpretation [11.910667302899638]
DiffusionBlocks is a principled framework for transforming transformer-based networks into genuinely independent trainable blocks.
Our experiments on a range of transformer architectures demonstrate that DiffusionBlocks training matches the performance of end-to-end training.
arXiv Detail & Related papers (2025-06-17T05:44:18Z) - Improved Block Merging for 3D Point Cloud Instance Segmentation [6.632158868486343]
The proposed work improves over the state-of-the-art by allowing wrongly labelled points of already processed blocks to be corrected through label propagation.
Our experiments show that the proposed block merging algorithm significantly and consistently improves the obtained accuracy for all evaluation metrics employed in literature.
arXiv Detail & Related papers (2024-07-09T16:06:34Z) - Block Sparse Bayesian Learning: A Diversified Scheme [16.61484758008309]
We introduce a novel prior called Diversified Block Sparse Prior to characterize the widespread block sparsity phenomenon in real-world data.
By allowing diversification on intra-block variance and inter-block correlation matrices, we effectively address the sensitivity issue of existing block sparse learning methods to pre-defined block information.
arXiv Detail & Related papers (2024-02-07T08:18:06Z) - ShallowBlocker: Improving Set Similarity Joins for Blocking [1.8492669447784602]
We propose a hands-off blocking method based on classical string similarity measures: ShallowBlocker.
It uses a novel hybrid set similarity join combining absolute similarity, relative similarity, and local cardinality conditions with a new effective pre-candidate filter replacing size filter.
We show that the method achieves state-of-the-art pair effectiveness on both unsupervised and supervised blocking in a scalable way.
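The sparse, set-similarity side of blocking that ShallowBlocker refines can be illustrated with a minimal token-overlap join. This is not the paper's actual hybrid join (its absolute/relative similarity and local cardinality conditions are more elaborate); the threshold, tokenization, and plain Jaccard measure below are illustrative assumptions, with a simple size-based pre-filter standing in for the paper's pre-candidate filter.

```python
def jaccard(a, b):
    # Token-set Jaccard similarity.
    return len(a & b) / len(a | b) if a or b else 0.0

def similarity_join(left, right, threshold=0.5):
    lt = [set(s.lower().split()) for s in left]
    rt = [set(s.lower().split()) for s in right]
    candidates = []
    for i, a in enumerate(lt):
        for j, b in enumerate(rt):
            # Size filter: J(a, b) >= t implies min(|a|, |b|) >= t * max(|a|, |b|),
            # so pairs failing that bound are skipped without computing Jaccard.
            if min(len(a), len(b)) < threshold * max(len(a), len(b)):
                continue
            if jaccard(a, b) >= threshold:
                candidates.append((i, j))
    return candidates

left = ["iphone 13 pro 128gb", "galaxy s21 ultra"]
right = ["apple iphone 13 pro 128gb", "samsung galaxy s21 ultra 5g", "pixel 6"]
pairs = similarity_join(left, right, threshold=0.5)
```

Production-grade similarity joins additionally use prefix filtering and inverted indexes so that most pairs are never enumerated at all.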
arXiv Detail & Related papers (2023-12-26T00:31:43Z) - Model Barrier: A Compact Un-Transferable Isolation Domain for Model Intellectual Property Protection [52.08301776698373]
We propose a novel approach called Compact Un-Transferable Isolation Domain (CUTI-domain)
CUTI-domain acts as a barrier to block illegal transfers from authorized to unauthorized domains.
We show that CUTI-domain can be easily implemented as a plug-and-play module with different backbones.
arXiv Detail & Related papers (2023-03-20T13:07:11Z) - Decompose to Adapt: Cross-domain Object Detection via Feature Disentanglement [79.2994130944482]
We design a Domain Disentanglement Faster-RCNN (DDF) to eliminate the source-specific information in the features for detection task learning.
Our DDF method facilitates the feature disentanglement at the global and local stages, with a Global Triplet Disentanglement (GTD) module and an Instance Similarity Disentanglement (ISD) module.
By outperforming state-of-the-art methods on four benchmark UDA object detection tasks, our DDF method is demonstrated to be effective with wide applicability.
arXiv Detail & Related papers (2022-01-06T05:43:01Z) - Generalizable Representation Learning for Mixture Domain Face Anti-Spoofing [53.82826073959756]
Face anti-spoofing based on domain generalization (DG) has drawn growing attention due to its robustness to unseen scenarios.
We propose domain dynamic adjustment meta-learning (D2AM), which works without using domain labels.
arXiv Detail & Related papers (2021-05-06T06:04:59Z) - Stochastic Block-ADMM for Training Deep Networks [16.369102155752824]
We propose Block-ADMM as an approach to train deep neural networks in batch and online settings.
Our method works by splitting neural networks into an arbitrary number of blocks and utilizing auxiliary variables to connect these blocks.
We prove the convergence of our proposed method and justify its capabilities through experiments in supervised and weakly-supervised settings.
arXiv Detail & Related papers (2021-05-01T19:56:13Z) - Decentralized Swarm Collision Avoidance for Quadrotors via End-to-End Reinforcement Learning [28.592704336574158]
We draw biological inspiration from flocks of starlings and apply the insight to end-to-end learned decentralized collision avoidance.
We propose a new, scalable observation model following a biomimetic topological interaction rule.
Our learned policies are tested in simulation and subsequently transferred to real-world drones to validate their real-world applicability.
arXiv Detail & Related papers (2021-04-30T11:19:03Z) - Attentive WaveBlock: Complementarity-enhanced Mutual Networks for Unsupervised Domain Adaptation in Person Re-identification and Beyond [97.25179345878443]
This paper proposes a novel light-weight module, the Attentive WaveBlock (AWB).
AWB can be integrated into the dual networks of mutual learning to enhance the complementarity and further depress noise in the pseudo-labels.
Experiments demonstrate that the proposed method achieves state-of-the-art performance with significant improvements on multiple UDA person re-identification tasks.
arXiv Detail & Related papers (2020-06-11T15:40:40Z) - Contradictory Structure Learning for Semi-supervised Domain Adaptation [67.89665267469053]
Current adversarial adaptation methods attempt to align the cross-domain features.
Two challenges remain unsolved: 1) the conditional distribution mismatch and 2) the bias of the decision boundary towards the source domain.
We propose a novel framework for semi-supervised domain adaptation by unifying the learning of opposite structures.
arXiv Detail & Related papers (2020-02-06T22:58:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.