Transformer-Gather, Fuzzy-Reconsider: A Scalable Hybrid Framework for Entity Resolution
- URL: http://arxiv.org/abs/2509.17470v2
- Date: Fri, 24 Oct 2025 17:04:17 GMT
- Title: Transformer-Gather, Fuzzy-Reconsider: A Scalable Hybrid Framework for Entity Resolution
- Authors: Mohammadreza Sharifi, Danial Ahmadzadeh
- Abstract summary: We introduce a scalable hybrid framework designed to address several important problems. We utilize a pre-trained language model to encode each structured record into a corresponding semantic embedding vector. After retrieving a semantically relevant subset of candidates, we apply a syntactic verification stage.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Entity resolution plays a significant role in enterprise systems where data integrity must be rigorously maintained. Traditional methods often struggle with noisy data or semantic understanding, while modern methods suffer from high computational costs or an excessive need for parallel computation. In this study, we introduce a scalable hybrid framework designed to address several important problems, including scalability, noise robustness, and result reliability. We utilize a pre-trained language model to encode each structured record into a corresponding semantic embedding vector. Subsequently, after retrieving a semantically relevant subset of candidates, we apply a syntactic verification stage using fuzzy string matching techniques to refine classification on the unlabeled data. This approach was applied to a real-world entity resolution task, which exposed a linkage between a central user management database and numerous shared hosting server records. Compared to other methods, this approach exhibits outstanding performance in terms of both processing time and robustness, making it a reliable solution for a server-side product. Crucially, this efficiency does not compromise results, as the system maintains a high retrieval recall of approximately 0.97. The scalability of the framework makes it deployable on standard CPU-based infrastructure, offering a practical and effective solution for enterprise-level data integrity auditing.
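The two-stage "gather, then reconsider" pipeline described in the abstract can be sketched as follows. This is a minimal toy illustration, not the authors' implementation: a character-trigram bag stands in for the pre-trained language model embeddings, and stdlib `difflib.SequenceMatcher` stands in for a dedicated fuzzy string matching library. The function and parameter names (`resolve`, `k`, `threshold`) are illustrative assumptions.

```python
# Toy sketch of semantic gather + fuzzy reconsider for entity resolution.
# NOTE: the paper uses a pre-trained language model for stage 1; a
# character-trigram bag is used here so the example stays dependency-free.
from collections import Counter
from difflib import SequenceMatcher
import math

def embed(text: str) -> Counter:
    """Stand-in 'semantic' embedding: bag of character trigrams."""
    t = text.lower()
    return Counter(t[i:i + 3] for i in range(len(t) - 2))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def resolve(query: str, records: list[str], k: int = 3,
            threshold: float = 0.8) -> list[str]:
    # Stage 1 (gather): retrieve the top-k semantically similar candidates.
    q = embed(query)
    candidates = sorted(records, key=lambda r: cosine(q, embed(r)),
                        reverse=True)[:k]
    # Stage 2 (reconsider): syntactic verification via fuzzy string matching.
    return [r for r in candidates
            if SequenceMatcher(None, query.lower(), r.lower()).ratio()
            >= threshold]

records = ["Acme Hosting Ltd.", "ACME Hosting Limited", "Beta Cloud GmbH"]
print(resolve("acme hosting ltd", records, threshold=0.6))
```

The two-stage structure is what makes the approach cheap: the embedding retrieval narrows the search space so that the relatively expensive pairwise fuzzy comparison only runs on a small candidate subset.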
Related papers
- Generative Data Transformation: From Mixed to Unified Data [57.84692191369066]
Taesar is a data-centric framework for target regeneration. It encodes cross-domain context into target sequences, enabling standard models to learn intricate dependencies without complex fusion architectures.
arXiv Detail & Related papers (2026-02-26T08:30:09Z) - Query as Anchor: Scenario-Adaptive User Representation via Large Language Model [28.30329175937291]
We propose Query-as-Anchor, a framework shifting user modeling from static encoding to dynamic, query-aware synthesis. We first construct UserU, an industrial-scale pre-training dataset that aligns behavioral sequences with user understanding semantics. We introduce Cluster-based Soft Prompt Tuning to enforce discriminative latent structures. For deployment, anchoring queries at sequence termini enables KV-cache-accelerated inference with negligible incremental latency.
arXiv Detail & Related papers (2026-02-16T06:09:31Z) - CREAM: Continual Retrieval on Dynamic Streaming Corpora with Adaptive Soft Memory [19.64051996386645]
CREAM is a self-supervised framework for memory-based continual retrieval. It adapts to both seen and unseen topics in an unsupervised setting. Experiments on two benchmark datasets demonstrate that CREAM exhibits superior adaptability and retrieval accuracy.
arXiv Detail & Related papers (2026-01-06T04:47:49Z) - SimpleMem: Efficient Lifelong Memory for LLM Agents [73.74399447715052]
We introduce SimpleMem, an efficient memory framework based on semantic lossless compression. We propose a three-stage pipeline designed to maximize information density and token utilization. Experiments on benchmark datasets show that our method consistently outperforms baseline approaches in accuracy, retrieval efficiency, and inference cost.
arXiv Detail & Related papers (2026-01-05T21:02:49Z) - A Simple and Effective Framework for Symmetric Consistent Indexing in Large-Scale Dense Retrieval [11.72564658353791]
Dense retrieval has become the industry standard in large-scale information retrieval systems due to its high efficiency and competitive accuracy. The widely adopted dual-tower encoding architecture introduces inherent challenges, primarily representational space misalignment and retrieval index inconsistency. This paper proposes a simple and effective framework named SCI comprising two synergistic modules. We provide theoretical guarantees for our approach, with its effectiveness validated by results across public datasets and real-world e-commerce datasets.
arXiv Detail & Related papers (2025-12-15T08:11:24Z) - Semantic Caching for Low-Cost LLM Serving: From Offline Learning to Online Adaptation [54.61034867177997]
Caching inference responses allows them to be retrieved without another forward pass through the Large Language Models. Traditional exact-match caching overlooks the semantic similarity between queries, leading to unnecessary recomputation. We present a principled, learning-based framework for semantic cache eviction under unknown query and cost distributions.
arXiv Detail & Related papers (2025-08-11T06:53:27Z) - Tree-Based Text Retrieval via Hierarchical Clustering in RAG Frameworks: Application on Taiwanese Regulations [0.0]
We propose a hierarchical clustering-based retrieval method that eliminates the need to predefine k. Our approach maintains the accuracy and relevance of system responses while adaptively selecting semantically relevant content. Our framework is simple to implement and easily integrates with existing RAG pipelines, making it a practical solution for real-world applications under limited resources.
arXiv Detail & Related papers (2025-06-16T15:34:29Z) - Online federated learning framework for classification [7.613977984287604]
We develop a novel online federated learning framework for classification. We handle streaming data from multiple clients while ensuring data privacy and computational efficiency. Our approach delivers high classification accuracy, significant computational efficiency gains, and substantial savings in data storage requirements compared to existing methods.
arXiv Detail & Related papers (2025-03-19T13:50:19Z) - Data-Juicer 2.0: Cloud-Scale Adaptive Data Processing for and with Foundation Models [64.28420991770382]
Data-Juicer 2.0 is a data processing system backed by data processing operators spanning text, image, video, and audio modalities. It supports more critical tasks including data analysis, annotation, and foundation model post-training. It has been widely adopted in diverse research fields and real-world products such as Alibaba Cloud PAI.
arXiv Detail & Related papers (2024-12-23T08:29:57Z) - HyperImpute: Generalized Iterative Imputation with Automatic Model Selection [77.86861638371926]
We propose a generalized iterative imputation framework for adaptively and automatically configuring column-wise models.
We provide a concrete implementation with out-of-the-box learners, simulators, and interfaces.
arXiv Detail & Related papers (2022-06-15T19:10:35Z) - Slimmable Domain Adaptation [112.19652651687402]
We introduce a simple framework, Slimmable Domain Adaptation, to improve cross-domain generalization with a weight-sharing model bank.
Our framework surpasses other competing approaches by a very large margin on multiple benchmarks.
arXiv Detail & Related papers (2022-06-14T06:28:04Z) - Revisiting Mahalanobis Distance for Transformer-Based Out-of-Domain Detection [60.88952532574564]
This paper conducts a thorough comparison of out-of-domain intent detection methods.
We evaluate multiple contextual encoders and methods, proven to be efficient, on three standard datasets for intent classification.
Our main findings show that fine-tuning Transformer-based encoders on in-domain data leads to superior results.
arXiv Detail & Related papers (2021-01-11T09:10:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.