Related papers: Multidimensional Assignment Problem for multipartite entity resolution

Multidimensional Assignment Problem for multipartite entity resolution

URL: http://arxiv.org/abs/2112.03346v1
Date: Mon, 6 Dec 2021 20:34:55 GMT
Title: Multidimensional Assignment Problem for multipartite entity resolution
Authors: Alla Kammerdiner and Alexander Semenov and Eduardo Pasiliao
Abstract summary: Multipartite entity resolution aims at integrating records from multiple datasets into one entity. We apply two procedures, a Greedy algorithm and a large scale neighborhood search, to solve the assignment problem. We find evidence that design-based multi-start can be more efficient as the size of databases grow large.
Score: 69.48568967931608
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multipartite entity resolution aims at integrating records from multiple datasets into one entity. We derive a mathematical formulation for a general class of record linkage problems in multipartite entity resolution across many datasets as a combinatorial optimization problem known as the multidimensional assignment problem. As a motivation for our approach, we illustrate the advantage of multipartite entity resolution over sequential bipartite matching. Because the optimization problem is NP-hard, we apply two heuristic procedures, a Greedy algorithm and very large scale neighborhood search, to solve the assignment problem and find the most likely matching of records from multiple datasets into a single entity. We evaluate and compare the performance of these algorithms and their modifications on synthetically generated data. We perform computational experiments to compare performance of recent heuristic, the very large-scale neighborhood search, with a Greedy algorithm, another heuristic for the MAP, as well as with two versions of genetic algorithm, a general metaheuristic. Importantly, we perform experiments to compare two alternative methods of re-starting the search for the former heuristic, specifically a random-sampling multi-start and a deterministic design-based multi-start. We find evidence that design-based multi-start can be more efficient as the size of databases grow large. In addition, we show that very large scale search, especially its multi-start version, outperforms simple Greedy heuristic. Hybridization of Greedy search with very large scale neighborhood search improves the performance. Using multi-start with as few as three additional runs of very large scale search offers some improvement in the performance of the very large scale search procedure. Last, we propose an approach to evaluating complexity of the very large-scale neighborhood search.

Related papers

Gradient-Based Join Ordering [0.532836690371986]
Join ordering is the problem of selecting the most efficient sequence in which to evaluate joins in a database query.<n>Traditional approaches cast this problem as a discrete runtime search over binary trees guided by a cost model.<n>We show that, when the cost model is differentiable, the query plans can be continuously relaxed into a soft adjacency matrix.<n>Using a learned Graph Neural Network as the cost model, we demonstrate that this gradient-based approach can find comparable and even lower-cost plans.
arXiv Detail & Related papers (2025-11-18T13:24:28Z)
Memetic collaborative approaches for finding balanced incomplete block designs [0.0]
The balanced incomplete block design (BIBD) problem is a difficult problem with a large number of symmetries. We propose a dual (integer) problem representation that serves as an alternative to the classical binary formulation of the problem.
arXiv Detail & Related papers (2024-11-04T16:41:18Z)
Retrieval with Learned Similarities [2.729516456192901]
State-of-the-art retrieval algorithms have migrated to learned similarities. We show that Mixture-of-Logits (MoL) can be realized empirically to achieve superior performance on diverse retrieval scenarios.
arXiv Detail & Related papers (2024-07-22T08:19:34Z)
Training Greedy Policy for Proposal Batch Selection in Expensive Multi-Objective Combinatorial Optimization [52.80408805368928]
We introduce a novel greedy-style subset selection algorithm for batch acquisition. Our experiments on the red fluorescent proteins show that our proposed method achieves the baseline performance in 1.69x fewer queries.
arXiv Detail & Related papers (2024-06-21T05:57:08Z)
An Efficient Algorithm for Clustered Multi-Task Compressive Sensing [60.70532293880842]
Clustered multi-task compressive sensing is a hierarchical model that solves multiple compressive sensing tasks. The existing inference algorithm for this model is computationally expensive and does not scale well in high dimensions. We propose a new algorithm that substantially accelerates model inference by avoiding the need to explicitly compute these covariance matrices.
arXiv Detail & Related papers (2023-09-30T15:57:14Z)
Greedy Relaxations of the Sparsest Permutation Algorithm [4.125187280299247]
We develop a class of algorithms, namely GRaSP, that are efficient and pointwise consistent under increasingly weaker assumptions than faithfulness. The most relaxed form of GRaSP outperforms many state-of-the-art causal search algorithms in simulation.
arXiv Detail & Related papers (2022-06-11T05:00:36Z)
Frequent Itemset-driven Search for Finding Minimum Node Separators in Complex Networks [61.2383572324176]
We propose a frequent itemset-driven search approach, which integrates the concept of frequent itemset mining in data mining into the well-known memetic search framework. It iteratively employs the frequent itemset recombination operator to generate promising offspring solution based on itemsets that frequently occur in high-quality solutions. In particular, it discovers 29 new upper bounds and matches 18 previous best-known bounds.
arXiv Detail & Related papers (2022-01-18T11:16:40Z)
Expressivity of Parameterized and Data-driven Representations in Quality Diversity Search [111.06379262544911]
We compare the output diversity of a quality diversity evolutionary search performed in two different search spaces. A learned model is better at interpolating between known data points than at extrapolating or expanding towards unseen examples.
arXiv Detail & Related papers (2021-05-10T10:27:43Z)
Towards Optimally Efficient Tree Search with Deep Learning [76.64632985696237]
This paper investigates the classical integer least-squares problem which estimates signals integer from linear models. The problem is NP-hard and often arises in diverse applications such as signal processing, bioinformatics, communications and machine learning. We propose a general hyper-accelerated tree search (HATS) algorithm by employing a deep neural network to estimate the optimal estimation for the underlying simplified memory-bounded A* algorithm.
arXiv Detail & Related papers (2021-01-07T08:00:02Z)
Learning to Accelerate Heuristic Searching for Large-Scale Maximum Weighted b-Matching Problems in Online Advertising [51.97494906131859]
Bipartite b-matching is fundamental in algorithm design, and has been widely applied into economic markets, labor markets, etc. Existing exact and approximate algorithms usually fail in such settings due to either requiring intolerable running time or too much computation resource. We propose textttNeuSearcher which leverages the knowledge learned from previously instances to solve new problem instances.
arXiv Detail & Related papers (2020-05-09T02:48:23Z)
Optimizing Revenue while showing Relevant Assortments at Scale [1.0200170217746136]
Real-time assortment optimization has become essential in e-commerce operations. We design fast and flexible algorithms that find the optimal assortment in difficult regimes. Empirical validations using a real world dataset show that our algorithms are competitive even when the number of items is $sim 105$ ($10times$ larger instances than previously studied)
arXiv Detail & Related papers (2020-03-06T20:16:49Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.