Multidimensional Assignment Problem for multipartite entity resolution
- URL: http://arxiv.org/abs/2112.03346v1
- Date: Mon, 6 Dec 2021 20:34:55 GMT
- Title: Multidimensional Assignment Problem for multipartite entity resolution
- Authors: Alla Kammerdiner and Alexander Semenov and Eduardo Pasiliao
- Abstract summary: Multipartite entity resolution aims at integrating records from multiple datasets into one entity.
We apply two procedures, a Greedy algorithm and a large scale neighborhood search, to solve the assignment problem.
We find evidence that design-based multi-start can be more efficient as the size of databases grow large.
- Score: 69.48568967931608
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multipartite entity resolution aims at integrating records from multiple
datasets into one entity. We derive a mathematical formulation for a general
class of record linkage problems in multipartite entity resolution across many
datasets as a combinatorial optimization problem known as the multidimensional
assignment problem. As a motivation for our approach, we illustrate the
advantage of multipartite entity resolution over sequential bipartite matching.
Because the optimization problem is NP-hard, we apply two heuristic procedures,
a Greedy algorithm and very large scale neighborhood search, to solve the
assignment problem and find the most likely matching of records from multiple
datasets into a single entity. We evaluate and compare the performance of these
algorithms and their modifications on synthetically generated data. We perform
computational experiments to compare performance of recent heuristic, the very
large-scale neighborhood search, with a Greedy algorithm, another heuristic for
the MAP, as well as with two versions of genetic algorithm, a general
metaheuristic. Importantly, we perform experiments to compare two alternative
methods of re-starting the search for the former heuristic, specifically a
random-sampling multi-start and a deterministic design-based multi-start. We
find evidence that design-based multi-start can be more efficient as the size
of databases grow large. In addition, we show that very large scale search,
especially its multi-start version, outperforms simple Greedy heuristic.
Hybridization of Greedy search with very large scale neighborhood search
improves the performance. Using multi-start with as few as three additional
runs of very large scale search offers some improvement in the performance of
the very large scale search procedure. Last, we propose an approach to
evaluating complexity of the very large-scale neighborhood search.
Related papers
- Memetic collaborative approaches for finding balanced incomplete block designs [0.0]
The balanced incomplete block design (BIBD) problem is a difficult problem with a large number of symmetries.
We propose a dual (integer) problem representation that serves as an alternative to the classical binary formulation of the problem.
arXiv Detail & Related papers (2024-11-04T16:41:18Z) - Efficient Retrieval with Learned Similarities [2.729516456192901]
State-of-the-art retrieval algorithms have migrated to learned similarities.
We show that Mixture-of-Logits (MoL) is a universal approximator, and can express all learned similarity functions.
MoL sets new state-of-the-art results on recommendation retrieval tasks, and our approximate top-k retrieval with learned similarities outperforms baselines by up to two orders of magnitude in latency.
arXiv Detail & Related papers (2024-07-22T08:19:34Z) - Training Greedy Policy for Proposal Batch Selection in Expensive Multi-Objective Combinatorial Optimization [52.80408805368928]
We introduce a novel greedy-style subset selection algorithm for batch acquisition.
Our experiments on the red fluorescent proteins show that our proposed method achieves the baseline performance in 1.69x fewer queries.
arXiv Detail & Related papers (2024-06-21T05:57:08Z) - An Efficient Algorithm for Clustered Multi-Task Compressive Sensing [60.70532293880842]
Clustered multi-task compressive sensing is a hierarchical model that solves multiple compressive sensing tasks.
The existing inference algorithm for this model is computationally expensive and does not scale well in high dimensions.
We propose a new algorithm that substantially accelerates model inference by avoiding the need to explicitly compute these covariance matrices.
arXiv Detail & Related papers (2023-09-30T15:57:14Z) - Greedy Relaxations of the Sparsest Permutation Algorithm [4.125187280299247]
We develop a class of algorithms, namely GRaSP, that are efficient and pointwise consistent under increasingly weaker assumptions than faithfulness.
The most relaxed form of GRaSP outperforms many state-of-the-art causal search algorithms in simulation.
arXiv Detail & Related papers (2022-06-11T05:00:36Z) - Frequent Itemset-driven Search for Finding Minimum Node Separators in
Complex Networks [61.2383572324176]
We propose a frequent itemset-driven search approach, which integrates the concept of frequent itemset mining in data mining into the well-known memetic search framework.
It iteratively employs the frequent itemset recombination operator to generate promising offspring solution based on itemsets that frequently occur in high-quality solutions.
In particular, it discovers 29 new upper bounds and matches 18 previous best-known bounds.
arXiv Detail & Related papers (2022-01-18T11:16:40Z) - Expressivity of Parameterized and Data-driven Representations in Quality
Diversity Search [111.06379262544911]
We compare the output diversity of a quality diversity evolutionary search performed in two different search spaces.
A learned model is better at interpolating between known data points than at extrapolating or expanding towards unseen examples.
arXiv Detail & Related papers (2021-05-10T10:27:43Z) - Towards Optimally Efficient Tree Search with Deep Learning [76.64632985696237]
This paper investigates the classical integer least-squares problem which estimates signals integer from linear models.
The problem is NP-hard and often arises in diverse applications such as signal processing, bioinformatics, communications and machine learning.
We propose a general hyper-accelerated tree search (HATS) algorithm by employing a deep neural network to estimate the optimal estimation for the underlying simplified memory-bounded A* algorithm.
arXiv Detail & Related papers (2021-01-07T08:00:02Z) - Optimizing Revenue while showing Relevant Assortments at Scale [1.0200170217746136]
Real-time assortment optimization has become essential in e-commerce operations.
We design fast and flexible algorithms that find the optimal assortment in difficult regimes.
Empirical validations using a real world dataset show that our algorithms are competitive even when the number of items is $sim 105$ ($10times$ larger instances than previously studied)
arXiv Detail & Related papers (2020-03-06T20:16:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.