LSF-Join: Locality Sensitive Filtering for Distributed All-Pairs Set
Similarity Under Skew
- URL: http://arxiv.org/abs/2003.02972v1
- Date: Fri, 6 Mar 2020 00:06:20 GMT
- Title: LSF-Join: Locality Sensitive Filtering for Distributed All-Pairs Set
Similarity Under Skew
- Authors: Cyrus Rashtchian, Aneesh Sharma, David P. Woodruff
- Abstract summary: All-pairs set similarity is a widely used data mining task, even for large and high-dimensional datasets.
We present a new distributed algorithm, LSF-Join, for approximate all-pairs set similarity.
We show that LSF-Join efficiently finds most close pairs, even for small similarity thresholds and for skewed input sets.
- Score: 58.21885402826496
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: All-pairs set similarity is a widely used data mining task, even for large
and high-dimensional datasets. Traditionally, similarity search has focused on
discovering very similar pairs, for which a variety of efficient algorithms are
known. However, recent work highlights the importance of finding pairs of sets
with relatively small intersection sizes. For example, in a recommender system,
two users may be alike even though their interests only overlap on a small
percentage of items. In such systems, some dimensions are often highly skewed
because they are very popular. Together these two properties render previous
approaches infeasible for large input sizes. To address this problem, we
present a new distributed algorithm, LSF-Join, for approximate all-pairs set
similarity. The core of our algorithm is a randomized selection procedure based
on Locality Sensitive Filtering. Our method deviates from prior approximate
algorithms, which are based on Locality Sensitive Hashing. Theoretically, we
show that LSF-Join efficiently finds most close pairs, even for small
similarity thresholds and for skewed input sets. We prove guarantees on the
communication, work, and maximum load of LSF-Join, and we also experimentally
demonstrate its accuracy on multiple graphs.
Related papers
- PECANN: Parallel Efficient Clustering with Graph-Based Approximate
Nearest Neighbor Search [8.15681999722805]
This paper studies density-based clustering of point sets.
It unifies the different variants of density peaks clustering into a single framework, PECANN.
We implement five clustering algorithms with PECANN and evaluate them on synthetic and real-world datasets with up to 1.28 million points and up to 1024 dimensions on a 30-core machine with two-way hyper-threading.
arXiv Detail & Related papers (2023-12-06T22:43:50Z) - ParlayANN: Scalable and Deterministic Parallel Graph-Based Approximate
Nearest Neighbor Search Algorithms [5.478671305092084]
We introduce ParlayANN, a library of deterministic and parallel graph-based approximate nearest neighbor search algorithms.
We develop novel parallel implementations for four state-of-the-art graph-based ANNS algorithms that scale to billion-scale datasets.
arXiv Detail & Related papers (2023-05-07T19:28:23Z) - Scalable Batch Acquisition for Deep Bayesian Active Learning [70.68403899432198]
In deep active learning, it is important to choose multiple examples to markup at each step.
Existing solutions to this problem, such as BatchBALD, have significant limitations in selecting a large number of examples.
We present the Large BatchBALD algorithm, which aims to achieve comparable quality while being more computationally efficient.
arXiv Detail & Related papers (2023-01-13T11:45:17Z) - Parallel Instance Filtering for Malware Detection [0.0]
This work presents a new parallel instance selection algorithm called Parallel Instance Filtering (PIF)
The main idea of the algorithm is to split the data set into non-overlapping subsets of instances covering the whole data set and apply a filtering process for each subset.
We compare the PIF algorithm with several state-of-the-art instance selection algorithms on a large data set of 500,000 malicious and benign samples.
arXiv Detail & Related papers (2022-06-28T11:14:20Z) - A Metaheuristic Algorithm for Large Maximum Weight Independent Set
Problems [58.348679046591265]
Given a node-weighted graph, find a set of independent (mutually nonadjacent) nodes whose node-weight sum is maximum.
Some of the graphs airsing in this application are large, having hundreds of thousands of nodes and hundreds of millions of edges.
We develop a new local search algorithm, which is a metaheuristic in the greedy randomized adaptive search framework.
arXiv Detail & Related papers (2022-03-28T21:34:16Z) - Approximate Nearest Neighbor Search under Neural Similarity Metric for
Large-Scale Recommendation [20.42993976179691]
We propose a novel method to extend ANN search to arbitrary matching functions.
Our main idea is to perform a greedy walk with a matching function in a similarity graph constructed from all items.
The proposed method has been fully deployed in the Taobao display advertising platform and brings a considerable advertising revenue increase.
arXiv Detail & Related papers (2022-02-14T07:55:57Z) - SLOSH: Set LOcality Sensitive Hashing via Sliced-Wasserstein Embeddings [18.916058638077274]
This paper focuses on non-parametric and data-independent learning from set-structured data using approximate nearest neighbor (ANN) solutions.
We propose Sliced-Wasserstein set embedding as a computationally efficient "set-2-vector" mechanism that enables downstream ANN.
We demonstrate the effectiveness of our algorithm, denoted as Set-LOcality Sensitive Hashing (SLOSH), on various set retrieval datasets.
arXiv Detail & Related papers (2021-12-11T00:10:05Z) - Multidimensional Assignment Problem for multipartite entity resolution [69.48568967931608]
Multipartite entity resolution aims at integrating records from multiple datasets into one entity.
We apply two procedures, a Greedy algorithm and a large scale neighborhood search, to solve the assignment problem.
We find evidence that design-based multi-start can be more efficient as the size of databases grow large.
arXiv Detail & Related papers (2021-12-06T20:34:55Z) - Differentially Private Clustering: Tight Approximation Ratios [57.89473217052714]
We give efficient differentially private algorithms for basic clustering problems.
Our results imply an improved algorithm for the Sample and Aggregate privacy framework.
One of the tools used in our 1-Cluster algorithm can be employed to get a faster quantum algorithm for ClosestPair in a moderate number of dimensions.
arXiv Detail & Related papers (2020-08-18T16:22:06Z) - Ranking a set of objects: a graph based least-square approach [70.7866286425868]
We consider the problem of ranking $N$ objects starting from a set of noisy pairwise comparisons provided by a crowd of equal workers.
We propose a class of non-adaptive ranking algorithms that rely on a least-squares intrinsic optimization criterion for the estimation of qualities.
arXiv Detail & Related papers (2020-02-26T16:19:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.