LSF-Join: Locality Sensitive Filtering for Distributed All-Pairs Set
Similarity Under Skew
- URL: http://arxiv.org/abs/2003.02972v1
- Date: Fri, 6 Mar 2020 00:06:20 GMT
- Title: LSF-Join: Locality Sensitive Filtering for Distributed All-Pairs Set
Similarity Under Skew
- Authors: Cyrus Rashtchian, Aneesh Sharma, David P. Woodruff
- Abstract summary: All-pairs set similarity is a widely used data mining task, even for large and high-dimensional datasets.
We present a new distributed algorithm, LSF-Join, for approximate all-pairs set similarity.
We show that LSF-Join efficiently finds most close pairs, even for small similarity thresholds and for skewed input sets.
- Score: 58.21885402826496
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: All-pairs set similarity is a widely used data mining task, even for large
and high-dimensional datasets. Traditionally, similarity search has focused on
discovering very similar pairs, for which a variety of efficient algorithms are
known. However, recent work highlights the importance of finding pairs of sets
with relatively small intersection sizes. For example, in a recommender system,
two users may be alike even though their interests only overlap on a small
percentage of items. In such systems, some dimensions are often highly skewed
because they are very popular. Together these two properties render previous
approaches infeasible for large input sizes. To address this problem, we
present a new distributed algorithm, LSF-Join, for approximate all-pairs set
similarity. The core of our algorithm is a randomized selection procedure based
on Locality Sensitive Filtering. Our method deviates from prior approximate
algorithms, which are based on Locality Sensitive Hashing. Theoretically, we
show that LSF-Join efficiently finds most close pairs, even for small
similarity thresholds and for skewed input sets. We prove guarantees on the
communication, work, and maximum load of LSF-Join, and we also experimentally
demonstrate its accuracy on multiple graphs.
Related papers
- Scalable Private Partition Selection via Adaptive Weighting [66.09199304818928]
In a private set union, users hold subsets of items from an unbounded universe.
The goal is to output as many items as possible from the union of the users' sets while maintaining user-level differential privacy.
We propose an algorithm for this problem, MaximumDegree (MAD), which adaptively reroutes weight from items with weight far above the threshold needed for privacy to items with smaller weight.
arXiv Detail & Related papers (2025-02-13T01:27:11Z) - Billion-scale Similarity Search Using a Hybrid Indexing Approach with Advanced Filtering [49.1574468325115]
This paper presents a novel approach for similarity search with complex filtering capabilities on billion-scale datasets, optimized for CPU inference.
Our method extends the classical IVF-Flat index structure to integrate multi-dimensional filters.
The proposed algorithm combines dense embeddings with discrete filtering attributes, enabling fast retrieval in high-dimensional spaces.
arXiv Detail & Related papers (2025-01-23T07:47:00Z) - PECANN: Parallel Efficient Clustering with Graph-Based Approximate
Nearest Neighbor Search [8.15681999722805]
This paper studies density-based clustering of point sets.
It unifies the different variants of density peaks clustering into a single framework, PECANN.
We implement five clustering algorithms with PECANN and evaluate them on synthetic and real-world datasets with up to 1.28 million points and up to 1024 dimensions on a 30-core machine with two-way hyper-threading.
arXiv Detail & Related papers (2023-12-06T22:43:50Z) - ParlayANN: Scalable and Deterministic Parallel Graph-Based Approximate
Nearest Neighbor Search Algorithms [5.478671305092084]
We introduce ParlayANN, a library of deterministic and parallel graph-based approximate nearest neighbor search algorithms.
We develop novel parallel implementations for four state-of-the-art graph-based ANNS algorithms that scale to billion-scale datasets.
arXiv Detail & Related papers (2023-05-07T19:28:23Z) - Scalable Batch Acquisition for Deep Bayesian Active Learning [70.68403899432198]
In deep active learning, it is important to choose multiple examples to markup at each step.
Existing solutions to this problem, such as BatchBALD, have significant limitations in selecting a large number of examples.
We present the Large BatchBALD algorithm, which aims to achieve comparable quality while being more computationally efficient.
arXiv Detail & Related papers (2023-01-13T11:45:17Z) - Parallel Instance Filtering for Malware Detection [0.0]
This work presents a new parallel instance selection algorithm called Parallel Instance Filtering (PIF)
The main idea of the algorithm is to split the data set into non-overlapping subsets of instances covering the whole data set and apply a filtering process for each subset.
We compare the PIF algorithm with several state-of-the-art instance selection algorithms on a large data set of 500,000 malicious and benign samples.
arXiv Detail & Related papers (2022-06-28T11:14:20Z) - A Metaheuristic Algorithm for Large Maximum Weight Independent Set
Problems [58.348679046591265]
Given a node-weighted graph, find a set of independent (mutually nonadjacent) nodes whose node-weight sum is maximum.
Some of the graphs airsing in this application are large, having hundreds of thousands of nodes and hundreds of millions of edges.
We develop a new local search algorithm, which is a metaheuristic in the greedy randomized adaptive search framework.
arXiv Detail & Related papers (2022-03-28T21:34:16Z) - Approximate Nearest Neighbor Search under Neural Similarity Metric for
Large-Scale Recommendation [20.42993976179691]
We propose a novel method to extend ANN search to arbitrary matching functions.
Our main idea is to perform a greedy walk with a matching function in a similarity graph constructed from all items.
The proposed method has been fully deployed in the Taobao display advertising platform and brings a considerable advertising revenue increase.
arXiv Detail & Related papers (2022-02-14T07:55:57Z) - SLOSH: Set LOcality Sensitive Hashing via Sliced-Wasserstein Embeddings [18.916058638077274]
This paper focuses on non-parametric and data-independent learning from set-structured data using approximate nearest neighbor (ANN) solutions.
We propose Sliced-Wasserstein set embedding as a computationally efficient "set-2-vector" mechanism that enables downstream ANN.
We demonstrate the effectiveness of our algorithm, denoted as Set-LOcality Sensitive Hashing (SLOSH), on various set retrieval datasets.
arXiv Detail & Related papers (2021-12-11T00:10:05Z) - Multidimensional Assignment Problem for multipartite entity resolution [69.48568967931608]
Multipartite entity resolution aims at integrating records from multiple datasets into one entity.
We apply two procedures, a Greedy algorithm and a large scale neighborhood search, to solve the assignment problem.
We find evidence that design-based multi-start can be more efficient as the size of databases grow large.
arXiv Detail & Related papers (2021-12-06T20:34:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.