ParChain: A Framework for Parallel Hierarchical Agglomerative Clustering using Nearest-Neighbor Chain
- URL: http://arxiv.org/abs/2106.04727v1
- Date: Tue, 8 Jun 2021 23:13:27 GMT
- Title: ParChain: A Framework for Parallel Hierarchical Agglomerative Clustering using Nearest-Neighbor Chain
- Authors: Shangdi Yu, Yiqiu Wang, Yan Gu, Laxman Dhulipala, Julian Shun
- Abstract summary: We propose the ParChain framework for designing parallel hierarchical agglomerative clustering (HAC) algorithms.
Compared to most previous parallel HAC algorithms, our new algorithms require only linear memory, and are scalable to large data sets.
Our algorithms scale to data sets with tens of millions of points, which existing algorithms cannot handle.
- Score: 6.824747267214373
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper studies the hierarchical clustering problem, where the goal is to
produce a dendrogram that represents clusters at varying scales of a data set.
We propose the ParChain framework for designing parallel hierarchical
agglomerative clustering (HAC) algorithms, and using the framework we obtain
novel parallel algorithms for the complete linkage, average linkage, and Ward's
linkage criteria. Compared to most previous parallel HAC algorithms, which
require quadratic memory, our new algorithms require only linear memory, and
are scalable to large data sets. ParChain is based on our parallelization of
the nearest-neighbor chain algorithm, and enables multiple clusters to be
merged on every round. We introduce two key optimizations that are critical for
efficiency: a range query optimization that reduces the number of distance
computations required when finding nearest neighbors of clusters, and a caching
optimization that stores a subset of previously computed distances, which are
likely to be reused.
Experimentally, we show that our highly-optimized implementations using 48
cores with two-way hyper-threading achieve 5.8-110.1x speedup over
state-of-the-art parallel HAC algorithms and achieve 13.75-54.23x
self-relative speedup. Compared to state-of-the-art algorithms, our algorithms
require up to 237.3x less space. Our algorithms scale to data sets with tens
of millions of points, which existing algorithms cannot handle.
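The paper's code is not reproduced here, but the sequential nearest-neighbor
chain procedure that ParChain parallelizes can be summarized in a short sketch.
The following Python is illustrative only (the function names and the
brute-force complete-linkage computation are ours, not the authors' code); it
relies on the reducibility of these linkage criteria, which makes a pair of
mutual nearest neighbors always safe to merge:

```python
import numpy as np

def complete_linkage(A, B, points):
    """Complete linkage: maximum pairwise distance between clusters A and B."""
    return max(np.linalg.norm(points[a] - points[b]) for a in A for b in B)

def nn_chain_hac(points):
    """Sequential nearest-neighbor chain HAC (complete linkage).

    Follows a chain of nearest neighbors until two clusters are mutual
    nearest neighbors, then merges them; for reducible linkages the
    remainder of the chain stays valid after a merge.
    """
    clusters = {i: frozenset([i]) for i in range(len(points))}
    merges = []
    chain = []
    while len(clusters) > 1:
        if not chain:
            chain.append(next(iter(clusters)))  # seed with any live cluster
        top = chain[-1]
        # Nearest live cluster to the top of the chain (brute force here;
        # ParChain's range-query and caching optimizations target this step).
        nn = min((c for c in clusters if c != top),
                 key=lambda c: complete_linkage(clusters[top], clusters[c], points))
        if len(chain) >= 2 and nn == chain[-2]:
            # Mutual nearest neighbors: merge the top two chain entries.
            a, b = chain.pop(), chain.pop()
            new_id = max(clusters) + 1
            clusters[new_id] = clusters.pop(a) | clusters.pop(b)
            merges.append((a, b, new_id))
        else:
            chain.append(nn)
    return merges  # (child, child, parent) triples defining the dendrogram

# Example: pts = np.random.rand(1000, 2); merges = nn_chain_hac(pts)
```

ParChain's contribution, per the abstract, is to grow chains and merge multiple
mutual nearest-neighbor pairs per round in parallel, using range queries and
distance caching to avoid the quadratic work this sketch performs.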
Related papers
- Parallelization of the K-Means Algorithm with Applications to Big Data Clustering [0.23020018305241333]
K-Means clustering using Lloyd's algorithm is an iterative approach to partitioning a given dataset into K different clusters.
We compare two different parallelization approaches in this project.
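For reference, a minimal NumPy sketch of the Lloyd's iteration being
parallelized, alternating an assignment step and a centroid-update step
(illustrative, not the paper's code):

```python
import numpy as np

def lloyds_kmeans(points, k, iters=100, seed=0):
    """Plain Lloyd's algorithm for K-Means."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center moves to the mean of its assigned points.
        new_centers = np.array([points[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):
            break  # converged
        centers = new_centers
    return labels, centers
```

Both steps parallelize naturally over points and over centers, which is what
makes the algorithm a common target for big-data parallelization.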
arXiv Detail & Related papers (2024-05-20T14:18:36Z)
- PECANN: Parallel Efficient Clustering with Graph-Based Approximate Nearest Neighbor Search [8.15681999722805]
This paper studies density-based clustering of point sets.
It unifies the different variants of density peaks clustering into a single framework, PECANN.
We implement five clustering algorithms with PECANN and evaluate them on synthetic and real-world datasets with up to 1.28 million points and up to 1024 dimensions on a 30-core machine with two-way hyper-threading.
arXiv Detail & Related papers (2023-12-06T22:43:50Z)
- An Efficient Algorithm for Clustered Multi-Task Compressive Sensing [60.70532293880842]
Clustered multi-task compressive sensing is a hierarchical model that solves multiple compressive sensing tasks.
The existing inference algorithm for this model is computationally expensive and does not scale well in high dimensions.
We propose a new algorithm that substantially accelerates model inference by avoiding the need to explicitly compute the model's covariance matrices.
arXiv Detail & Related papers (2023-09-30T15:57:14Z)
- Breaking 3-Factor Approximation for Correlation Clustering in Polylogarithmic Rounds [0.23633885460047763]
We study parallel algorithms for the correlation clustering problem.
The goal is to partition the entities into clusters so as to minimize the number of disagreements with the pairwise labels.
Currently, all efficient parallel algorithms have an approximation ratio of at least 3.
We propose the first algorithm running in poly-logarithmically many rounds that achieves an approximation ratio better than 3.
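For context, the 3-approximation baseline referenced here is classically
achieved by the sequential Pivot (KwikCluster) algorithm. A minimal sketch,
assuming the input is given as '+'-labeled pairs with all other pairs
implicitly '-' (this is the classic baseline, not the paper's sub-3 algorithm):

```python
import random

def kwik_cluster(n, positive_pairs, seed=0):
    """Pivot/KwikCluster: expected 3-approximation for correlation clustering.

    positive_pairs: set of frozenset({u, v}) pairs labeled '+'.
    """
    rng = random.Random(seed)
    unclustered = set(range(n))
    clusters = []
    while unclustered:
        pivot = rng.choice(sorted(unclustered))  # pick a random pivot
        # The pivot's cluster is itself plus its unclustered '+' neighbors.
        cluster = {pivot} | {u for u in unclustered
                             if frozenset((pivot, u)) in positive_pairs}
        clusters.append(cluster)
        unclustered -= cluster
    return clusters
```

Each round removes at least the pivot, so the sequential loop may take linearly
many rounds; the parallel challenge addressed by this line of work is achieving
comparable quality in polylogarithmically many rounds.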
arXiv Detail & Related papers (2023-07-13T12:32:49Z)
- Efficient Approximate Kernel Based Spike Sequence Classification [56.2938724367661]
Machine learning models, such as SVM, require a definition of distance/similarity between pairs of sequences.
Exact methods yield better classification performance, but they incur high computational costs.
We propose a series of improvements to the approximate kernel that enhance its predictive performance.
arXiv Detail & Related papers (2022-09-11T22:44:19Z)
- A Metaheuristic Algorithm for Large Maximum Weight Independent Set Problems [58.348679046591265]
Given a node-weighted graph, find a set of independent (mutually nonadjacent) nodes whose node-weight sum is maximum.
Some of the graphs arising in this application are large, having hundreds of thousands of nodes and hundreds of millions of edges.
We develop a new local search algorithm, which is a metaheuristic in the greedy randomized adaptive search (GRASP) framework.
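A metaheuristic in this framework alternates a randomized greedy construction
with local search. A minimal sketch of the construction phase for maximum
weight independent set (the restricted-candidate-list rule and all names are
illustrative, not the paper's implementation):

```python
import random

def grasp_mwis_construct(weights, adj, alpha=0.3, seed=0):
    """One randomized greedy construction pass for MWIS.

    weights: dict node -> weight; adj: dict node -> set of neighbor nodes.
    """
    rng = random.Random(seed)
    candidates = set(weights)
    independent = set()
    while candidates:
        w_max = max(weights[v] for v in candidates)
        w_min = min(weights[v] for v in candidates)
        # Restricted candidate list: nodes within alpha of the best weight.
        threshold = w_max - alpha * (w_max - w_min)
        rcl = [v for v in candidates if weights[v] >= threshold]
        v = rng.choice(rcl)
        independent.add(v)
        candidates.discard(v)
        candidates -= adj[v]  # neighbors are no longer eligible
    return independent
```

Repeated construction passes feed a local search (e.g., swap-based
improvement), and the best independent set found across runs is kept.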
arXiv Detail & Related papers (2022-03-28T21:34:16Z)
- A density peaks clustering algorithm with sparse search and K-d tree [16.141611031128427]
A density peaks clustering algorithm with sparse search and a K-d tree is developed to reduce the high computational cost of standard density peaks clustering.
Experiments are carried out on datasets with different distribution characteristics, comparing against five other typical clustering algorithms.
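Density peaks clustering assigns each point a local density and a distance to
its nearest higher-density point; a K-d tree accelerates the underlying
neighbor queries. A rough sketch using SciPy's cKDTree (the cutoff d_c and all
names are illustrative, not the paper's code):

```python
import numpy as np
from scipy.spatial import cKDTree

def density_peaks_quantities(points, d_c=0.5):
    """rho[i]: number of points within radius d_c of point i (local density).
    delta[i], parent[i]: distance to, and index of, the nearest point of
    strictly higher density (brute force here; accelerating this kind of
    search is what the K-d tree variant targets)."""
    points = np.asarray(points)
    tree = cKDTree(points)
    # Range query per point: neighbors within d_c, minus the point itself.
    rho = np.array([len(tree.query_ball_point(p, r=d_c)) - 1 for p in points])
    n = len(points)
    delta = np.full(n, np.inf)
    parent = np.full(n, -1)
    for i in range(n):
        denser = np.where(rho > rho[i])[0]
        if len(denser) == 0:
            continue  # global density peak
        d = np.linalg.norm(points[denser] - points[i], axis=1)
        j = d.argmin()
        delta[i], parent[i] = d[j], denser[j]
    return rho, delta, parent
```

Cluster centers are then chosen as points with both high rho and high delta,
and the remaining points follow their parent pointers to a center.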
arXiv Detail & Related papers (2022-03-02T09:29:40Z)
- Lower Bounds and Optimal Algorithms for Smooth and Strongly Convex Decentralized Optimization Over Time-Varying Networks [79.16773494166644]
We consider the task of minimizing the sum of smooth and strongly convex functions stored in a decentralized manner across the nodes of a communication network.
We establish lower bounds for this setting and design two optimal algorithms that attain them.
We corroborate the theoretical efficiency of these algorithms by performing an experimental comparison with existing state-of-the-art methods.
arXiv Detail & Related papers (2021-06-08T15:54:44Z)
- Exact and Approximate Hierarchical Clustering Using A* [51.187990314731344]
We introduce a new approach based on A* search for clustering.
We overcome the prohibitively large search space by combining A* with a novel trellis data structure.
We empirically demonstrate that our method achieves substantially higher quality results than baselines for a particle physics use case and other clustering benchmarks.
arXiv Detail & Related papers (2021-04-14T18:15:27Z)
- Fast Parallel Algorithms for Euclidean Minimum Spanning Tree and Hierarchical Spatial Clustering [6.4805900740861]
We introduce a new notion of well-separation to reduce the work and space of our algorithm for HDBSCAN*.
We show that our algorithms are theoretically efficient: they have work (number of operations) matching their sequential counterparts, and polylogarithmic depth (parallel time).
Our experiments on large real-world and synthetic data sets using a 48-core machine show that our fastest algorithms outperform the best serial algorithms for the problems by 11.13-55.89x, and existing parallel algorithms by at least an order of magnitude.
arXiv Detail & Related papers (2021-04-02T16:05:00Z)
- Differentially Private Clustering: Tight Approximation Ratios [57.89473217052714]
We give efficient differentially private algorithms for basic clustering problems.
Our results imply an improved algorithm for the Sample and Aggregate privacy framework.
One of the tools used in our 1-Cluster algorithm can be employed to get a faster quantum algorithm for ClosestPair in a moderate number of dimensions.
arXiv Detail & Related papers (2020-08-18T16:22:06Z)