Hierarchical Agglomerative Graph Clustering in Nearly-Linear Time
- URL: http://arxiv.org/abs/2106.05610v1
- Date: Thu, 10 Jun 2021 09:29:05 GMT
- Title: Hierarchical Agglomerative Graph Clustering in Nearly-Linear Time
- Authors: Laxman Dhulipala, David Eisenstat, Jakub Łącki, Vahab Mirrokni, Jessica Shi
- Abstract summary: We study the widely used hierarchical agglomerative clustering (HAC) algorithm on edge-weighted graphs.
We define an algorithmic framework for hierarchical agglomerative graph clustering.
We show that our approach can speed up clustering of point datasets by a factor of 20.7--76.5x.
- Score: 1.5644420658691407
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the widely used hierarchical agglomerative clustering (HAC)
algorithm on edge-weighted graphs. We define an algorithmic framework for
hierarchical agglomerative graph clustering that provides the first efficient
$\tilde{O}(m)$ time exact algorithms for classic linkage measures, such as
complete- and WPGMA-linkage, as well as other measures. Furthermore, for
average-linkage, arguably the most popular variant of HAC, we provide an
algorithm that runs in $\tilde{O}(n\sqrt{m})$ time. For this variant, this is
the first exact algorithm that runs in subquadratic time, as long as
$m=n^{2-\epsilon}$ for some constant $\epsilon > 0$. We complement this result
with a simple $\epsilon$-close approximation algorithm for average-linkage in
our framework that runs in $\tilde{O}(m)$ time. As an application of our
algorithms, we consider clustering points in a metric space by first using
$k$-NN to generate a graph from the point set, and then running our algorithms
on the resulting weighted graph. We validate the performance of our algorithms
on publicly available datasets, and show that our approach can speed up
clustering of point datasets by a factor of 20.7--76.5x.
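As a concrete illustration of the pipeline described above (build a $k$-NN similarity graph from a point set, then run graph HAC on it), here is a minimal sketch. It is a naive, quadratic-time rendering of average-linkage graph HAC, not the paper's $\tilde{O}(n\sqrt{m})$ exact or $\tilde{O}(m)$ approximate algorithms; the use of scikit-learn's `kneighbors_graph`, the $1/(1+\text{distance})$ similarity transform, and running until no edges remain are assumptions made for the example, not choices prescribed by the paper.

```python
# Minimal sketch only: naive average-linkage HAC on a k-NN similarity graph.
# This is NOT the paper's nearly-linear-time algorithm; it merely illustrates
# the merge process the abstract refers to. kneighbors_graph and the
# 1/(1 + distance) similarity transform are illustrative assumptions.
import numpy as np
from sklearn.neighbors import kneighbors_graph


def knn_similarity_graph(points, k=10):
    """Build a symmetric k-NN similarity graph as {u: {v: weight}}."""
    dist = kneighbors_graph(points, n_neighbors=k, mode="distance").tocoo()
    graph = {i: {} for i in range(len(points))}
    for i, j, d in zip(dist.row, dist.col, dist.data):
        i, j = int(i), int(j)
        w = 1.0 / (1.0 + d)      # turn a distance into a similarity in (0, 1]
        graph[i][j] = w
        graph[j][i] = w          # symmetrize the k-NN graph
    return graph


def graph_hac_average_linkage(graph):
    """Naive average-linkage HAC: repeatedly contract the heaviest edge.

    Cluster similarity is W(A, B) = (total edge weight between A and B) / (|A|*|B|),
    so after merging A and B, the weight to a neighbour X becomes the size-weighted
    average (|A|*W(A, X) + |B|*W(B, X)) / (|A| + |B|). Each merge rescans all
    edges, so this sketch runs in O(n*m) time overall.
    """
    size = {v: 1 for v in graph}
    dendrogram = []              # list of merges: (cluster_u, cluster_v, similarity)
    while True:
        best = None              # globally heaviest remaining edge
        for u, nbrs in graph.items():
            for v, w in nbrs.items():
                if u < v and (best is None or w > best[2]):
                    best = (u, v, w)
        if best is None:         # no edges left; the dendrogram is complete
            break
        u, v, w = best
        dendrogram.append(best)
        # Contract v into u using the size-weighted average-linkage update.
        merged = {}
        for x in set(graph[u]) | set(graph[v]):
            if x in (u, v):
                continue
            merged[x] = (size[u] * graph[u].get(x, 0.0)
                         + size[v] * graph[v].get(x, 0.0)) / (size[u] + size[v])
        for x in graph[u]:
            graph[x].pop(u, None)
        for x in graph[v]:
            graph[x].pop(v, None)
        for x, wx in merged.items():
            graph[x][u] = wx
        graph[u] = merged
        size[u] += size[v]
        del graph[v], size[v]
    return dendrogram


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pts = rng.normal(size=(200, 2))
    merges = graph_hac_average_linkage(knn_similarity_graph(pts, k=10))
    print(f"recorded {len(merges)} merges")
```

The size-weighted update in the merge step is the average-linkage rule $W(A \cup B, X) = (|A|\,W(A,X) + |B|\,W(B,X))/(|A|+|B|)$; taking the minimum of the two cluster-to-neighbour similarities instead would give a complete-linkage variant, and their unweighted mean a WPGMA variant, within the same skeleton.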
Related papers
- Almost-linear Time Approximation Algorithm to Euclidean $k$-median and $k$-means [4.271492285528115]
We focus on the Euclidean $k$-median and $k$-means problems, two of the standard ways to model the task of clustering.
We present an almost linear-time algorithm that computes a constant-factor approximation for both problems.
arXiv Detail & Related papers (2024-07-15T20:04:06Z) - Ensemble Quadratic Assignment Network for Graph Matching [52.20001802006391]
Graph matching is a commonly used technique in computer vision and pattern recognition.
Recent data-driven approaches have improved graph matching accuracy remarkably.
We propose a graph neural network (GNN) based approach to combine the advantages of data-driven and traditional methods.
arXiv Detail & Related papers (2024-03-11T06:34:05Z) - A Scalable Algorithm for Individually Fair K-means Clustering [77.93955971520549]
We present a scalable algorithm for the individually fair ($p$, $k$)-clustering problem introduced by Jung et al. and Mahabadi et al.
A clustering is then called individually fair if it has centers within distance $\delta(x)$ of $x$ for each $x \in P$.
We show empirically that not only is our algorithm much faster than prior work, but it also produces lower-cost solutions.
arXiv Detail & Related papers (2024-02-09T19:01:48Z) - Fast and Simple Spectral Clustering in Theory and Practice [7.070726553564701]
Spectral clustering is a popular and effective algorithm designed to find $k$ clusters in a graph $G$.
We present a simple spectral clustering algorithm based on a vertex embedding with $O(\log(k))$ vectors computed by the power method.
We evaluate the new algorithm on several synthetic and real-world datasets, finding that it is significantly faster than alternative clustering algorithms, while producing results with approximately the same clustering accuracy.
arXiv Detail & Related papers (2023-10-17T02:31:57Z) - Do you know what q-means? [50.045011844765185]
Clustering is one of the most important tools for analysis of large datasets.
We present an improved version of the "$q$-means" algorithm for clustering.
We also present a "dequantized" algorithm for $varepsilon which runs in $Obig(frack2varepsilon2(sqrtkd + log(Nd))big.
arXiv Detail & Related papers (2023-08-18T17:52:12Z) - Near-Optimal Quantum Coreset Construction Algorithms for Clustering [15.513270929560088]
We give quantum algorithms that find coresets for $k$-clustering in $\mathbb{R}^d$ with $\tilde{O}(\sqrt{nk}\,d^{3/2})$ query complexity.
Our coreset reduces the input size from $n$ to $\mathrm{poly}(k\epsilon^{-1}d)$, so that existing $\alpha$-approximation algorithms for clustering can run on top of it.
arXiv Detail & Related papers (2023-06-05T12:22:46Z) - Differentially-Private Hierarchical Clustering with Provable
Approximation Guarantees [79.59010418610625]
We study differentially private approximation algorithms for hierarchical clustering.
We show strong lower bounds for the problem: that any $\epsilon$-DP algorithm must exhibit $O(|V|^2/\epsilon)$-additive error for an input dataset.
We propose a private $1+o(1)$ approximation algorithm which also recovers the blocks exactly.
arXiv Detail & Related papers (2023-01-31T19:14:30Z) - Correlation Clustering Algorithm for Dynamic Complete Signed Graphs: An
Index-based Approach [9.13755431537592]
In this paper, we reduce the complexity of approximating the correlation clustering problem from $O(m \times (2 + \alpha(G)) + n)$ to $O(m+n)$ for any given value of $\varepsilon$ for a complete signed graph.
Our approach gives the same output as the original algorithm and makes it possible to implement the algorithm in a full dynamic setting.
arXiv Detail & Related papers (2023-01-01T10:57:36Z) - A Metaheuristic Algorithm for Large Maximum Weight Independent Set
Problems [58.348679046591265]
Given a node-weighted graph, find a set of independent (mutually nonadjacent) nodes whose node-weight sum is maximum.
Some of the graphs arising in this application are large, having hundreds of thousands of nodes and hundreds of millions of edges.
We develop a new local search algorithm, which is a metaheuristic in the greedy randomized adaptive search framework.
arXiv Detail & Related papers (2022-03-28T21:34:16Z) - Sublinear Time and Space Algorithms for Correlation Clustering via
Sparse-Dense Decompositions [9.29659220237395]
We present a new approach for solving (minimum disagreement) correlation clustering.
We obtain sublinear algorithms with highly efficient time and space complexity.
The main ingredient of our approach is a novel connection to sparse-dense graph decompositions.
arXiv Detail & Related papers (2021-09-29T16:25:02Z) - Clustering Mixture Models in Almost-Linear Time via List-Decodable Mean
Estimation [58.24280149662003]
We study the problem of list-decodable mean estimation, where an adversary can corrupt a majority of the dataset.
We develop new algorithms for list-decodable mean estimation, achieving nearly-optimal statistical guarantees.
arXiv Detail & Related papers (2021-06-16T03:34:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.