Exact and Approximate Hierarchical Clustering Using A*
- URL: http://arxiv.org/abs/2104.07061v1
- Date: Wed, 14 Apr 2021 18:15:27 GMT
- Title: Exact and Approximate Hierarchical Clustering Using A*
- Authors: Craig S. Greenberg, Sebastian Macaluso, Nicholas Monath, Avinava
Dubey, Patrick Flaherty, Manzil Zaheer, Amr Ahmed, Kyle Cranmer, Andrew
McCallum
- Abstract summary: We introduce a new approach based on A* search for clustering.
We overcome the prohibitively large search space by combining A* with a novel emphtrellis data structure.
We empirically demonstrate that our method achieves substantially higher quality results than baselines for a particle physics use case and other clustering benchmarks.
- Score: 51.187990314731344
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Hierarchical clustering is a critical task in numerous domains. Many
approaches are based on heuristics and the properties of the resulting
clusterings are studied post hoc. However, in several applications, there is a
natural cost function that can be used to characterize the quality of the
clustering. In those cases, hierarchical clustering can be seen as a
combinatorial optimization problem. To that end, we introduce a new approach
based on A* search. We overcome the prohibitively large search space by
combining A* with a novel \emph{trellis} data structure. This combination
results in an exact algorithm that scales beyond previous state of the art,
from a search space with $10^{12}$ trees to $10^{15}$ trees, and an approximate
algorithm that improves over baselines, even in enormous search spaces that
contain more than $10^{1000}$ trees. We empirically demonstrate that our method
achieves substantially higher quality results than baselines for a particle
physics use case and other clustering benchmarks. We describe how our method
provides significantly improved theoretical bounds on the time and space
complexity of A* for clustering.
Related papers
- Accelerating k-Means Clustering with Cover Trees [0.30693357740321775]
We propose a new k-means algorithm based on the cover tree index, that has relatively low overhead and performs well.
We obtain a hybrid algorithm that combines the benefits of tree aggregation and bounds-based filtering.
arXiv Detail & Related papers (2024-10-19T14:02:42Z) - GBMST: An Efficient Minimum Spanning Tree Clustering Based on
Granular-Ball Computing [78.92205914422925]
We propose a clustering algorithm that combines multi-granularity Granular-Ball and minimum spanning tree (MST)
We construct coarsegrained granular-balls, and then use granular-balls and MST to implement the clustering method based on "large-scale priority"
Experimental results on several data sets demonstrate the power of the algorithm.
arXiv Detail & Related papers (2023-03-02T09:04:35Z) - Differentially-Private Hierarchical Clustering with Provable
Approximation Guarantees [79.59010418610625]
We study differentially private approximation algorithms for hierarchical clustering.
We show strong lower bounds for the problem: that any $epsilon$-DP algorithm must exhibit $O(|V|2/ epsilon)$-additive error for an input dataset.
We propose a private $1+o(1)$ approximation algorithm which also recovers the blocks exactly.
arXiv Detail & Related papers (2023-01-31T19:14:30Z) - Natural Hierarchical Cluster Analysis by Nearest Neighbors with
Near-Linear Time Complexity [0.0]
We propose a nearest neighbor based clustering algorithm that results in a naturally defined hierarchy of clusters.
In contrast to the agglomerative and divisive hierarchical clustering algorithms, our approach is not dependent on the iterative working of the algorithm.
arXiv Detail & Related papers (2022-03-15T16:03:42Z) - An Information-theoretic Perspective of Hierarchical Clustering [30.896561720088954]
A cost function for hierarchical clustering was introduced by Dasgupta citedasgupta2016cost.
In this paper, we investigate hierarchical clustering from the emphinformation-theoretic perspective and formulate a new objective function.
arXiv Detail & Related papers (2021-08-13T03:03:56Z) - Fuzzy Clustering with Similarity Queries [56.96625809888241]
The fuzzy or soft objective is a popular generalization of the well-known $k$-means problem.
We show that by making few queries, the problem becomes easier to solve.
arXiv Detail & Related papers (2021-06-04T02:32:26Z) - Scalable Hierarchical Agglomerative Clustering [65.66407726145619]
Existing scalable hierarchical clustering methods sacrifice quality for speed.
We present a scalable, agglomerative method for hierarchical clustering that does not sacrifice quality and scales to billions of data points.
arXiv Detail & Related papers (2020-10-22T15:58:35Z) - Computationally efficient sparse clustering [67.95910835079825]
We provide a finite sample analysis of a new clustering algorithm based on PCA.
We show that it achieves the minimax optimal misclustering rate in the regime $|theta infty$.
arXiv Detail & Related papers (2020-05-21T17:51:30Z) - Explainable $k$-Means and $k$-Medians Clustering [25.513261099927163]
We consider using a small decision tree to partition a data set into clusters, so that clusters can be characterized in a straightforward manner.
We show that popular top-down decision tree algorithms may lead to clusterings with arbitrarily large cost.
We design an efficient algorithm that produces explainable clusters using a tree with $k$ leaves.
arXiv Detail & Related papers (2020-02-28T04:21:53Z) - Data Structures & Algorithms for Exact Inference in Hierarchical
Clustering [41.24805506595378]
We present novel dynamic-programming algorithms for emphexact inference in hierarchical clustering based on a novel trellis data structure.
Our algorithms scale in time and space proportional to the powerset of $N$ elements which is super-exponentially more efficient than explicitly considering each of the (2N-3)!! possible hierarchies.
arXiv Detail & Related papers (2020-02-26T17:43:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.