Data Structures & Algorithms for Exact Inference in Hierarchical
Clustering
- URL: http://arxiv.org/abs/2002.11661v3
- Date: Thu, 22 Oct 2020 15:18:02 GMT
- Title: Data Structures & Algorithms for Exact Inference in Hierarchical
Clustering
- Authors: Craig S. Greenberg, Sebastian Macaluso, Nicholas Monath, Ji-Ah Lee,
Patrick Flaherty, Kyle Cranmer, Andrew McGregor, Andrew McCallum
- Abstract summary: We present novel dynamic-programming algorithms for \emph{exact} inference in hierarchical clustering based on a novel trellis data structure.
Our algorithms scale in time and space proportional to the size of the power set of $N$ elements, which is super-exponentially more efficient than explicitly considering each of the $(2N-3)!!$ possible hierarchies.
- Score: 41.24805506595378
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Hierarchical clustering is a fundamental task often used to discover
meaningful structures in data, such as phylogenetic trees, taxonomies of
concepts, subtypes of cancer, and cascades of particle decays in particle
physics. Typically, approximate algorithms are used for inference due to the
combinatorial number of possible hierarchical clusterings. In contrast to
existing methods, we present novel dynamic-programming algorithms for
\emph{exact} inference in hierarchical clustering based on a novel trellis data
structure, and we prove that we can exactly compute the partition function,
maximum likelihood hierarchy, and marginal probabilities of sub-hierarchies and
clusters. Our algorithms scale in time and space proportional to the size of
the power set of $N$ elements, which is super-exponentially more efficient
than explicitly considering each of the $(2N-3)!!$ possible hierarchies. Also, for larger
datasets where our exact algorithms become infeasible, we introduce an
approximate algorithm based on a sparse trellis that compares well to other
benchmarks. Exact methods are relevant to data analyses in particle physics and
for finding correlations among gene expression in cancer genomics, and we give
examples in both areas, where our algorithms outperform greedy and beam search
baselines. In addition, we consider Dasgupta's cost with synthetic data.
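One way to make the scaling claim concrete: the partition function over all $(2N-3)!!$ hierarchies decomposes recursively over two-way cluster splits, and memoizing that recursion over the $2^N$ clusters gives the power-set scaling quoted above. Below is a minimal Python sketch of that recursion; the sibling potential `psi` is a placeholder for whatever energy model is in use, and the paper's trellis is a dedicated data structure of which this sketch only mirrors the computation.

```python
from functools import lru_cache
from itertools import combinations

def partition_function(elements, psi):
    """Sum of potentials over all binary hierarchies of `elements`.

    `psi(left, right)` scores a sibling pair of clusters (frozensets).
    Memoizing Z over clusters visits each of the 2^N clusters once,
    rather than enumerating all (2N-3)!! trees explicitly.
    """
    @lru_cache(maxsize=None)
    def Z(cluster):
        if len(cluster) == 1:
            return 1.0  # a singleton has exactly one (trivial) hierarchy
        members = list(cluster)
        anchor, others = members[0], members[1:]
        total = 0.0
        # Fixing `anchor` in the left child counts each split exactly once.
        for k in range(len(others)):
            for picked in combinations(others, k):
                left = frozenset((anchor,) + picked)
                right = cluster - left
                total += psi(left, right) * Z(left) * Z(right)
        return total

    return Z(frozenset(elements))
```

As a sanity check, setting `psi` identically to 1 makes `Z` count hierarchies, so `partition_function(range(4), lambda l, r: 1.0)` returns `15.0`, matching $(2 \cdot 4 - 3)!! = 15$. Replacing the sum with a max (and tracking the maximizing splits) recovers a maximum-likelihood hierarchy by the same recursion.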
Related papers
- FLASC: A Flare-Sensitive Clustering Algorithm [0.0]
We present FLASC, an algorithm that detects branches within clusters to identify subpopulations.
Two variants of the algorithm are presented, which trade computational cost for noise robustness.
We show that both variants scale similarly to HDBSCAN* in terms of computational cost and provide stable outputs.
arXiv Detail & Related papers (2023-11-27T14:55:16Z)
- Fast Approximation of Similarity Graphs with Kernel Density Estimation [12.321755440062732]
We present a new algorithm for constructing a similarity graph from a set $X$ of data points in $\mathbb{R}^d$.
Our algorithm is based on the kernel density estimation problem and is applicable to arbitrary kernel functions.
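For orientation, the object being approximated is the dense kernel similarity graph, which is quadratic in $n$ to build naively; a minimal Gaussian-kernel sketch follows (the baseline construction only, not the paper's fast algorithm).

```python
import numpy as np

def gaussian_similarity_graph(X, bandwidth=1.0):
    """Naive O(n^2) similarity graph: W[i, j] = exp(-||x_i - x_j||^2 / 2h^2)."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-sq_dists / (2.0 * bandwidth ** 2))
    np.fill_diagonal(W, 0.0)  # drop self-loops
    return W
```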
arXiv Detail & Related papers (2023-10-21T00:32:47Z)
- Hierarchical clustering with dot products recovers hidden tree structure [53.68551192799585]
In this paper we offer a new perspective on the well-established agglomerative clustering algorithm, focusing on recovery of hierarchical structure.
We recommend a simple variant of the standard algorithm, in which clusters are merged by maximum average dot product and not, for example, by minimum distance or within-cluster variance.
We demonstrate that the tree output by this algorithm provides a bona fide estimate of generative hierarchical structure in data, under a generic probabilistic graphical model.
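A minimal sketch of this merge rule (with naive $O(n^3)$ bookkeeping, not the authors' implementation): repeatedly merge the pair of clusters whose member pairs have the highest average dot product.

```python
import numpy as np

def average_dot_product_linkage(X):
    """Agglomerative merging by maximum average pairwise dot product.

    X: (n, d) array of data points. Returns the merge sequence as
    (left_indices, right_indices) pairs, i.e. an unlabeled dendrogram.
    """
    G = X @ X.T                      # all pairwise dot products, computed once
    clusters = [[i] for i in range(len(X))]
    merges = []
    while len(clusters) > 1:
        best_score, best_pair = -np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                score = G[np.ix_(clusters[a], clusters[b])].mean()
                if score > best_score:
                    best_score, best_pair = score, (a, b)
        a, b = best_pair
        merges.append((clusters[a], clusters[b]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges
```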
arXiv Detail & Related papers (2023-05-24T11:05:12Z)
- Natural Hierarchical Cluster Analysis by Nearest Neighbors with Near-Linear Time Complexity [0.0]
We propose a nearest neighbor based clustering algorithm that results in a naturally defined hierarchy of clusters.
In contrast to agglomerative and divisive hierarchical clustering algorithms, our approach does not depend on the iterative workings of the algorithm.
arXiv Detail & Related papers (2022-03-15T16:03:42Z)
- Hierarchical Clustering: $O(1)$-Approximation for Well-Clustered Graphs [3.2901541059183432]
We present two polynomial-time approximation algorithms for hierarchical clustering.
The significance of our work is demonstrated by the empirical analysis on both synthetic and real-world data sets.
arXiv Detail & Related papers (2021-12-16T17:52:04Z)
- Estimating leverage scores via rank revealing methods and randomization [50.591267188664666]
We study algorithms for estimating the statistical leverage scores of rectangular dense or sparse matrices of arbitrary rank.
Our approach is based on combining rank revealing methods with compositions of dense and sparse randomized dimensionality reduction transforms.
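The paper's exact composition of transforms is its contribution; as a baseline, the standard randomized recipe sketches the matrix, takes a QR factorization of the sketch, and reads approximate leverage scores off the rescaled rows. A minimal sketch assuming a dense Gaussian sketch and full column rank (the paper targets arbitrary rank, which this toy version does not handle):

```python
import numpy as np

def sketched_leverage_scores(A, oversample=4, seed=0):
    """Approximate row leverage scores of a tall, full-column-rank matrix A."""
    n, d = A.shape
    rng = np.random.default_rng(seed)
    S = rng.standard_normal((oversample * d, n)) / np.sqrt(oversample * d)
    _, R = np.linalg.qr(S @ A)       # R captures the column-space geometry
    B = np.linalg.solve(R.T, A.T).T  # B = A @ inv(R), near-orthonormal columns
    return (B * B).sum(axis=1)       # squared row norms ~ leverage scores
```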
arXiv Detail & Related papers (2021-05-23T19:21:55Z)
- Exact and Approximate Hierarchical Clustering Using A* [51.187990314731344]
We introduce a new approach based on A* search for clustering.
We overcome the prohibitively large search space by combining A* with a novel \emph{trellis} data structure.
We empirically demonstrate that our method achieves substantially higher quality results than baselines for a particle physics use case and other clustering benchmarks.
arXiv Detail & Related papers (2021-04-14T18:15:27Z)
- Scalable Hierarchical Agglomerative Clustering [65.66407726145619]
Existing scalable hierarchical clustering methods sacrifice quality for speed.
We present a scalable, agglomerative method for hierarchical clustering that does not sacrifice quality and scales to billions of data points.
arXiv Detail & Related papers (2020-10-22T15:58:35Z)
- Structured Graph Learning for Clustering and Semi-supervised Classification [74.35376212789132]
We propose a graph learning framework to preserve both the local and global structure of data.
Our method uses the self-expressiveness of samples to capture the global structure and an adaptive-neighbor approach to respect the local structure.
Our model is equivalent to a combination of kernel k-means and k-means methods under certain conditions.
arXiv Detail & Related papers (2020-08-31T08:41:20Z)
- Hierarchical clustering of bipartite data sets based on the statistical significance of coincidences [0.0]
We provide a hierarchical clustering algorithm based on a dissimilarity between entities that quantifies the probability that the features shared by two entities are due to mere chance.
The algorithm runs in $O(n^2)$ time when applied to a set of $n$ entities, and its outcome is a dendrogram exhibiting the connections of those entities.
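A natural instantiation of such a dissimilarity (an assumption on our part; the paper's exact null model may differ) is the hypergeometric tail probability of the observed feature overlap:

```python
from scipy.stats import hypergeom

def chance_dissimilarity(features_a, features_b, n_features_total):
    """P-value that two entities share at least this many features by chance,
    under a uniform-sampling (hypergeometric) null. A smaller p-value means a
    more significant overlap, so the p-value itself works as a dissimilarity."""
    k = len(features_a & features_b)  # observed shared features
    return hypergeom.sf(k - 1, n_features_total,
                        len(features_a), len(features_b))
```

Feeding these pairwise values to any standard linkage routine then yields the dendrogram the summary describes.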
arXiv Detail & Related papers (2020-04-27T23:30:18Z)