How to Find a Good Explanation for Clustering?
- URL: http://arxiv.org/abs/2112.06580v2
- Date: Thu, 16 Dec 2021 15:16:18 GMT
- Title: How to Find a Good Explanation for Clustering?
- Authors: Sayan Bandyapadhyay, Fedor V. Fomin, Petr A. Golovach, William Lochet,
Nidhi Purohit, Kirill Simonov
- Abstract summary: Moshkovitz, Dasgupta, Rashtchian, and Frost [ICML 2020] proposed an elegant model of explainable $k$-means and $k$-median clustering.
We study two natural algorithmic questions about explainable clustering.
Our rigorous algorithmic analysis sheds some light on the influence of parameters like the input size, dimension of the data, the number of outliers, the number of clusters, and the approximation ratio, on the computational complexity of explainable clustering.
- Score: 7.951746797489421
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: $k$-means and $k$-median clustering are powerful unsupervised machine
learning techniques. However, due to complicated dependencies on all the
features, the resulting cluster assignments are challenging to interpret.
Moshkovitz, Dasgupta, Rashtchian, and Frost [ICML 2020] proposed an elegant
model of explainable $k$-means and $k$-median clustering. In this model, a
decision tree with $k$ leaves provides a straightforward characterization of
the data set into clusters.
We study two natural algorithmic questions about explainable clustering. (1)
For a given clustering, how to find the "best explanation" by using a decision
tree with $k$ leaves? (2) For a given set of points, how to find a decision
tree with $k$ leaves minimizing the $k$-means/median objective of the resulting
explainable clustering? To address the first question, we introduce a new model
of explainable clustering. Our model, inspired by the notion of outliers in
robust statistics, is the following. We are seeking a small number of points
(outliers) whose removal makes the existing clustering well-explainable. To
address the second question, we initiate the study of the model of
Moshkovitz et al. from the perspective of multivariate complexity. Our rigorous
algorithmic analysis sheds some light on the influence of parameters like the
input size, dimension of the data, the number of outliers, the number of
clusters, and the approximation ratio, on the computational complexity of
explainable clustering.
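To make the threshold-tree model concrete, below is a minimal Python sketch, not the authors' implementation: given a reference $k$-means clustering, it greedily picks the axis-aligned cut that misassigns the fewest points (in the spirit of the Iterative Mistake Minimization rule of Moshkovitz et al.), and the points the finished tree misassigns play the role of the outliers from question (1). The function names and the synthetic data are illustrative assumptions.

```python
# A minimal sketch of explainable clustering with a threshold tree.
# Not the authors' code: the greedy "fewest mistakes" cut rule and all
# names/data here are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans

def build_threshold_tree(X, labels, centers):
    """Recursively apply axis-aligned cuts [feature <= threshold] until
    every leaf contains the center of exactly one reference cluster."""
    def recurse(idx, center_ids):
        if len(center_ids) == 1:
            return ("leaf", center_ids[0], idx)
        best = None
        # Candidate thresholds are the center coordinates themselves; any
        # such cut separates at least two of the remaining centers.  Score
        # a cut by how many points it splits away from their own center.
        for f in range(X.shape[1]):
            for t in np.unique(centers[center_ids, f])[:-1]:
                left_c = [c for c in center_ids if centers[c, f] <= t]
                right_c = [c for c in center_ids if centers[c, f] > t]
                go_left = X[idx, f] <= t
                mistakes = (np.sum(go_left & np.isin(labels[idx], right_c))
                            + np.sum(~go_left & np.isin(labels[idx], left_c)))
                if best is None or mistakes < best[0]:
                    best = (mistakes, f, t, left_c, right_c)
        _, f, t, left_c, right_c = best
        go_left = X[idx, f] <= t
        return ("node", f, t,
                recurse(idx[go_left], left_c),
                recurse(idx[~go_left], right_c))
    return recurse(np.arange(len(X)), list(range(len(centers))))

def tree_labels(tree, out):
    # Read off the cluster that each leaf assigns to its points.
    if tree[0] == "leaf":
        out[tree[2]] = tree[1]
    else:
        tree_labels(tree[3], out)
        tree_labels(tree[4], out)

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2)) + rng.choice([-4.0, 0.0, 4.0], size=(300, 1))
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
tree = build_threshold_tree(X, km.labels_, km.cluster_centers_)
explained = np.empty(len(X), dtype=int)
tree_labels(tree, explained)
print("points the tree misassigns (candidate outliers):",
      int((explained != km.labels_).sum()))
```

Each inner node tests a single feature against a threshold, so the path to a leaf reads as a short conjunction of feature bounds, which is exactly what makes the resulting clustering explainable.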
Related papers
- Reinforcement Graph Clustering with Unknown Cluster Number [91.4861135742095]
We propose a new deep graph clustering method termed Reinforcement Graph Clustering.
In our proposed method, cluster number determination and unsupervised representation learning are unified in a single framework.
To provide feedback, a clustering-oriented reward function is proposed that strengthens cohesion within clusters and separation between different clusters.
arXiv Detail & Related papers (2023-08-13T18:12:28Z)
- Instance-Optimal Cluster Recovery in the Labeled Stochastic Block Model [79.46465138631592]
We devise an efficient algorithm that recovers clusters using the observed labels.
We present Instance-Adaptive Clustering (IAC), the first algorithm whose performance matches these lower bounds both in expectation and with high probability.
arXiv Detail & Related papers (2023-06-18T08:46:06Z)
- Lattice-Based Methods Surpass Sum-of-Squares in Clustering [98.46302040220395]
Clustering is a fundamental primitive in unsupervised learning.
Recent work has established lower bounds against the class of low-degree methods.
We show that, perhaps surprisingly, this particular clustering model does not exhibit a statistical-to-computational gap.
arXiv Detail & Related papers (2021-12-07T18:50:17Z)
- Explainable k-means. Don't be greedy, plant bigger trees! [12.68470213641421]
We provide a new bi-criteria $\tilde{O}(\log^2 k)$ competitive algorithm for explainable $k$-means clustering.
Explainable $k$-means was recently introduced by Dasgupta, Frost, Moshkovitz, and Rashtchian (ICML 2020).
arXiv Detail & Related papers (2021-11-04T23:15:17Z)
- Near-Optimal Explainable $k$-Means for All Dimensions [13.673697350508965]
We show an efficient algorithm that finds an explainable clustering whose $k$-means cost is at most $k^{1-2/d}\,\mathrm{poly}(d \log k)$ times the optimum.
We get an improved bound of $k^{1-2/d}\,\mathrm{polylog}(k)$, which we show is optimal for every choice of $k, d \ge 2$ up to a poly-logarithmic factor in $k$.
arXiv Detail & Related papers (2021-06-29T16:59:03Z)
- Fuzzy Clustering with Similarity Queries [56.96625809888241]
The fuzzy or soft objective is a popular generalization of the well-known $k$-means problem.
We show that by making a few queries, the problem becomes easier to solve.
arXiv Detail & Related papers (2021-06-04T02:32:26Z)
- Scalable Hierarchical Agglomerative Clustering [65.66407726145619]
Existing scalable hierarchical clustering methods sacrifice quality for speed.
We present a scalable, agglomerative method for hierarchical clustering that does not sacrifice quality and scales to billions of data points.
arXiv Detail & Related papers (2020-10-22T15:58:35Z)
- Exact Recovery of Mangled Clusters with Same-Cluster Queries [20.03712152278538]
We study the cluster recovery problem in the semi-supervised active clustering framework.
We design an algorithm that, given $n$ points to be partitioned into $k$ clusters, uses $O(k^3 \ln k \ln n)$ oracle queries and $\tilde{O}(kn + k^3)$ time to recover the clustering with zero misclassification error.
arXiv Detail & Related papers (2020-06-08T15:27:58Z)
- ExKMC: Expanding Explainable $k$-Means Clustering [19.702066333512548]
We study algorithms for $k$-means clustering, focusing on a trade-off between explainability and accuracy.
We develop a new explainable $k$-means clustering algorithm, ExKMC, that takes an additional parameter $k' \geq k$ and outputs a decision tree with $k'$ leaves.
ExKMC produces a low-cost clustering, outperforming both standard decision tree methods and other algorithms for explainable clustering; a hedged usage sketch appears after this list.
arXiv Detail & Related papers (2020-06-03T17:14:55Z)
- Computationally efficient sparse clustering [67.95910835079825]
We provide a finite sample analysis of a new clustering algorithm based on PCA.
We show that it achieves the minimax optimal misclustering rate in the regime $\|\theta\| \to \infty$.
arXiv Detail & Related papers (2020-05-21T17:51:30Z)
- Explainable $k$-Means and $k$-Medians Clustering [25.513261099927163]
We consider using a small decision tree to partition a data set into clusters, so that clusters can be characterized in a straightforward manner.
We show that popular top-down decision tree algorithms may lead to clusterings with arbitrarily large cost.
We design an efficient algorithm that produces explainable clusters using a tree with $k$ leaves.
arXiv Detail & Related papers (2020-02-28T04:21:53Z)
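The ExKMC entry above comes with a public Python package; the following is a hedged usage sketch, not an authoritative example. The `Tree(k, max_leaves)` constructor and the `fit_predict` call follow the package's README at the time of writing and should be treated as assumptions, as should the synthetic data.

```python
# Hedged sketch of the ExKMC package (pip install ExKMC).  The interface
# below follows the package README and is an assumption; it is not
# guaranteed to match every release.
import numpy as np
from ExKMC.Tree import Tree

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2)) + rng.choice([-4.0, 0.0, 4.0], size=(300, 1))

# max_leaves plays the role of the paper's k' >= k: extra leaves trade
# explanation size for a lower k-means cost.
tree = Tree(k=3, max_leaves=6)
labels = tree.fit_predict(X)
print("explainable cluster sizes:",
      np.bincount(np.asarray(labels, dtype=int)))
```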
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides (including all content) and is not responsible for any consequences arising from its use.