Shallow decision trees for explainable $k$-means clustering
- URL: http://arxiv.org/abs/2112.14718v1
- Date: Wed, 29 Dec 2021 18:22:28 GMT
- Title: Shallow decision trees for explainable $k$-means clustering
- Authors: Eduardo Laber, Lucas Murtinho, Felipe Oliveira
- Abstract summary: We propose an efficient algorithm that takes into account metrics related to the depths of the leaves in a decision tree.
In experiments on 16 datasets, our algorithm yields better results than existing decision-tree clustering algorithms, typically achieving lower or equivalent costs with considerably shallower trees.
- Score: 1.2891210250935146
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: A number of recent works have employed decision trees for the construction of
explainable partitions that aim to minimize the $k$-means cost function. These
works, however, largely ignore metrics related to the depths of the leaves in
the resulting tree, which is perhaps surprising considering how the
explainability of a decision tree depends on these depths. To fill this gap in
the literature, we propose an efficient algorithm that takes into account these
metrics. In experiments on 16 datasets, our algorithm yields better results
than decision-tree clustering algorithms such as the ones presented in
Dasgupta et al. (2020), Frost et al. (2020), Laber and Murtinho (2021),
and Makarychev and Shan (2021), typically achieving lower or
equivalent costs with considerably shallower trees. We also show, through a
simple adaptation of existing techniques, that the problem of building
explainable partitions induced by binary trees for the $k$-means cost function
does not admit a $(1+\epsilon)$-approximation in polynomial time unless
$P=NP$, which justifies the quest for approximation algorithms and/or
heuristics.
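To make the setup concrete, below is a minimal Python sketch of the generic "threshold tree" approach this line of work builds on; it is not the authors' algorithm, and the cut-selection rule and all names are illustrative. Given the $k$ centers of a standard $k$-means solution, it recursively separates them with axis-aligned cuts until each leaf isolates one center, so the $k$ leaves induce an explainable partition; the papers above differ precisely in how cuts are chosen to control the $k$-means cost and the depths of the leaves.

```python
import numpy as np

def build_threshold_tree(centers):
    """Recursively separate k-means centers with axis-aligned cuts.

    Returns a nested dict: internal nodes hold (feature, threshold),
    leaves hold the index of the single remaining center.
    Assumes pairwise-distinct centers.
    """
    def split(ids):
        if len(ids) == 1:
            return {"leaf": int(ids[0])}
        sub = centers[ids]
        # Illustrative cut rule: take the feature with the widest spread
        # among the remaining centers and cut at its midpoint. Real
        # algorithms pick cuts to bound the cost and/or the leaf depths.
        f = int(np.argmax(sub.max(axis=0) - sub.min(axis=0)))
        t = (sub[:, f].min() + sub[:, f].max()) / 2.0
        left, right = ids[centers[ids, f] <= t], ids[centers[ids, f] > t]
        return {"feature": f, "threshold": t,
                "left": split(left), "right": split(right)}

    return split(np.arange(len(centers)))

def assign(tree, x):
    """Route a point down the tree to its (explainable) cluster."""
    while "leaf" not in tree:
        tree = tree["left"] if x[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree["leaf"]
```

The centers can come from any standard solver (e.g., scikit-learn's `KMeans(...).cluster_centers_`); each point is then assigned by at most depth-many single-feature comparisons, which is why leaf depth is a natural proxy for explainability.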
Related papers
- LiteSearch: Efficacious Tree Search for LLM [70.29796112457662]
This study introduces a novel guided tree search algorithm with dynamic node selection and node-level exploration budget.
Experiments conducted on the GSM8K and TabMWP datasets demonstrate that our approach enjoys significantly lower computational costs compared to baseline methods.
arXiv Detail & Related papers (2024-06-29T05:14:04Z) - Perturb-and-Project: Differentially Private Similarities and Marginals [73.98880839337873]
We revisit the input perturbations framework for differential privacy, where noise is added to the input $A \in \mathcal{S}$.
We first design novel efficient algorithms to privately release pair-wise cosine similarities.
We derive a novel algorithm to compute $k$-way marginal queries over $n$ features.
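As a point of reference, here is a minimal sketch of the generic input-perturbation idea the summary alludes to, not the paper's mechanism: perturb the exact pairwise cosine-similarity matrix with Gaussian noise before release. The scale `sigma` is a placeholder; a real mechanism calibrates it to the privacy parameters via a sensitivity analysis.

```python
import numpy as np

def noisy_cosine_similarities(X, sigma, rng=None):
    """Release the pairwise cosine similarities of the rows of X,
    perturbed with i.i.d. Gaussian noise of scale sigma (placeholder;
    not calibrated to any specific (epsilon, delta) guarantee)."""
    rng = np.random.default_rng() if rng is None else rng
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    U = X / np.clip(norms, 1e-12, None)  # unit-normalize rows
    S = U @ U.T                          # exact cosine similarities
    return S + rng.normal(0.0, sigma, size=S.shape)
```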
arXiv Detail & Related papers (2024-06-07T12:07:16Z) - On Computing Optimal Tree Ensembles [7.424944196676223]
Random forests and, more generally, (decision-)tree ensembles are widely used methods for classification and regression.
Recent algorithmic advances make it possible to compute decision trees that are optimal for various measures such as their size or depth.
We provide two novel algorithms and corresponding lower bounds.
arXiv Detail & Related papers (2023-06-07T13:30:43Z) - Unbiased and Efficient Sampling of Dependency Trees [0.0]
Most treebanks require that every valid dependency tree has a single edge coming out of the ROOT node.
Zmigrod et al. have recently proposed algorithms for sampling with and without replacement from the single-root dependency tree distribution.
We show that their fastest algorithm for sampling with replacement, Wilson-RC, is in fact biased.
arXiv Detail & Related papers (2022-05-25T09:57:28Z) - Active-LATHE: An Active Learning Algorithm for Boosting the Error Exponent for Learning Homogeneous Ising Trees [75.93186954061943]
We design and analyze an algorithm that boosts the error exponent by at least 40% when $\rho$ is at least $0.8$.
Our analysis hinges on judiciously exploiting the minute but detectable statistical variation of the samples to allocate more data to parts of the graph.
arXiv Detail & Related papers (2021-10-27T10:45:21Z) - On Finding the $K$-best Non-projective Dependency Trees [39.5549669100436]
We provide a simplification of the $K$-best spanning tree algorithm of Camerini et al. (1980).
We present a novel extension of the algorithm for decoding the $K$-best dependency trees of a graph which are subject to a root constraint.
arXiv Detail & Related papers (2021-06-01T20:23:41Z) - Exact and Approximate Hierarchical Clustering Using A* [51.187990314731344]
We introduce a new approach based on A* search for clustering.
We overcome the prohibitively large search space by combining A* with a novel trellis data structure.
We empirically demonstrate that our method achieves substantially higher quality results than baselines for a particle physics use case and other clustering benchmarks.
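For readers unfamiliar with the building block, below is a minimal sketch of classical A* best-first search; the trellis data structure and the clustering-specific search space of the paper are beyond this illustration, and `neighbors` and `heuristic` are placeholders supplied by the caller.

```python
import heapq
import itertools

def a_star(start, goal, neighbors, heuristic):
    """Return (path, cost) of a cheapest path from start to goal.

    neighbors(n) yields (successor, edge_cost) pairs; if heuristic(n)
    never overestimates the remaining cost, the result is optimal.
    """
    tie = itertools.count()  # tie-breaker so nodes are never compared
    frontier = [(heuristic(start), next(tie), 0.0, start, [start])]
    best_g = {start: 0.0}
    while frontier:
        _, _, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, g
        if g > best_g.get(node, float("inf")):
            continue  # stale queue entry, superseded by a cheaper one
        for nxt, cost in neighbors(node):
            g2 = g + cost
            if g2 < best_g.get(nxt, float("inf")):
                best_g[nxt] = g2
                heapq.heappush(
                    frontier, (g2 + heuristic(nxt), next(tie), g2, nxt, path + [nxt])
                )
    return None, float("inf")
```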
arXiv Detail & Related papers (2021-04-14T18:15:27Z) - SGA: A Robust Algorithm for Partial Recovery of Tree-Structured Graphical Models with Noisy Samples [75.32013242448151]
We consider learning Ising tree models when the observations from the nodes are corrupted by independent but non-identically distributed noise.
Katiyar et al. (2020) showed that although the exact tree structure cannot be recovered, one can recover a partial tree structure.
We propose Symmetrized Geometric Averaging (SGA), a more statistically robust algorithm for partial tree recovery.
arXiv Detail & Related papers (2021-01-22T01:57:35Z) - On the price of explainability for some clustering problems [1.52292571922932]
We provide upper and lower bounds for a natural model where explainability is achieved via decision trees.
For the $k$-means and $k$-medians problems our upper bounds improve those obtained by [Moshkovitz et al., ICML 2020] for low dimensions.
Another contribution is a simple and efficient algorithm for building explainable clusterings for the $k$-means problem.
arXiv Detail & Related papers (2021-01-05T15:08:25Z) - On $\ell_p$-norm Robustness of Ensemble Stumps and Trees [83.81523991945018]
We develop an efficient dynamic-programming-based algorithm for sound verification of ensemble stumps.
We demonstrate the first certified defense method for training ensemble stumps and trees with respect to $\ell_p$-norm perturbations.
arXiv Detail & Related papers (2020-08-20T03:42:40Z) - Explainable $k$-Means and $k$-Medians Clustering [25.513261099927163]
We consider using a small decision tree to partition a data set into clusters, so that clusters can be characterized in a straightforward manner.
We show that popular top-down decision tree algorithms may lead to clusterings with arbitrarily large cost.
We design an efficient algorithm that produces explainable clusters using a tree with $k$ leaves.
arXiv Detail & Related papers (2020-02-28T04:21:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the generated content (including all information) and is not responsible for any consequences of its use.