Hierarchical Clustering: New Bounds and Objective
- URL: http://arxiv.org/abs/2111.06863v1
- Date: Fri, 12 Nov 2021 18:32:23 GMT
- Title: Hierarchical Clustering: New Bounds and Objective
- Authors: Mirmahdi Rahgoshay and Mohammad R. Salavatipour
- Abstract summary: We introduce a new objective function for dissimilarity-based clustering.
For any tree $T$, let $H_{i,j}$ be the number of $i$ and $j$'s common ancestors.
- Score: 1.1295032417617457
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Hierarchical Clustering has been studied and used extensively as a method for
analysis of data. More recently, Dasgupta [2016] defined a precise objective
function. Given a set of $n$ data points with a weight function $w_{i,j}$ for
every two items $i$ and $j$ denoting their similarity/dissimilarity, the goal
is to build a recursive (tree-like) partitioning of the data points (items)
into successively smaller clusters. He defined a cost function for a tree $T$
to be $Cost(T) = \sum_{i,j \in [n]} \big(w_{i,j} \times |T_{i,j}| \big)$ where
$T_{i,j}$ is the subtree rooted at the least common ancestor of $i$ and $j$ and
presented the first approximation algorithm for such clustering. Then Moseley
and Wang [2017] considered the dual of Dasgupta's objective function for
similarity-based weights and showed that both random partitioning and average
linkage have approximation ratio $1/3$, which has been improved in a series of
works to $0.585$ [Alon et al. 2020]. Later, Cohen-Addad et al. [2019] considered
the same objective function as Dasgupta's but for dissimilarity-based metrics,
called $Rev(T)$. They showed that both random partitioning and average linkage
have ratio $2/3$, which has since been only slightly improved to $0.667078$ [Charikar
et al., SODA 2020]. Our first main result is to consider $Rev(T)$ and present a
more delicate algorithm with a careful analysis that achieves an approximation ratio of
$0.71604$. We also introduce a new objective function for dissimilarity-based
clustering. For any tree $T$, let $H_{i,j}$ be the number of $i$ and $j$'s
common ancestors. Intuitively, items that are similar are expected to remain
within the same cluster as deep as possible. So, for dissimilarity-based
metrics, we suggest the cost of each tree $T$, which we want to minimize, to be
$Cost_H(T) = \sum_{i,j \in [n]} \big(w_{i,j} \times H_{i,j} \big)$. We present
a $1.3977$-approximation for this objective.
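To make the objectives above concrete, the following Python sketch evaluates $Cost(T)$/$Rev(T)$ and $Cost_H(T)$ on an explicitly given hierarchy. It only illustrates the definitions, not the paper's approximation algorithms; the tree encoding (nested tuples of item indices) and the toy weights are assumptions made for the example.

```python
# A minimal sketch (not the paper's algorithm): evaluate the tree objectives
# described above on a fixed hierarchy. The tree encoding (nested tuples whose
# leaves are item indices) and the toy weights below are illustrative assumptions.
from itertools import combinations


def annotate(tree):
    """Walk the hierarchy once, recording parent, leaf set and depth per node."""
    parent, leaves_below, depth = {}, {}, {}
    counter = [0]

    def walk(node, par, d):
        if isinstance(node, int):              # leaf = item index
            parent[node], depth[node] = par, d
            leaves_below[node] = {node}
            return node
        nid = ("internal", counter[0])
        counter[0] += 1
        parent[nid], depth[nid] = par, d
        leaves_below[nid] = set()
        for child in node:
            cid = walk(child, nid, d + 1)
            leaves_below[nid] |= leaves_below[cid]
        return nid

    root = walk(tree, None, 1)                 # root has depth 1
    return root, parent, leaves_below, depth


def lca(u, v, parent, depth):
    """Least common ancestor of two leaves, found by climbing the deeper node first."""
    while u != v:
        if depth[u] < depth[v]:
            u, v = v, u
        u = parent[u]
    return u


def objectives(tree, w):
    """w[(i, j)] is the pairwise weight of items i < j (similarity or dissimilarity)."""
    root, parent, leaves_below, depth = annotate(tree)
    items = sorted(leaves_below[root])
    cost = cost_h = 0.0
    for i, j in combinations(items, 2):
        a = lca(i, j, parent, depth)
        cost += w[(i, j)] * len(leaves_below[a])   # w_{i,j} * |T_{i,j}|
        cost_h += w[(i, j)] * depth[a]             # w_{i,j} * H_{i,j} (shared ancestors)
    # Cost(T) and Rev(T) are the same sum: minimized for similarities (Dasgupta),
    # maximized for dissimilarities (Cohen-Addad et al.); Cost_H(T) is minimized.
    return {"Cost/Rev": cost, "Cost_H": cost_h}


if __name__ == "__main__":
    # four items 0..3 clustered as ((0, 1), (2, 3)); the dissimilarities are made up
    tree = ((0, 1), (2, 3))
    w = {(0, 1): 1.0, (0, 2): 4.0, (0, 3): 5.0, (1, 2): 4.0, (1, 3): 5.0, (2, 3): 2.0}
    print(objectives(tree, w))
```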
Related papers
- A Theory of Interpretable Approximations [61.90216959710842]
We study the idea of approximating a target concept $c$ by a small aggregation of concepts from some base class $\mathcal{H}$.
For any given pair of $\mathcal{H}$ and $c$, exactly one of a few cases holds, the first being that (i) $c$ cannot be approximated by $\mathcal{H}$ with arbitrary accuracy.
We show that, in the case of interpretable approximations, even a slightly nontrivial a-priori guarantee on the complexity of approximations implies approximations with constant (distribution-free and accuracy-
arXiv Detail & Related papers (2024-06-15T06:43:45Z)
- A Scalable Algorithm for Individually Fair K-means Clustering [77.93955971520549]
We present a scalable algorithm for the individually fair ($p$, $k$)-clustering problem introduced by Jung et al. and Mahabadi et al.
A clustering is then called individually fair if it has centers within distance $\delta(x)$ of $x$ for each $x \in P$.
We show empirically that not only is our algorithm much faster than prior work, but it also produces lower-cost solutions.
arXiv Detail & Related papers (2024-02-09T19:01:48Z)
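As a concrete reading of the fairness condition above, here is a minimal Python sketch that checks whether a set of centers is individually fair for a given radius function $\delta$. It only illustrates the definition under assumed input formats (coordinate tuples and a callable radius); it is not the scalable algorithm the paper presents.

```python
import math


def is_individually_fair(points, centers, delta):
    """Check the condition above: every x in P has a center within distance delta(x).

    `points` and `centers` are lists of coordinate tuples and `delta` maps a point
    to its fairness radius; both are illustrative assumptions about the input format.
    """
    return all(
        min(math.dist(x, c) for c in centers) <= delta(x)
        for x in points
    )


# Example: three points on a line, radius 1.5 for every point, one center at the origin.
# is_individually_fair([(0.0,), (1.0,), (3.0,)], [(0.0,)], lambda x: 1.5) -> False
```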
- Simple, Scalable and Effective Clustering via One-Dimensional Projections [10.807367640692021]
Clustering is a fundamental problem in unsupervised machine learning with many applications in data analysis.
We introduce a simple randomized clustering algorithm that provably runs in expected time $O(\mathrm{nnz}(X) + n\log n)$ for arbitrary $k$.
We prove that our algorithm achieves approximation ratio $\widetilde{O}(k^4)$ on any input dataset for the $k$-means objective.
arXiv Detail & Related papers (2023-10-25T16:37:45Z)
- Do you know what q-means? [50.045011844765185]
Clustering is one of the most important tools for analysis of large datasets.
We present an improved version of the "$q$-means" algorithm for clustering.
We also present a "dequantized" algorithm for $varepsilon which runs in $Obig(frack2varepsilon2(sqrtkd + log(Nd))big.
arXiv Detail & Related papers (2023-08-18T17:52:12Z)
- Approximating Fair Clustering with Cascaded Norm Objectives [10.69111036810888]
We find a clustering which minimizes the $\ell_q$-norm of the vector over $W$ of the $\ell_p$-norms of the weighted distances of points in $P$ from the centers.
This generalizes various clustering problems, including Socially Fair $k$-Median and $k$-Means.
arXiv Detail & Related papers (2021-11-08T20:18:10Z)
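A minimal Python sketch of the cascaded objective above, assuming $W$ is a collection of per-point weight vectors and $d(x, C)$ is the distance from $x$ to its nearest center; these input conventions are assumptions for illustration, not the paper's formulation verbatim.

```python
import math


def cascaded_norm_cost(points, centers, weight_vectors, p, q):
    """ell_q-norm over W of the ell_p-norms of weighted point-to-center distances."""
    # d(x, C): distance from each point to its nearest center
    dists = [min(math.dist(x, c) for c in centers) for x in points]
    # inner ell_p-norm for each weight vector w in W
    inner = [
        sum((wx * d) ** p for wx, d in zip(w, dists)) ** (1.0 / p)
        for w in weight_vectors
    ]
    # outer ell_q-norm over the |W| inner values
    return sum(v ** q for v in inner) ** (1.0 / q)
```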
- Near-optimal Algorithms for Explainable k-Medians and k-Means [12.68470213641421]
We consider the problem of explainable $k$-medians and $k$-means introduced by Dasgupta, Frost, Moshkovitz, and Rashtchian (ICML 2020).
Our goal is to find an optimal decision tree that partitions data into $k$ clusters and minimizes the $k$-medians or $k$-means objective.
arXiv Detail & Related papers (2021-07-02T02:07:12Z)
- Nearly-Tight and Oblivious Algorithms for Explainable Clustering [8.071379672971542]
We study the problem of explainable clustering in the setting first formalized by Moshkovitz, Dasgupta, Rashtchian, and Frost (ICML 2020)
A $k$-clustering is said to be explainable if it is given by a decision tree where each internal node splits data points with a threshold cut in a single dimension (feature).
We give an algorithm that outputs an explainable clustering that loses at most a factor of $O(\log^2 k)$ compared to an optimal (not necessarily explainable) clustering for the $k$-medians objective.
arXiv Detail & Related papers (2021-06-30T15:49:41Z)
- Near-Optimal Explainable $k$-Means for All Dimensions [13.673697350508965]
We show an efficient algorithm that finds an explainable clustering whose $k$-means cost is at most $k^{1-2/d}\,\mathrm{poly}(d\log k)$ times the optimum.
We get an improved bound of $k^{1-2/d}\,\mathrm{polylog}(k)$, which we show is optimal for every choice of $k, d \ge 2$ up to a poly-logarithmic factor in $k$.
arXiv Detail & Related papers (2021-06-29T16:59:03Z)
- Clustering Mixture Models in Almost-Linear Time via List-Decodable Mean Estimation [58.24280149662003]
We study the problem of list-decodable mean estimation, where an adversary can corrupt a majority of the dataset.
We develop new algorithms for list-decodable mean estimation, achieving nearly-optimal statistical guarantees.
arXiv Detail & Related papers (2021-06-16T03:34:14Z)
- List-Decodable Mean Estimation in Nearly-PCA Time [50.79691056481693]
We study the fundamental task of list-decodable mean estimation in high dimensions.
Our algorithm runs in time $\widetilde{O}(ndk)$ for all $k = O(\sqrt{d}) \cup \Omega(d)$, where $n$ is the size of the dataset.
A variant of our algorithm has runtime $\widetilde{O}(ndk)$ for all $k$, at the expense of an $O(\sqrt{\log k})$ factor in the recovery guarantee.
arXiv Detail & Related papers (2020-11-19T17:21:37Z)
- A Randomized Algorithm to Reduce the Support of Discrete Measures [79.55586575988292]
Given a discrete probability measure supported on $N$ atoms and a set of $n$ real-valued functions, there exists a probability measure that is supported on a subset of $n+1$ of the original $N$ atoms.
We give a simple geometric characterization of barycenters via negative cones and derive a randomized algorithm that computes this new measure by "greedy geometric sampling".
We then study its properties, and benchmark it on synthetic and real-world data to show that it can be very beneficial in the $N \gg n$ regime.
arXiv Detail & Related papers (2020-06-02T16:38:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.