On Margin-Based Cluster Recovery with Oracle Queries
- URL: http://arxiv.org/abs/2106.04913v1
- Date: Wed, 9 Jun 2021 08:48:23 GMT
- Title: On Margin-Based Cluster Recovery with Oracle Queries
- Authors: Marco Bressan, Nicol\`o Cesa-Bianchi, Silvio Lattanzi, Andrea Paudice
- Abstract summary: We study an active cluster recovery problem where, given a set of $n$ points oracle and an answering queries like "are these two points in the same cluster?"
We give an algorithm that recovers arbitrary convex clusters in exactly time, and with a number of queries that is lower than the best existing algorithm by $Theta(mm)$ factors.
For general pseudometric spaces, where clusters might not be convex or might not have any notion of shape, we give an algorithm that achieves the $O(log n)$ query bound, and is provably near optimal.
- Score: 22.672233769934845
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study an active cluster recovery problem where, given a set of $n$ points
and an oracle answering queries like "are these two points in the same
cluster?", the task is to recover exactly all clusters using as few queries as
possible. We begin by introducing a simple but general notion of margin between
clusters that captures, as special cases, the margins used in previous work,
the classic SVM margin, and standard notions of stability for center-based
clusterings. Then, under our margin assumptions we design algorithms that, in a
variety of settings, recover all clusters exactly using only $O(\log n)$
queries. For the Euclidean case, $\mathbb{R}^m$, we give an algorithm that
recovers arbitrary convex clusters, in polynomial time, and with a number of
queries that is lower than the best existing algorithm by $\Theta(m^m)$
factors. For general pseudometric spaces, where clusters might not be convex or
might not have any notion of shape, we give an algorithm that achieves the
$O(\log n)$ query bound, and is provably near-optimal as a function of the
packing number of the space. Finally, for clusterings realized by binary
concept classes, we give a combinatorial characterization of recoverability
with $O(\log n)$ queries, and we show that, for many concept classes in
Euclidean spaces, this characterization is equivalent to our margin condition.
Our results show a deep connection between cluster margins and active cluster
recoverability.
Related papers
- A Unified Framework for Gradient-based Clustering of Distributed Data [51.904327888475606]
We develop a family of distributed clustering algorithms that work over networks of users.
DGC-$mathcalF_rho$ is specialized to popular clustering losses like $K$-means and Huber loss.
We show that consensus fixed points of DGC-$mathcalF_rho$ are equivalent to fixed points of gradient clustering over the full data.
arXiv Detail & Related papers (2024-02-02T10:44:42Z) - Instance-Optimal Cluster Recovery in the Labeled Stochastic Block Model [79.46465138631592]
We devise an efficient algorithm that recovers clusters using the observed labels.
We present Instance-Adaptive Clustering (IAC), the first algorithm whose performance matches these lower bounds both in expectation and with high probability.
arXiv Detail & Related papers (2023-06-18T08:46:06Z) - Recovering Unbalanced Communities in the Stochastic Block Model With
Application to Clustering with a Faulty Oracle [9.578056676899203]
oracle block model (SBM) is a fundamental model for studying graph clustering or community detection in networks.
We provide a simple SVD-based algorithm for recovering the communities in the SBM with communities of varying sizes.
arXiv Detail & Related papers (2022-02-17T08:51:19Z) - Lattice-Based Methods Surpass Sum-of-Squares in Clustering [98.46302040220395]
Clustering is a fundamental primitive in unsupervised learning.
Recent work has established lower bounds against the class of low-degree methods.
We show that, perhaps surprisingly, this particular clustering model textitdoes not exhibit a statistical-to-computational gap.
arXiv Detail & Related papers (2021-12-07T18:50:17Z) - Fuzzy Clustering with Similarity Queries [56.96625809888241]
The fuzzy or soft objective is a popular generalization of the well-known $k$-means problem.
We show that by making few queries, the problem becomes easier to solve.
arXiv Detail & Related papers (2021-06-04T02:32:26Z) - K-expectiles clustering [0.0]
We propose a novel partitioning clustering algorithm based on expectiles.
We suggest two schemes: fixed $tau$ clustering, and adaptive $tau$ clustering.
arXiv Detail & Related papers (2021-03-16T21:14:56Z) - Exact Recovery of Clusters in Finite Metric Spaces Using Oracle Queries [22.672233769934845]
We investigate the problem of exact cluster dependence using oracle queries.
We show that if we are allowed to make $(k2 + k )$, an additional queries can recover different different and arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary arbitrary
arXiv Detail & Related papers (2021-01-31T18:00:29Z) - Scalable Hierarchical Agglomerative Clustering [65.66407726145619]
Existing scalable hierarchical clustering methods sacrifice quality for speed.
We present a scalable, agglomerative method for hierarchical clustering that does not sacrifice quality and scales to billions of data points.
arXiv Detail & Related papers (2020-10-22T15:58:35Z) - Exact Recovery of Mangled Clusters with Same-Cluster Queries [20.03712152278538]
We study the cluster recovery problem in the semi-supervised active clustering framework.
We design an algorithm that, given $n$ points to be partitioned into $k$ clusters, uses $O(k3 ln k ln n)$ oracle queries and $tildeO(kn + k3)$ time to recover the clustering with zero misclassification error.
arXiv Detail & Related papers (2020-06-08T15:27:58Z) - Computationally efficient sparse clustering [67.95910835079825]
We provide a finite sample analysis of a new clustering algorithm based on PCA.
We show that it achieves the minimax optimal misclustering rate in the regime $|theta infty$.
arXiv Detail & Related papers (2020-05-21T17:51:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.