Exact Recovery of Clusters in Finite Metric Spaces Using Oracle Queries
- URL: http://arxiv.org/abs/2102.00504v1
- Date: Sun, 31 Jan 2021 18:00:29 GMT
- Title: Exact Recovery of Clusters in Finite Metric Spaces Using Oracle Queries
- Authors: Marco Bressan, Nicolò Cesa-Bianchi, Silvio Lattanzi, Andrea Paudice
- Abstract summary: We investigate the problem of exact cluster recovery using oracle queries.
We show that if we are allowed to make $O(k^2 + k \log n)$ additional queries to a "cluster separation" oracle, we can recover clusters that have different and arbitrary scales, even when the scale of each cluster is unknown.
- Score: 22.672233769934845
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We investigate the problem of exact cluster recovery using oracle queries.
Previous results show that clusters in Euclidean spaces that are convex and
separated with a margin can be reconstructed exactly using only $O(\log n)$
same-cluster queries, where $n$ is the number of input points. In this work, we
study this problem in the more challenging non-convex setting. We introduce a
structural characterization of clusters, called $(\beta,\gamma)$-convexity,
that can be applied to any finite set of points equipped with a metric (or even
a semimetric, as the triangle inequality is not needed). Using
$(\beta,\gamma)$-convexity, we can translate natural density properties of
clusters (which include, for instance, clusters that are strongly non-convex in
$R^d$) into a graph-theoretic notion of convexity. By exploiting this convexity
notion, we design a deterministic algorithm that recovers
$(\beta,\gamma)$-convex clusters using $O(k^2 \log n + k^2
(\frac{6}{\beta\gamma})^{dens(X)})$ same-cluster queries, where $k$ is the
number of clusters and $dens(X)$ is the density dimension of the semimetric. We
show that an exponential dependence on the density dimension is necessary, and
we also show that, if we are allowed to make $O(k^2 + k \log n)$ additional
queries to a "cluster separation" oracle, then we can recover clusters that
have different and arbitrary scales, even when the scale of each cluster is
unknown.
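To make the query model concrete, the sketch below is a naive baseline, not the paper's algorithm: it compares each point against one representative per discovered cluster, spending up to $k$ same-cluster queries per point and $O(nk)$ queries overall, which is exactly the cost the paper's $(\beta,\gamma)$-convexity machinery avoids. The names `SameClusterOracle` and `naive_recovery` are hypothetical and introduced only for illustration.

```python
# A minimal sketch of the same-cluster query model, NOT the paper's algorithm.
# This naive baseline spends up to k queries per point (O(nk) in total), whereas
# the paper's deterministic algorithm needs only
# O(k^2 log n + k^2 (6/(beta*gamma))^dens(X)) queries under (beta,gamma)-convexity.

from typing import Callable, Hashable, List

# Answers "are x and y in the same cluster?" for two input points.
SameClusterOracle = Callable[[Hashable, Hashable], bool]

def naive_recovery(points: List[Hashable],
                   same_cluster: SameClusterOracle) -> List[List[Hashable]]:
    """Recover the clustering exactly by comparing each point against one
    representative per already-discovered cluster."""
    clusters: List[List[Hashable]] = []
    for x in points:
        for cluster in clusters:
            # One same-cluster query against the cluster's first point.
            if same_cluster(x, cluster[0]):
                cluster.append(x)
                break
        else:
            clusters.append([x])  # no match: x starts a new cluster
    return clusters

if __name__ == "__main__":
    # Toy ground truth: cluster id = value mod 3 (purely illustrative).
    data = list(range(12))
    oracle: SameClusterOracle = lambda a, b: (a % 3) == (b % 3)
    print(naive_recovery(data, oracle))
    # -> [[0, 3, 6, 9], [1, 4, 7, 10], [2, 5, 8, 11]]
```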
Related papers
- Do you know what q-means? [50.045011844765185]
Clustering is one of the most important tools for analysis of large datasets.
We present an improved version of the "$q$-means" algorithm for clustering.
We also present a "dequantized" algorithm for $\varepsilon$-$k$-means which runs in $O\big(\frac{k^2}{\varepsilon^2}(\sqrt{k}d + \log(Nd))\big)$ time.
arXiv Detail & Related papers (2023-08-18T17:52:12Z) - Instance-Optimal Cluster Recovery in the Labeled Stochastic Block Model [79.46465138631592]
We devise an efficient algorithm that recovers clusters using the observed labels.
We present Instance-Adaptive Clustering (IAC), the first algorithm whose performance matches these lower bounds both in expectation and with high probability.
arXiv Detail & Related papers (2023-06-18T08:46:06Z) - Lattice-Based Methods Surpass Sum-of-Squares in Clustering [98.46302040220395]
Clustering is a fundamental primitive in unsupervised learning.
Recent work has established lower bounds against the class of low-degree methods.
We show that, perhaps surprisingly, this particular clustering model does not exhibit a statistical-to-computational gap.
arXiv Detail & Related papers (2021-12-07T18:50:17Z) - Near-Optimal Explainable $k$-Means for All Dimensions [13.673697350508965]
We show an efficient algorithm that finds an explainable clustering whose $k$-means cost is at most $k^{1-2/d}\,\mathrm{poly}(d\log k)$ times the optimal (unconstrained) cost.
We get an improved bound of $k^{1-2/d}\,\mathrm{polylog}(k)$, which we show is optimal for every choice of $k, d \ge 2$ up to a poly-logarithmic factor in $k$.
arXiv Detail & Related papers (2021-06-29T16:59:03Z) - On Margin-Based Cluster Recovery with Oracle Queries [22.672233769934845]
We study an active cluster recovery problem where, given a set of $n$ points and an oracle answering queries like "are these two points in the same cluster?", the goal is to recover all clusters exactly.
We give an algorithm that recovers arbitrary convex clusters exactly and in polynomial time, with a number of queries that is lower than that of the best existing algorithm by $\Theta(m^m)$ factors.
For general pseudometric spaces, where clusters might not be convex or might not have any notion of shape, we give an algorithm that achieves the $O(\log n)$ query bound, and is provably near optimal.
arXiv Detail & Related papers (2021-06-09T08:48:23Z) - Fuzzy Clustering with Similarity Queries [56.96625809888241]
The fuzzy or soft $k$-means objective is a popular generalization of the well-known $k$-means problem.
We show that by making few queries, the problem becomes easier to solve.
arXiv Detail & Related papers (2021-06-04T02:32:26Z) - Sum-of-norms clustering does not separate nearby balls [49.1574468325115]
We consider a continuous version of sum-of-norms clustering, where the dataset is replaced by a general measure.
We state and prove a local-global characterization of the clustering that seems to be new even in the case of discrete datapoints.
arXiv Detail & Related papers (2021-04-28T13:35:17Z) - K-expectiles clustering [0.0]
We propose a novel partitioning clustering algorithm based on expectiles.
We suggest two schemes: fixed $\tau$ clustering, and adaptive $\tau$ clustering.
arXiv Detail & Related papers (2021-03-16T21:14:56Z) - Exact Recovery of Mangled Clusters with Same-Cluster Queries [20.03712152278538]
We study the cluster recovery problem in the semi-supervised active clustering framework.
We design an algorithm that, given $n$ points to be partitioned into $k$ clusters, uses $O(k^3 \ln k \ln n)$ oracle queries and $\tilde{O}(kn + k^3)$ time to recover the clustering with zero misclassification error.
arXiv Detail & Related papers (2020-06-08T15:27:58Z) - Computationally efficient sparse clustering [67.95910835079825]
We provide a finite sample analysis of a new clustering algorithm based on PCA.
We show that it achieves the minimax optimal misclustering rate in the regime $\|\theta\| \to \infty$.
arXiv Detail & Related papers (2020-05-21T17:51:30Z) - Explainable $k$-Means and $k$-Medians Clustering [25.513261099927163]
We consider using a small decision tree to partition a data set into clusters, so that clusters can be characterized in a straightforward manner.
We show that popular top-down decision tree algorithms may lead to clusterings with arbitrarily large cost.
We design an efficient algorithm that produces explainable clusters using a tree with $k$ leaves (an illustrative sketch of tree-based cluster assignment appears after this list).
arXiv Detail & Related papers (2020-02-28T04:21:53Z)
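As referenced in the entry on explainable $k$-means and $k$-medians above, the sketch below illustrates the threshold-tree idea in hypothetical Python code: reference centers (e.g., from an ordinary $k$-means run) are separated by axis-aligned cuts until each leaf holds a single center, and points are clustered by routing them down the tree. The midpoint cut rule is a simplification chosen to keep the example short; the cited papers choose cuts so as to bound the cost of the tree-induced clustering.

```python
# A hypothetical sketch of clustering with a threshold (decision) tree, in the
# spirit of explainable k-means/k-medians: internal nodes are axis-aligned cuts
# and each leaf isolates exactly one reference center.

import numpy as np

def build_threshold_tree(centers: np.ndarray) -> dict:
    """Recursively split the centers with axis-aligned cuts until each leaf
    contains a single center (assumes the centers are distinct)."""
    spread = centers.max(axis=0) - centers.min(axis=0)
    if len(centers) == 1 or float(spread.max()) == 0.0:
        return {"leaf": centers[0]}
    dim = int(np.argmax(spread))                      # widest coordinate
    lo, hi = centers[:, dim].min(), centers[:, dim].max()
    theta = (lo + hi) / 2.0                           # midpoint cut: both sides nonempty
    mask = centers[:, dim] <= theta
    return {"dim": dim, "theta": float(theta),
            "left": build_threshold_tree(centers[mask]),
            "right": build_threshold_tree(centers[~mask])}

def assign(tree: dict, x: np.ndarray) -> np.ndarray:
    """Route a point down the tree and return the center stored at its leaf."""
    while "leaf" not in tree:
        tree = tree["left"] if x[tree["dim"]] <= tree["theta"] else tree["right"]
    return tree["leaf"]

if __name__ == "__main__":
    # Reference centers would normally come from an ordinary k-means run.
    centers = np.array([[0.0, 0.0], [5.0, 0.5], [2.5, 6.0]])
    tree = build_threshold_tree(centers)
    print(assign(tree, np.array([4.8, 0.3])))  # -> [5.  0.5]
```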