Clustering with Non-adaptive Subset Queries
- URL: http://arxiv.org/abs/2409.10908v1
- Date: Tue, 17 Sep 2024 05:56:07 GMT
- Title: Clustering with Non-adaptive Subset Queries
- Authors: Hadley Black, Euiwoong Lee, Arya Mazumdar, Barna Saha,
- Abstract summary: Given a query $S subset U$, $|S|=2$, the oracle returns yes if the points are in the same cluster and no otherwise.
For adaptive algorithms with pair-wise queries, the number of required queries is known to be $Theta(nk)$.
Non-adaptive schemes require $Omega(n2)$ queries, which matches the trivial $O(n2)$ upper bound attained by querying every pair of points.
- Score: 16.662507957069813
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recovering the underlying clustering of a set $U$ of $n$ points by asking pair-wise same-cluster queries has garnered significant interest in the last decade. Given a query $S \subset U$, $|S|=2$, the oracle returns yes if the points are in the same cluster and no otherwise. For adaptive algorithms with pair-wise queries, the number of required queries is known to be $\Theta(nk)$, where $k$ is the number of clusters. However, non-adaptive schemes require $\Omega(n^2)$ queries, which matches the trivial $O(n^2)$ upper bound attained by querying every pair of points. To break the quadratic barrier for non-adaptive queries, we study a generalization of this problem to subset queries for $|S|>2$, where the oracle returns the number of clusters intersecting $S$. Allowing for subset queries of unbounded size, $O(n)$ queries is possible with an adaptive scheme (Chakrabarty-Liao, 2024). However, the realm of non-adaptive algorithms is completely unknown. In this paper, we give the first non-adaptive algorithms for clustering with subset queries. Our main result is a non-adaptive algorithm making $O(n \log k \cdot (\log k + \log\log n)^2)$ queries, which improves to $O(n \log \log n)$ when $k$ is a constant. We also consider algorithms with a restricted query size of at most $s$. In this setting we prove that $\Omega(\max(n^2/s^2,n))$ queries are necessary and obtain algorithms making $\tilde{O}(n^2k/s^2)$ queries for any $s \leq \sqrt{n}$ and $\tilde{O}(n^2/s)$ queries for any $s \leq n$. We also consider the natural special case when the clusters are balanced, obtaining non-adaptive algorithms which make $O(n \log k) + \tilde{O}(k)$ and $O(n\log^2 k)$ queries. Finally, allowing two rounds of adaptivity, we give an algorithm making $O(n \log k)$ queries in the general case and $O(n \log \log k)$ queries when the clusters are balanced.
Related papers
- Do you know what q-means? [50.045011844765185]
Clustering is one of the most important tools for analysis of large datasets.
We present an improved version of the "$q$-means" algorithm for clustering.
We also present a "dequantized" algorithm for $varepsilon which runs in $Obig(frack2varepsilon2(sqrtkd + log(Nd))big.
arXiv Detail & Related papers (2023-08-18T17:52:12Z) - Active Learning of Classifiers with Label and Seed Queries [18.34182076906661]
In the standard active learning setting, where only label queries are allowed, learning a classifier with strong convex hull margin $gamma$ requires in the worst case $Omegabig(k m log frac1gammabig)$ seed queries.
We show that, by carefully combining the two types of queries, a binary classifier can be learned in time $operatornamepoly(n+m)$ using only $O(m2 log n)$ label queries and $Obig(m log fracmgamma
arXiv Detail & Related papers (2022-09-08T18:46:23Z) - Optimal Clustering with Noisy Queries via Multi-Armed Bandit [19.052525950282234]
Motivated by many applications, we study clustering with a faulty oracle.
We propose a new time algorithm with $O(fracn)delta2 + textpoly(k,frac1delta, log n)$ queries.
Our main ingredient is an interesting connection between our problem and multi-armed bandit.
arXiv Detail & Related papers (2022-07-12T08:17:29Z) - The First Optimal Acceleration of High-Order Methods in Smooth Convex
Optimization [88.91190483500932]
We study the fundamental open question of finding the optimal high-order algorithm for solving smooth convex minimization problems.
The reason for this is that these algorithms require performing a complex binary procedure, which makes them neither optimal nor practical.
We fix this fundamental issue by providing the first algorithm with $mathcalOleft(epsilon-2/(p+1)right) $pth order oracle complexity.
arXiv Detail & Related papers (2022-05-19T16:04:40Z) - How to Query An Oracle? Efficient Strategies to Label Data [59.89900843097016]
We consider the basic problem of querying an expert oracle for labeling a dataset in machine learning.
We present a randomized batch algorithm that operates on a round-by-round basis to label the samples and achieves a query rate of $O(fracNk2)$.
In addition, we present an adaptive greedy query scheme, which achieves an average rate of $approx 0.2N$ queries per sample with triplet queries.
arXiv Detail & Related papers (2021-10-05T20:15:35Z) - Towards a Query-Optimal and Time-Efficient Algorithm for Clustering with
a Faulty Oracle [7.449644976563424]
We propose an elegant theoretical model for studying clustering with a faulty oracle.
It was left as an open question whether one can obtain a query-optimal, time-efficient algorithm for the general case of $k$ clusters.
We provide a time-efficient algorithm with nearly-optimal query complexity (up to a factor of $O(log2 n)$) for all constant $k$ and any $delta$ in the regime when information-theoretic recovery is possible.
arXiv Detail & Related papers (2021-06-18T22:20:12Z) - Fuzzy Clustering with Similarity Queries [56.96625809888241]
The fuzzy or soft objective is a popular generalization of the well-known $k$-means problem.
We show that by making few queries, the problem becomes easier to solve.
arXiv Detail & Related papers (2021-06-04T02:32:26Z) - Deterministic Algorithms for the Hidden Subgroup Problem [3.2590610391507444]
We present deterministic algorithms for the Hidden Subgroup Problem.
For abelian groups, the first algorithm achieves the same worst-case query complexity as the optimal randomized algorithm.
The analogous algorithm for non-abelian groups comes within a $sqrt log n$ factor of the optimal randomized query complexity.
arXiv Detail & Related papers (2021-04-29T15:55:15Z) - Streaming Complexity of SVMs [110.63976030971106]
We study the space complexity of solving the bias-regularized SVM problem in the streaming model.
We show that for both problems, for dimensions of $frac1lambdaepsilon$, one can obtain streaming algorithms with spacely smaller than $frac1lambdaepsilon$.
arXiv Detail & Related papers (2020-07-07T17:10:00Z) - Query complexity of heavy hitter estimation [6.373263986460191]
We consider the problem of identifying the subset $mathcalSgamma_mathcalP$ of elements in the support of an underlying distribution $mathcalP$.
We consider two query models: $(a)$ each query is an index $i$ and the oracle return the value $X_i$ and $(b)$ each query is a pair $(i,j)$.
For each of these query models, we design sequential estimation algorithms which at each round, either decide what query to send to the oracle depending on the entire
arXiv Detail & Related papers (2020-05-29T07:15:46Z) - Tight Quantum Lower Bound for Approximate Counting with Quantum States [49.6558487240078]
We prove tight lower bounds for the following variant of the counting problem considered by Aaronson, Kothari, Kretschmer, and Thaler ( 2020)
The task is to distinguish whether an input set $xsubseteq [n]$ has size either $k$ or $k'=(1+varepsilon)k$.
arXiv Detail & Related papers (2020-02-17T10:53:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.