Clustering with Queries under Semi-Random Noise
- URL: http://arxiv.org/abs/2206.04583v1
- Date: Thu, 9 Jun 2022 16:02:00 GMT
- Title: Clustering with Queries under Semi-Random Noise
- Authors: Alberto Del Pia, Mingchen Ma, Christos Tzamos
- Abstract summary: We develop robust learning methods that tolerate general semi-random noise.
We show that information theoretically $O\left(\frac{nk \log n}{(1-2p)^2}\right)$ queries suffice to learn any cluster of sufficiently large size.
- Score: 13.817228853960655
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The seminal paper by Mazumdar and Saha \cite{MS17a} introduced an extensive
line of work on clustering with noisy queries. Yet, despite significant
progress on the problem, the proposed methods depend crucially on knowing the
exact probabilities of errors of the underlying fully-random oracle. In this
work, we develop robust learning methods that tolerate general semi-random
noise obtaining qualitatively the same guarantees as the best possible methods
in the fully-random model.
More specifically, given a set of $n$ points with an unknown underlying
partition, we are allowed to query pairs of points $u,v$ to check if they are
in the same cluster, but with probability $p$, the answer may be adversarially
chosen. We show that information theoretically $O\left(\frac{nk \log n}
{(1-2p)^2}\right)$ queries suffice to learn any cluster of sufficiently large
size. Our main result is a computationally efficient algorithm that can
identify large clusters with $O\left(\frac{nk \log n} {(1-2p)^2}\right) +
\text{poly}\left(\log n, k, \frac{1}{1-2p} \right)$ queries, matching the
guarantees of the best known algorithms in the fully-random model. As a
corollary of our approach, we develop the first parameter-free algorithm for
the fully-random model, answering an open question by \cite{MS17a}.
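To make the query model above concrete, here is a minimal simulation sketch (not the paper's algorithm): a same-cluster oracle whose per-pair answers are wrong with probability $p$ and persistent (re-asking a pair returns the same answer, as is standard in the faulty-oracle model), plus a majority-vote membership test that spreads its $O\left(\frac{\log n}{(1-2p)^2}\right)$ repetitions over distinct members of a candidate cluster — the repetition count behind the query bound quoted in the abstract. Function names, the constant 8, and the fully-random flips are illustrative assumptions; in the semi-random model the errors may be adversarially chosen.

```python
import math
import random

def make_noisy_oracle(labels, p, rng):
    """Simulate a same-cluster oracle that errs with probability p.
    Answers are persistent: re-asking a pair returns the cached answer.
    (Random flips are for illustration; in the semi-random model the
    adversary may choose when to flip.)"""
    cache = {}
    def oracle(u, v):
        key = (min(u, v), max(u, v))
        if key not in cache:
            truth = labels[u] == labels[v]
            cache[key] = truth if rng.random() > p else not truth
        return cache[key]
    return oracle

def belongs_to_cluster(oracle, u, members, n, p, rng):
    """Majority-vote membership test. Because per-pair noise is persistent,
    the ~O(log n / (1-2p)^2) repetitions are spread over distinct members
    of the candidate cluster rather than re-asking a single pair."""
    reps = min(len(members), math.ceil(8 * math.log(n) / (1 - 2 * p) ** 2))
    sample = rng.sample(members, reps)
    return sum(oracle(u, v) for v in sample) > reps / 2

rng = random.Random(0)
n = 200
labels = [i % 2 for i in range(n)]    # two clusters: even vs odd indices
oracle = make_noisy_oracle(labels, p=0.2, rng=rng)
evens = [i for i in range(2, n, 2)]   # known members of point 0's cluster
print(belongs_to_cluster(oracle, 0, evens, n, p=0.2, rng=rng))  # True
print(belongs_to_cluster(oracle, 1, evens, n, p=0.2, rng=rng))  # False
```

With $p = 0.2$ each vote is correct with probability $0.8$, so a majority over 99 sampled pairs identifies membership except with exponentially small failure probability, matching the $\frac{1}{(1-2p)^2}$ dependence in the bound.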
Related papers
- Perturb-and-Project: Differentially Private Similarities and Marginals [73.98880839337873]
We revisit the input perturbations framework for differential privacy where noise is added to the input $A \in \mathcal{S}$.
We first design novel efficient algorithms to privately release pair-wise cosine similarities.
We derive a novel algorithm to compute $k$-way marginal queries over $n$ features.
arXiv Detail & Related papers (2024-06-07T12:07:16Z) - A Scalable Algorithm for Individually Fair K-means Clustering [77.93955971520549]
We present a scalable algorithm for the individually fair ($p$, $k$)-clustering problem introduced by Jung et al. and Mahabadi et al.
A clustering is then called individually fair if it has centers within distance $\delta(x)$ of $x$ for each $x \in P$.
We show empirically that not only is our algorithm much faster than prior work, but it also produces lower-cost solutions.
arXiv Detail & Related papers (2024-02-09T19:01:48Z) - On the average-case complexity of learning output distributions of
quantum circuits [55.37943886895049]
We show that learning the output distributions of brickwork random quantum circuits is average-case hard in the statistical query model.
This learning model is widely used as an abstract computational model for most generic learning algorithms.
arXiv Detail & Related papers (2023-05-09T20:53:27Z) - Private estimation algorithms for stochastic block models and mixture
models [63.07482515700984]
General tools for designing efficient private estimation algorithms.
First efficient $(\epsilon, \delta)$-differentially private algorithm for both weak recovery and exact recovery.
arXiv Detail & Related papers (2023-01-11T09:12:28Z) - Towards a Query-Optimal and Time-Efficient Algorithm for Clustering with
a Faulty Oracle [7.449644976563424]
We propose an elegant theoretical model for studying clustering with a faulty oracle.
It was left as an open question whether one can obtain a query-optimal, time-efficient algorithm for the general case of $k$ clusters.
We provide a time-efficient algorithm with nearly-optimal query complexity (up to a factor of $O(\log^2 n)$) for all constant $k$ and any $\delta$ in the regime when information-theoretic recovery is possible.
arXiv Detail & Related papers (2021-06-18T22:20:12Z) - Fuzzy Clustering with Similarity Queries [56.96625809888241]
The fuzzy or soft objective is a popular generalization of the well-known $k$-means problem.
We show that by making few queries, the problem becomes easier to solve.
arXiv Detail & Related papers (2021-06-04T02:32:26Z) - Explainable $k$-Means and $k$-Medians Clustering [25.513261099927163]
We consider using a small decision tree to partition a data set into clusters, so that clusters can be characterized in a straightforward manner.
We show that popular top-down decision tree algorithms may lead to clusterings with arbitrarily large cost.
We design an efficient algorithm that produces explainable clusters using a tree with $k$ leaves.
arXiv Detail & Related papers (2020-02-28T04:21:53Z) - Query-Efficient Correlation Clustering [13.085439249887713]
Correlation clustering is arguably the most natural formulation of clustering.
A main drawback of correlation clustering is that it requires as input the $\Theta(n^2)$ pairwise similarities.
We devise a correlation clustering algorithm that attains a solution whose expected number of disagreements is at most $3 \cdot \mathrm{OPT} + O\left(\frac{n^3}{Q}\right)$.
arXiv Detail & Related papers (2020-02-26T15:18:20Z) - Locally Private Hypothesis Selection [96.06118559817057]
We output a distribution from $\mathcal{Q}$ whose total variation distance to $p$ is comparable to the best such distribution.
We show that the constraint of local differential privacy incurs an exponential increase in cost.
Our algorithms result in exponential improvements on the round complexity of previous methods.
arXiv Detail & Related papers (2020-02-21T18:30:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.