Towards a Query-Optimal and Time-Efficient Algorithm for Clustering with
a Faulty Oracle
- URL: http://arxiv.org/abs/2106.10374v1
- Date: Fri, 18 Jun 2021 22:20:12 GMT
- Title: Towards a Query-Optimal and Time-Efficient Algorithm for Clustering with
a Faulty Oracle
- Authors: Pan Peng, Jiapeng Zhang
- Abstract summary: We propose an elegant theoretical model for studying clustering with a faulty oracle.
It was left as an open question whether one can obtain a query-optimal, time-efficient algorithm for the general case of $k$ clusters.
We provide a time-efficient algorithm with nearly-optimal query complexity (up to a factor of $O(log2 n)$) for all constant $k$ and any $delta$ in the regime when information-theoretic recovery is possible.
- Score: 7.449644976563424
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Motivated by applications in crowdsourced entity resolution in database,
signed edge prediction in social networks and correlation clustering, Mazumdar
and Saha [NIPS 2017] proposed an elegant theoretical model for studying
clustering with a faulty oracle. In this model, given a set of $n$ items which
belong to $k$ unknown groups (or clusters), our goal is to recover the clusters
by asking pairwise queries to an oracle. This oracle can answer the query that
``do items $u$ and $v$ belong to the same cluster?''. However, the answer to
each pairwise query errs with probability $\varepsilon$, for some
$\varepsilon\in(0,\frac12)$. Mazumdar and Saha provided two algorithms under
this model: one algorithm is query-optimal while time-inefficient (i.e.,
running in quasi-polynomial time), the other is time efficient (i.e., in
polynomial time) while query-suboptimal. Larsen, Mitzenmacher and Tsourakakis
[WWW 2020] then gave a new time-efficient algorithm for the special case of $2$
clusters, which is query-optimal if the bias $\delta:=1-2\varepsilon$ of the
model is large. It was left as an open question whether one can obtain a
query-optimal, time-efficient algorithm for the general case of $k$ clusters
and other regimes of $\delta$.
In this paper, we make progress on the above question and provide a
time-efficient algorithm with nearly-optimal query complexity (up to a factor
of $O(\log^2 n)$) for all constant $k$ and any $\delta$ in the regime when
information-theoretic recovery is possible. Our algorithm is built on a
connection to the stochastic block model.
Related papers
- Clustering with Non-adaptive Subset Queries [16.662507957069813]
Given a query $S subset U$, $|S|=2$, the oracle returns yes if the points are in the same cluster and no otherwise.
For adaptive algorithms with pair-wise queries, the number of required queries is known to be $Theta(nk)$.
Non-adaptive schemes require $Omega(n2)$ queries, which matches the trivial $O(n2)$ upper bound attained by querying every pair of points.
arXiv Detail & Related papers (2024-09-17T05:56:07Z) - A Scalable Algorithm for Individually Fair K-means Clustering [77.93955971520549]
We present a scalable algorithm for the individually fair ($p$, $k$)-clustering problem introduced by Jung et al. and Mahabadi et al.
A clustering is then called individually fair if it has centers within distance $delta(x)$ of $x$ for each $xin P$.
We show empirically that not only is our algorithm much faster than prior work, but it also produces lower-cost solutions.
arXiv Detail & Related papers (2024-02-09T19:01:48Z) - Simple, Scalable and Effective Clustering via One-Dimensional
Projections [10.807367640692021]
Clustering is a fundamental problem in unsupervised machine learning with many applications in data analysis.
We introduce a simple randomized clustering algorithm that provably runs in expected time $O(mathrmnnz(X) + nlog n)$ for arbitrary $k$.
We prove that our algorithm achieves approximation ratio $smashwidetildeO(k4)$ on any input dataset for the $k$-means objective.
arXiv Detail & Related papers (2023-10-25T16:37:45Z) - Do you know what q-means? [50.045011844765185]
Clustering is one of the most important tools for analysis of large datasets.
We present an improved version of the "$q$-means" algorithm for clustering.
We also present a "dequantized" algorithm for $varepsilon which runs in $Obig(frack2varepsilon2(sqrtkd + log(Nd))big.
arXiv Detail & Related papers (2023-08-18T17:52:12Z) - Efficiently Learning One-Hidden-Layer ReLU Networks via Schur
Polynomials [50.90125395570797]
We study the problem of PAC learning a linear combination of $k$ ReLU activations under the standard Gaussian distribution on $mathbbRd$ with respect to the square loss.
Our main result is an efficient algorithm for this learning task with sample and computational complexity $(dk/epsilon)O(k)$, whereepsilon>0$ is the target accuracy.
arXiv Detail & Related papers (2023-07-24T14:37:22Z) - Private estimation algorithms for stochastic block models and mixture
models [63.07482515700984]
General tools for designing efficient private estimation algorithms.
First efficient $(epsilon, delta)$-differentially private algorithm for both weak recovery and exact recovery.
arXiv Detail & Related papers (2023-01-11T09:12:28Z) - How to Query An Oracle? Efficient Strategies to Label Data [59.89900843097016]
We consider the basic problem of querying an expert oracle for labeling a dataset in machine learning.
We present a randomized batch algorithm that operates on a round-by-round basis to label the samples and achieves a query rate of $O(fracNk2)$.
In addition, we present an adaptive greedy query scheme, which achieves an average rate of $approx 0.2N$ queries per sample with triplet queries.
arXiv Detail & Related papers (2021-10-05T20:15:35Z) - Fuzzy Clustering with Similarity Queries [56.96625809888241]
The fuzzy or soft objective is a popular generalization of the well-known $k$-means problem.
We show that by making few queries, the problem becomes easier to solve.
arXiv Detail & Related papers (2021-06-04T02:32:26Z) - Query complexity of heavy hitter estimation [6.373263986460191]
We consider the problem of identifying the subset $mathcalSgamma_mathcalP$ of elements in the support of an underlying distribution $mathcalP$.
We consider two query models: $(a)$ each query is an index $i$ and the oracle return the value $X_i$ and $(b)$ each query is a pair $(i,j)$.
For each of these query models, we design sequential estimation algorithms which at each round, either decide what query to send to the oracle depending on the entire
arXiv Detail & Related papers (2020-05-29T07:15:46Z) - Query-Efficient Correlation Clustering [13.085439249887713]
Correlation clustering is arguably the most natural formulation of clustering.
A main drawback of correlation clustering is that it requires as input the $Theta(n2)$ pairwise similarities.
We devise a correlation clustering algorithm that attains a solution whose expected number of disagreements is at most $3cdot OPT + O(fracn3Q)$.
arXiv Detail & Related papers (2020-02-26T15:18:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.