Fuzzy Clustering with Similarity Queries
- URL: http://arxiv.org/abs/2106.02212v1
- Date: Fri, 4 Jun 2021 02:32:26 GMT
- Title: Fuzzy Clustering with Similarity Queries
- Authors: Wasim Huleihel, Arya Mazumdar, Soumyabrata Pal
- Abstract summary: The fuzzy or soft $k$-means objective is a popular generalization of the well-known $k$-means problem.
We show that by making a few similarity queries, the problem becomes easier to solve.
- Score: 56.96625809888241
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The fuzzy or soft $k$-means objective is a popular generalization of the
well-known $k$-means problem, extending the clustering capability of the
$k$-means to datasets that are uncertain, vague, and otherwise hard to cluster.
In this paper, we propose a semi-supervised active clustering framework, where
the learner is allowed to interact with an oracle (domain expert), asking for
the similarity between a certain set of chosen items. We study the query and
computational complexities of clustering in this framework. We prove that
having a few such similarity queries enables one to obtain a polynomial-time
approximation algorithm for an otherwise conjecturally NP-hard problem. In
particular, we provide probabilistic algorithms for fuzzy clustering in this
setting that ask $O(\mathsf{poly}(k)\log n)$ similarity queries and run in
polynomial time, where $n$ is the number of items. The fuzzy $k$-means
objective is nonconvex, contains $k$-means as a special case, and is
equivalent to other generic nonconvex problems such as non-negative matrix
factorization. The ubiquitous Lloyd-type algorithms (or
expectation-maximization algorithms) can get stuck at local minima. Our
results show that by making a few similarity queries, the problem becomes
easier to solve. Finally, we test our algorithms on real-world datasets,
demonstrating their effectiveness in practice.
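To make the objective concrete, here is a minimal sketch, not the paper's algorithm: it implements the standard fuzzy $k$-means (fuzzy $c$-means) objective with plain Lloyd/EM-style alternating updates, plus a toy same-cluster oracle of the kind an active-clustering learner could query. All identifiers (fuzzy_kmeans, similarity_oracle, the fuzzifier m) are illustrative assumptions and do not come from the paper.

```python
# Sketch of the fuzzy k-means (fuzzy c-means) objective and a Lloyd/EM-style
# solver, plus a toy pairwise-similarity oracle.  This illustrates the
# objective discussed in the abstract; it is NOT the query-efficient
# algorithm proposed in the paper.
import numpy as np

def fuzzy_kmeans(X, k, m=2.0, iters=100, seed=0):
    """Alternating updates for sum_{i,j} w_ij^m * ||x_i - c_j||^2
    subject to sum_j w_ij = 1.  Like any Lloyd-type scheme, this can
    get stuck in local minima."""
    rng = np.random.default_rng(seed)
    n, _ = X.shape
    W = rng.random((n, k))
    W /= W.sum(axis=1, keepdims=True)                    # soft memberships
    for _ in range(iters):
        Wm = W ** m
        C = (Wm.T @ X) / Wm.sum(axis=0)[:, None]         # weighted centers
        D = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1) + 1e-12
        W = 1.0 / (D ** (1.0 / (m - 1)))                 # membership update
        W /= W.sum(axis=1, keepdims=True)
    cost = ((W ** m) * D).sum()
    return W, C, cost

def similarity_oracle(i, j, true_labels):
    """Toy oracle: answers whether items i and j belong to the same cluster.
    In the active framework, a domain expert plays this role."""
    return true_labels[i] == true_labels[j]

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in (0.0, 3.0, 6.0)])
    labels = np.repeat([0, 1, 2], 50)
    W, C, cost = fuzzy_kmeans(X, k=3)
    print("cost:", round(cost, 3), "oracle(0, 60):", similarity_oracle(0, 60, labels))
```

The abstract's point is that, in the active framework, only $O(\mathsf{poly}(k)\log n)$ such oracle answers are needed to obtain a polynomial-time approximation, sidestepping the local minima that can trap unaided alternating updates like the sketch above.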
Related papers
- Simple, Scalable and Effective Clustering via One-Dimensional
Projections [10.807367640692021]
Clustering is a fundamental problem in unsupervised machine learning with many applications in data analysis.
We introduce a simple randomized clustering algorithm that provably runs in expected time $O(\mathrm{nnz}(X) + n\log n)$ for arbitrary $k$.
We prove that our algorithm achieves approximation ratio $\widetilde{O}(k^4)$ on any input dataset for the $k$-means objective.
arXiv Detail & Related papers (2023-10-25T16:37:45Z) - Do you know what q-means? [50.045011844765185]
Clustering is one of the most important tools for analysis of large datasets.
We present an improved version of the "$q$-means" algorithm for clustering.
We also present a "dequantized" algorithm for $varepsilon which runs in $Obig(frack2varepsilon2(sqrtkd + log(Nd))big.
arXiv Detail & Related papers (2023-08-18T17:52:12Z) - Are Easy Data Easy (for K-Means) [0.0]
This paper investigates the capability of correctly recovering well-separated clusters by various brands of the $k$-means algorithm.
A new algorithm is proposed that is a variation of $k$-means++ via repeated subsampling when choosing a seed.
arXiv Detail & Related papers (2023-08-02T09:40:19Z) - Replicable Clustering [57.19013971737493]
We propose algorithms for the statistical $k$-medians, statistical $k$-means, and statistical $k$-centers problems by utilizing approximation routines for their counterparts in a black-box manner.
We also provide experiments on synthetic distributions in 2D using the $k$-means++ implementation from sklearn as a black-box that validate our theoretical results.
arXiv Detail & Related papers (2023-02-20T23:29:43Z) - Differentially-Private Hierarchical Clustering with Provable
Approximation Guarantees [79.59010418610625]
We study differentially private approximation algorithms for hierarchical clustering.
We show strong lower bounds for the problem: any $\epsilon$-DP algorithm must exhibit $\Omega(|V|^2/\epsilon)$-additive error for an input dataset.
We propose a private $1+o(1)$ approximation algorithm which also recovers the blocks exactly.
arXiv Detail & Related papers (2023-01-31T19:14:30Z) - Lattice-Based Methods Surpass Sum-of-Squares in Clustering [98.46302040220395]
Clustering is a fundamental primitive in unsupervised learning.
Recent work has established lower bounds against the class of low-degree methods.
We show that, perhaps surprisingly, this particular clustering model does not exhibit a statistical-to-computational gap.
arXiv Detail & Related papers (2021-12-07T18:50:17Z) - Computationally efficient sparse clustering [67.95910835079825]
We provide a finite sample analysis of a new clustering algorithm based on PCA.
We show that it achieves the minimax optimal misclustering rate in the regime $\|\theta\| \rightarrow \infty$.
arXiv Detail & Related papers (2020-05-21T17:51:30Z) - Explainable $k$-Means and $k$-Medians Clustering [25.513261099927163]
We consider using a small decision tree to partition a data set into clusters, so that clusters can be characterized in a straightforward manner.
We show that popular top-down decision tree algorithms may lead to clusterings with arbitrarily large cost.
We design an efficient algorithm that produces explainable clusters using a tree with $k$ leaves.
arXiv Detail & Related papers (2020-02-28T04:21:53Z) - Query-Efficient Correlation Clustering [13.085439249887713]
Correlation clustering is arguably the most natural formulation of clustering.
A main drawback of correlation clustering is that it requires as input the $\Theta(n^2)$ pairwise similarities.
We devise a correlation clustering algorithm that attains a solution whose expected number of disagreements is at most $3\cdot \mathrm{OPT} + O\big(\frac{n^3}{Q}\big)$, where $Q$ is the number of queries.
arXiv Detail & Related papers (2020-02-26T15:18:20Z) - Efficient Algorithms for Generating Provably Near-Optimal Cluster
Descriptors for Explainability [36.11663695534294]
We study the problem of making clusters more interpretable by extending a recent approach of constructing succinct representations for clusters.
We develop approximation algorithms with provable performance guarantees for the problem.
We also show applications to explain clusters from datasets, including clusters of genomic sequences that represent different threat levels.
arXiv Detail & Related papers (2020-02-06T19:49:54Z)