Exponentially Consistent Nonparametric Clustering of Data Streams
- URL: http://arxiv.org/abs/2411.13922v1
- Date: Thu, 21 Nov 2024 08:18:04 GMT
- Title: Exponentially Consistent Nonparametric Clustering of Data Streams
- Authors: Bhupender Singh, Ananth Ram Rajagopalan, Srikrishna Bhashyam
- Abstract summary: We consider nonparametric clustering of $M$ independent and identically distributed (i.i.d.) data streams generated from unknown distributions.
Existing results on nonparametric clustering algorithms, like single linkage-based (SLINK) clustering and $k$-medoids distribution clustering, assume that the maximum intra-cluster distance ($d_L$) is smaller than the minimum inter-cluster distance ($d_H$).
- Abstract: In this paper, we consider nonparametric clustering of $M$ independent and identically distributed (i.i.d.) data streams generated from unknown distributions. The distributions of the $M$ data streams belong to $K$ underlying distribution clusters. Existing results on exponentially consistent nonparametric clustering algorithms, like single linkage-based (SLINK) clustering and $k$-medoids distribution clustering, assume that the maximum intra-cluster distance ($d_L$) is smaller than the minimum inter-cluster distance ($d_H$). First, in the fixed sample size (FSS) setting, we show that exponential consistency can be achieved for SLINK clustering under a less strict assumption, $d_I < d_H$, where $d_I$ is the maximum distance between any two sub-clusters of a cluster that partition the cluster. Note that $d_I < d_L$ in general. Our results show that SLINK is exponentially consistent for a larger class of problems than $k$-medoids distribution clustering. We also identify examples where $k$-medoids clustering is unable to find the true clusters, but SLINK is exponentially consistent. Then, we propose a sequential clustering algorithm, named SLINK-SEQ, based on SLINK and prove that it is also exponentially consistent. Simulation results show that the SLINK-SEQ algorithm requires a smaller expected number of samples than the FSS SLINK algorithm for the same probability of error.
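The abstract describes the FSS pipeline (estimate pairwise distances between the streams' empirical distributions, then cluster by single linkage) without specifying an implementation. The sketch below is a minimal hypothetical illustration, not the authors' code: the function name `slink_cluster_streams` is invented, and the Kolmogorov-Smirnov statistic is assumed as a stand-in for whatever distribution distance the paper uses.

```python
# Minimal sketch of FSS single-linkage clustering of M data streams into
# K distribution clusters. Assumptions (not from the paper): the KS statistic
# as the distance between empirical distributions, and 1-D real-valued streams.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from scipy.stats import ks_2samp

def slink_cluster_streams(streams, K):
    """Cluster i.i.d. data streams via single linkage on pairwise distances.

    streams: list of M 1-D sample arrays (one fixed-size sample per stream).
    K: number of underlying distribution clusters.
    Returns an array of M cluster labels in {1, ..., K}.
    """
    M = len(streams)
    D = np.zeros((M, M))
    for i in range(M):
        for j in range(i + 1, M):
            # Distance between empirical distributions of streams i and j.
            D[i, j] = D[j, i] = ks_2samp(streams[i], streams[j]).statistic
    # Single-linkage dendrogram on the condensed distance matrix,
    # cut so that exactly K clusters remain.
    Z = linkage(squareform(D), method="single")
    return fcluster(Z, t=K, criterion="maxclust")

# Toy usage: M = 6 streams drawn from K = 2 Gaussian clusters.
rng = np.random.default_rng(0)
streams = [rng.normal(0.0, 1.0, 500) for _ in range(3)] + \
          [rng.normal(3.0, 1.0, 500) for _ in range(3)]
print(slink_cluster_streams(streams, K=2))  # e.g. [1 1 1 2 2 2]
```

The sequential SLINK-SEQ variant would wrap a step like this in a sampling loop with a stopping rule; the paper's specific stopping rule is not reproduced here.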
Related papers
- Clustering Mixtures of Bounded Covariance Distributions Under Optimal Separation [44.25945344950543]
We study the clustering problem for mixtures of bounded covariance distributions.
We give the first poly-time algorithm for this clustering task.
Our algorithm is robust to an $\Omega(\alpha)$-fraction of adversarial outliers.
arXiv Detail & Related papers (2023-12-19T01:01:53Z) - Revisiting Instance-Optimal Cluster Recovery in the Labeled Stochastic Block Model [69.15976031704687]
We propose IAC (Instance-Adaptive Clustering), the first algorithm whose performance matches the instance-specific lower bounds both in expectation and with high probability.
IAC maintains an overall computational complexity of $\mathcal{O}(n \, \mathrm{polylog}(n))$, making it scalable and practical for large-scale problems.
arXiv Detail & Related papers (2023-06-18T08:46:06Z) - Lattice-Based Methods Surpass Sum-of-Squares in Clustering [98.46302040220395]
Clustering is a fundamental primitive in unsupervised learning.
Recent work has established lower bounds against the class of low-degree methods.
We show that, perhaps surprisingly, this particular clustering model does not exhibit a statistical-to-computational gap.
arXiv Detail & Related papers (2021-12-07T18:50:17Z) - Self-supervised Contrastive Attributed Graph Clustering [110.52694943592974]
We propose a novel attributed graph clustering network, namely Self-supervised Contrastive Attributed Graph Clustering (SCAGC).
In SCAGC, a self-supervised contrastive loss is designed for node representation learning by leveraging inaccurate clustering labels.
For out-of-sample (OOS) nodes, SCAGC can directly calculate their clustering labels.
arXiv Detail & Related papers (2021-10-15T03:25:28Z) - Fuzzy Clustering with Similarity Queries [56.96625809888241]
The fuzzy or soft objective is a popular generalization of the well-known $k$-means problem.
We show that by making a small number of queries, the problem becomes easier to solve.
arXiv Detail & Related papers (2021-06-04T02:32:26Z) - Sum-of-norms clustering does not separate nearby balls [49.1574468325115]
We study a continuous version of sum-of-norms clustering, where the dataset is replaced by a general measure.
We state and prove a local-global characterization of the clustering that seems to be new even in the case of discrete datapoints.
arXiv Detail & Related papers (2021-04-28T13:35:17Z) - K-expectiles clustering [0.0]
We propose a novel partitioning clustering algorithm based on expectiles.
We suggest two schemes: fixed $\tau$ clustering and adaptive $\tau$ clustering.
arXiv Detail & Related papers (2021-03-16T21:14:56Z) - Exact Recovery of Clusters in Finite Metric Spaces Using Oracle Queries [22.672233769934845]
We investigate the problem of exact cluster recovery using oracle queries.
We show that, if we are allowed to make $O(k^2 + k)$ additional queries, arbitrary clusters can be recovered.
arXiv Detail & Related papers (2021-01-31T18:00:29Z) - Faster DBSCAN via subsampled similarity queries [42.93847281082316]
DBSCAN is a popular density-based clustering algorithm.
We propose SNG-DBSCAN, which clusters based on a subsampled $\epsilon$-neighborhood graph.
arXiv Detail & Related papers (2020-06-11T18:57:54Z) - Computationally efficient sparse clustering [67.95910835079825]
We provide a finite sample analysis of a new clustering algorithm based on PCA.
We show that it achieves the minimax optimal misclustering rate in the regime $\|\theta\| \rightarrow \infty$.
arXiv Detail & Related papers (2020-05-21T17:51:30Z) - Statistical power for cluster analysis [0.0]
Clustering algorithms are increasingly popular in biomedical research.
We estimate power and accuracy for common analyses through simulation.
We recommend that researchers only apply cluster analysis when large subgroup separation is expected.
arXiv Detail & Related papers (2020-03-01T02:43:15Z)