Recovering Unbalanced Communities in the Stochastic Block Model With
Application to Clustering with a Faulty Oracle
- URL: http://arxiv.org/abs/2202.08522v3
- Date: Sat, 21 Oct 2023 07:27:33 GMT
- Title: Recovering Unbalanced Communities in the Stochastic Block Model With
Application to Clustering with a Faulty Oracle
- Authors: Chandra Sekhar Mukherjee, Pan Peng and Jiapeng Zhang
- Abstract summary: oracle block model (SBM) is a fundamental model for studying graph clustering or community detection in networks.
We provide a simple SVD-based algorithm for recovering the communities in the SBM with communities of varying sizes.
- Score: 9.578056676899203
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The stochastic block model (SBM) is a fundamental model for studying graph
clustering or community detection in networks. It has received great attention
in the last decade and the balanced case, i.e., assuming all clusters have
large size, has been well studied. However, our understanding of SBM with
unbalanced communities (arguably, more relevant in practice) is still limited.
In this paper, we provide a simple SVD-based algorithm for recovering the
communities in the SBM with communities of varying sizes. We improve upon a
result of Ailon, Chen and Xu [ICML 2013; JMLR 2015] by removing the assumption
that there is a large interval such that the sizes of clusters do not fall in,
and also remove the dependency of the size of the recoverable clusters on the
number of underlying clusters. We further complement our theoretical
improvements with experimental comparisons. Under the planted clique
conjecture, the size of the clusters that can be recovered by our algorithm is
nearly optimal (up to poly-logarithmic factors) when the probability parameters
are constant.
As a byproduct, we obtain an efficient clustering algorithm with sublinear
query complexity in a faulty oracle model, which is capable of detecting all
clusters larger than $\tilde{\Omega}({\sqrt{n}})$, even in the presence of
$\Omega(n)$ small clusters in the graph. In contrast, previous efficient
algorithms that use a sublinear number of queries are incapable of recovering
any large clusters if there are more than $\tilde{\Omega}(n^{2/5})$ small
clusters.
Related papers
- Gap-Free Clustering: Sensitivity and Robustness of SDP [6.996002801232415]
We study graph clustering in the Block Model (SBM) in the presence of both large clusters and small, unrecoverable clusters.
Previous convex relaxation approaches achieving exact recovery do not allow any small clusters of size $o(sqrtn)$, or require a size gap between the smallest recovered cluster and the largest non-recovered cluster.
We provide an algorithm based on semidefinite programming (SDP) which removes these requirements and provably recovers large clusters regardless of the remaining cluster sizes.
arXiv Detail & Related papers (2023-08-29T21:27:21Z) - Instance-Optimal Cluster Recovery in the Labeled Stochastic Block Model [79.46465138631592]
We devise an efficient algorithm that recovers clusters using the observed labels.
We present Instance-Adaptive Clustering (IAC), the first algorithm whose performance matches these lower bounds both in expectation and with high probability.
arXiv Detail & Related papers (2023-06-18T08:46:06Z) - Dink-Net: Neural Clustering on Large Graphs [59.10189693120368]
A deep graph clustering method (Dink-Net) is proposed with the idea of dilation and shrink.
By discriminating nodes, whether being corrupted by augmentations, representations are learned in a self-supervised manner.
The clustering distribution is optimized by minimizing the proposed cluster dilation loss and cluster shrink loss.
Compared to the runner-up, Dink-Net 9.62% achieves NMI improvement on the ogbn-papers100M dataset with 111 million nodes and 1.6 billion edges.
arXiv Detail & Related papers (2023-05-28T15:33:24Z) - Convex Clustering through MM: An Efficient Algorithm to Perform
Hierarchical Clustering [1.0589208420411012]
We propose convex clustering through majorization-minimization ( CCMM) -- an iterative algorithm that uses cluster fusions and a highly efficient updating scheme.
With a current desktop computer, CCMM efficiently solves convex clustering problems featuring over one million objects in seven-dimensional space.
arXiv Detail & Related papers (2022-11-03T15:07:51Z) - Optimal Clustering with Bandit Feedback [57.672609011609886]
This paper considers the problem of online clustering with bandit feedback.
It includes a novel stopping rule for sequential testing that circumvents the need to solve any NP-hard weighted clustering problem as its subroutines.
We show through extensive simulations on synthetic and real-world datasets that BOC's performance matches the lower boundally, and significantly outperforms a non-adaptive baseline algorithm.
arXiv Detail & Related papers (2022-02-09T06:05:05Z) - Lattice-Based Methods Surpass Sum-of-Squares in Clustering [98.46302040220395]
Clustering is a fundamental primitive in unsupervised learning.
Recent work has established lower bounds against the class of low-degree methods.
We show that, perhaps surprisingly, this particular clustering model textitdoes not exhibit a statistical-to-computational gap.
arXiv Detail & Related papers (2021-12-07T18:50:17Z) - Correlation Clustering Reconstruction in Semi-Adversarial Models [70.11015369368272]
Correlation Clustering is an important clustering problem with many applications.
We study the reconstruction version of this problem in which one is seeking to reconstruct a latent clustering corrupted by random noise and adversarial modifications.
arXiv Detail & Related papers (2021-08-10T14:46:17Z) - On Margin-Based Cluster Recovery with Oracle Queries [22.672233769934845]
We study an active cluster recovery problem where, given a set of $n$ points oracle and an answering queries like "are these two points in the same cluster?"
We give an algorithm that recovers arbitrary convex clusters in exactly time, and with a number of queries that is lower than the best existing algorithm by $Theta(mm)$ factors.
For general pseudometric spaces, where clusters might not be convex or might not have any notion of shape, we give an algorithm that achieves the $O(log n)$ query bound, and is provably near optimal.
arXiv Detail & Related papers (2021-06-09T08:48:23Z) - Exact Recovery of Mangled Clusters with Same-Cluster Queries [20.03712152278538]
We study the cluster recovery problem in the semi-supervised active clustering framework.
We design an algorithm that, given $n$ points to be partitioned into $k$ clusters, uses $O(k3 ln k ln n)$ oracle queries and $tildeO(kn + k3)$ time to recover the clustering with zero misclassification error.
arXiv Detail & Related papers (2020-06-08T15:27:58Z) - Computationally efficient sparse clustering [67.95910835079825]
We provide a finite sample analysis of a new clustering algorithm based on PCA.
We show that it achieves the minimax optimal misclustering rate in the regime $|theta infty$.
arXiv Detail & Related papers (2020-05-21T17:51:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.