A Unified Framework for Gradient-based Clustering of Distributed Data
- URL: http://arxiv.org/abs/2402.01302v1
- Date: Fri, 2 Feb 2024 10:44:42 GMT
- Title: A Unified Framework for Gradient-based Clustering of Distributed Data
- Authors: Aleksandar Armacki, Dragana Bajović, Dušan Jakovetić, Soummya Kar
- Abstract summary: We develop a family of distributed clustering algorithms that work over networks of users.
DGC-$\mathcal{F}_\rho$ is specialized to popular clustering losses like $K$-means and Huber loss.
We show that consensus fixed points of DGC-$\mathcal{F}_\rho$ are equivalent to fixed points of gradient clustering over the full data.
- Score: 51.904327888475606
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We develop a family of distributed clustering algorithms that work over
networks of users. In the proposed scenario, users hold a local dataset and
communicate only with their immediate neighbours, with the aim of finding a
clustering of the full, joint data. The proposed family, termed Distributed
Gradient Clustering (DGC-$\mathcal{F}_\rho$), is parametrized by $\rho \geq 1$,
controlling the proximity of users' center estimates, with $\mathcal{F}$
determining the clustering loss. Specialized to popular clustering losses like
$K$-means and Huber loss, DGC-$\mathcal{F}_\rho$ gives rise to novel
distributed clustering algorithms DGC-KM$_\rho$ and DGC-HL$_\rho$, while a
novel clustering loss based on the logistic function leads to DGC-LL$_\rho$. We
provide a unified analysis and establish several strong results, under mild
assumptions. First, the sequence of centers generated by the methods converges
to a well-defined notion of fixed point, under any center initialization and
value of $\rho$. Second, as $\rho$ increases, the family of fixed points
produced by DGC-$\mathcal{F}_\rho$ converges to a notion of consensus fixed
points. We show that consensus fixed points of DGC-$\mathcal{F}_{\rho}$ are
equivalent to fixed points of gradient clustering over the full data,
guaranteeing a clustering of the full data is produced. For the special case of
Bregman losses, we show that our fixed points converge to the set of Lloyd
points. Numerical experiments on real data confirm our theoretical findings and
demonstrate strong performance of the methods.
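To make the setup concrete, below is a minimal Python sketch in the spirit of DGC-KM$_\rho$: each user keeps its own center estimates, takes gradient steps on its local $K$-means loss, and is pulled toward its neighbours' estimates with strength controlled by $\rho$. The penalty form of the consensus term, the step size, and the toy network are illustrative assumptions, not the exact DGC-$\mathcal{F}_\rho$ update rule analyzed in the paper.

```python
import numpy as np

def local_kmeans_grad(centers, data):
    """(Scaled) gradient of the local K-means loss: each center moves toward the mean of its assigned points."""
    grad = np.zeros_like(centers)
    assign = np.argmin(((data[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
    for k in range(len(centers)):
        pts = data[assign == k]
        if len(pts):
            grad[k] = centers[k] - pts.mean(axis=0)
    return grad

def dgc_km_step(centers, datasets, neighbours, rho, step=0.1):
    """One synchronous round: local gradient step plus a rho-weighted pull toward neighbours' centers."""
    new = []
    for i, (C, D) in enumerate(zip(centers, datasets)):
        consensus = sum(C - centers[j] for j in neighbours[i])  # disagreement with neighbours
        new.append(C - step * (local_kmeans_grad(C, D) + rho * consensus))
    return new

# Toy run: 3 users on a line graph, each holding samples from two clusters in 2-D (illustrative only).
rng = np.random.default_rng(0)
datasets = [np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))]) for _ in range(3)]
neighbours = {0: [1], 1: [0, 2], 2: [1]}
centers = [np.array([[1.0, 1.0], [2.0, 2.0]]) + rng.normal(0, 0.3, (2, 2)) for _ in range(3)]
for _ in range(200):
    centers = dgc_km_step(centers, datasets, neighbours, rho=2.0)
print(np.round(centers[0], 2))  # each user's estimates approach the global cluster means (~[0, 0] and ~[3, 3])
```

Larger $\rho$ tightens the agreement between neighbouring users' center estimates, matching the abstract's claim that the fixed points approach consensus fixed points as $\rho$ grows.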
Related papers
- Reinforcement Graph Clustering with Unknown Cluster Number [91.4861135742095]
We propose a new deep graph clustering method termed Reinforcement Graph Clustering.
In our proposed method, cluster number determination and unsupervised representation learning are unified into a single framework.
To provide feedback for the actions, a clustering-oriented reward function is proposed that enhances within-cluster cohesion and between-cluster separation.
arXiv Detail & Related papers (2023-08-13T18:12:28Z)
- Dink-Net: Neural Clustering on Large Graphs [59.10189693120368]
A deep graph clustering method (Dink-Net) is proposed with the idea of dilation and shrink.
Representations are learned in a self-supervised manner by discriminating whether nodes have been corrupted by augmentations.
The clustering distribution is optimized by minimizing the proposed cluster dilation loss and cluster shrink loss.
Compared to the runner-up, Dink-Net achieves a 9.62% NMI improvement on the ogbn-papers100M dataset, which has 111 million nodes and 1.6 billion edges.
arXiv Detail & Related papers (2023-05-28T15:33:24Z)
- Wasserstein $K$-means for clustering probability distributions [16.153709556346417]
In Euclidean space, the centroid-based and distance-based formulations of $K$-means are equivalent.
In modern machine learning applications, data often arise as probability distributions and a natural generalization to handle measure-valued data is to use the optimal transport metric.
We show that the SDP relaxed Wasserstein $K$-means can achieve exact recovery given the clusters are well-separated under the $2$-Wasserstein metric.
arXiv Detail & Related papers (2022-09-14T23:43:16Z)
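As a hedged illustration of the measure-valued setting (not the SDP relaxation studied in the paper above): for one-dimensional empirical distributions with the same number of samples, the $2$-Wasserstein distance is the Euclidean distance between sorted samples, and the $W_2$ barycenter is the quantile (sorted-sample) average, so a K-means-style loop can be run directly on sorted samples. The data and cluster choices below are illustrative assumptions.

```python
import numpy as np

def wasserstein_kmeans_1d(samples, K, iters=50, seed=0):
    """K-means over 1-D empirical distributions: W2 assignment + quantile-average barycenters."""
    rng = np.random.default_rng(seed)
    Q = np.sort(np.asarray(samples, dtype=float), axis=1)    # each row: sorted samples of one distribution
    centers = Q[rng.choice(len(Q), K, replace=False)].copy()  # initial barycenters
    for _ in range(iters):
        d2 = ((Q[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # squared W2 via sorted samples
        labels = d2.argmin(axis=1)
        for k in range(K):
            if np.any(labels == k):
                centers[k] = Q[labels == k].mean(axis=0)           # 1-D W2 barycenter = quantile average
    return labels, centers

# Toy example: 40 Gaussian sample sets with two distinct means, 100 samples each.
rng = np.random.default_rng(1)
dists = [rng.normal(m, 1.0, 100) for m in [0.0] * 20 + [5.0] * 20]
labels, _ = wasserstein_kmeans_1d(dists, K=2)
print(labels)  # the two groups of distributions receive two distinct labels
```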
- Distribution free optimality intervals for clustering [1.7513645771137178]
Given data $\mathcal{D}$ and a partition $\mathcal{C}$ of these data into $K$ clusters, when can we say that the clusters obtained are correct or meaningful for the data?
This paper introduces a paradigm in which a clustering $\mathcal{C}$ is considered meaningful if it is good with respect to a loss function such as the K-means distortion, and stable, i.e. the only good clustering up to small perturbations.
arXiv Detail & Related papers (2021-07-30T06:13:56Z)
- Sum-of-norms clustering does not separate nearby balls [49.1574468325115]
We consider a continuous version of sum-of-norms clustering, in which the dataset is replaced by a general measure.
We state and prove a local-global characterization of the clustering that seems to be new even in the case of discrete datapoints.
arXiv Detail & Related papers (2021-04-28T13:35:17Z)
- K-expectiles clustering [0.0]
We propose a novel partitioning clustering algorithm based on expectiles.
We suggest two schemes: fixed $\tau$ clustering and adaptive $\tau$ clustering.
arXiv Detail & Related papers (2021-03-16T21:14:56Z)
- Determinantal consensus clustering [77.34726150561087]
We propose the use of determinantal point processes or DPP for the random restart of clustering algorithms.
DPPs favor diversity of the center points within subsets.
We show through simulations that, contrary to DPP, standard random selection of centers fails both to ensure diversity and to obtain a good coverage of all data facets.
arXiv Detail & Related papers (2021-02-07T23:48:24Z)
- Quantizing Multiple Sources to a Common Cluster Center: An Asymptotic Analysis [14.048989759890475]
We consider quantizing an $Ld$-dimensional sample, which is obtained by concatenating $L$ vectors from datasets of $d$-dimensional vectors, to a $d$-dimensional cluster center.
We find a formula for the average performance distortion in the regime where the number of cluster centers is large.
In terms of faithfulness to the original (noiseless) dataset, our clustering approach outperforms the naive approach that relies on quantizing the $Ld$-dimensional noisy observation vectors to $Ld$-dimensional centers.
arXiv Detail & Related papers (2020-10-23T17:14:28Z)
- Exact Recovery of Mangled Clusters with Same-Cluster Queries [20.03712152278538]
We study the cluster recovery problem in the semi-supervised active clustering framework.
We design an algorithm that, given $n$ points to be partitioned into $k$ clusters, uses $O(k^3 \ln k \ln n)$ oracle queries and $\tilde{O}(kn + k^3)$ time to recover the clustering with zero misclassification error.
arXiv Detail & Related papers (2020-06-08T15:27:58Z)
- Computationally efficient sparse clustering [67.95910835079825]
We provide a finite sample analysis of a new clustering algorithm based on PCA.
We show that it achieves the minimax optimal misclustering rate in the regime $\|\theta\| \to \infty$.
arXiv Detail & Related papers (2020-05-21T17:51:30Z)
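As a generic, hedged sketch of the PCA-then-cluster idea in the entry above: project centred data onto the leading principal direction and split by sign. The paper's actual algorithm and its finite-sample guarantees differ and are given there; the two-component Gaussian mixture below is an illustrative assumption.

```python
import numpy as np

def pca_two_cluster(X):
    """Generic PCA clustering: project centred data onto the top principal direction, split by sign."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)  # Vt[0] is the leading principal direction
    return (Xc @ Vt[0] > 0).astype(int)

# Toy two-component mixture with symmetric mean vectors +theta and -theta.
rng = np.random.default_rng(0)
theta = np.r_[3.0, np.zeros(9)]  # sparse 10-dimensional mean vector
X = np.vstack([rng.normal(0, 1, (50, 10)) + theta, rng.normal(0, 1, (50, 10)) - theta])
labels = pca_two_cluster(X)
print(labels[:5], labels[-5:])  # the first and second halves receive (mostly) opposite labels
```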