Randomized Greedy Algorithms and Composable Coreset for k-Center
Clustering with Outliers
- URL: http://arxiv.org/abs/2301.02814v1
- Date: Sat, 7 Jan 2023 09:26:01 GMT
- Title: Randomized Greedy Algorithms and Composable Coreset for k-Center
Clustering with Outliers
- Authors: Hu Ding, Ruomin Huang, Kai Liu, Haikuo Yu, Zixiu Wang
- Abstract summary: The presence of outliers can significantly increase the computational complexity.
Our idea is inspired by the greedy method, that was developed for solving the ordinary $k$-center clustering problem.
- Score: 11.546734084378683
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we study the problem of {\em $k$-center clustering with
outliers}. The problem has many important applications in real world, but the
presence of outliers can significantly increase the computational complexity.
Though a number of methods have been developed in the past decades, it is still
quite challenging to design quality guaranteed algorithm with low complexity
for this problem. Our idea is inspired by the greedy method, Gonzalez's
algorithm, that was developed for solving the ordinary $k$-center clustering
problem. Based on some novel observations, we show that a simple randomized
version of this greedy strategy actually can handle outliers efficiently. We
further show that this randomized greedy approach also yields small coreset for
the problem in doubling metrics (even if the doubling dimension is not given),
which can greatly reduce the computational complexity. Moreover, together with
the partial clustering framework proposed in arXiv:1703.01539 , we prove that
our coreset method can be applied to distributed data with a low communication
complexity. The experimental results suggest that our algorithms can achieve
near optimal solutions and yield lower complexities comparing with the existing
methods.
Related papers
- From Large to Small Datasets: Size Generalization for Clustering
Algorithm Selection [12.993073967843292]
We study a problem in a semi-supervised setting, with an unknown ground-truth clustering.
We introduce a notion of size generalization for clustering algorithm accuracy.
We use a subsample of as little as 5% of the data to identify which algorithm is best on the full dataset.
arXiv Detail & Related papers (2024-02-22T06:53:35Z) - A cutting plane algorithm for globally solving low dimensional k-means
clustering problems [4.5594982923247995]
We consider the k-means problem for instances with low dimensional data and formulate it as a structured concave assignment problem.
This allows us to exploit the low dimensional structure and solve the problem to global optimality within reasonable time.
The paper combines methods from global optimization theory to accelerate the procedure, and we provide numerical results.
arXiv Detail & Related papers (2024-02-21T07:55:33Z) - A Novel Normalized-Cut Solver with Nearest Neighbor Hierarchical
Initialization [107.07093621337084]
Normalized-Cut (N-Cut) is a famous model of spectral clustering.
Traditional N-Cut solvers are two-stage: 1) calculating the continuous spectral embedding of normalized Laplacian matrix; 2) discretization via $K$-means or spectral rotation.
We propose a novel N-Cut solver based on the famous coordinate descent method.
arXiv Detail & Related papers (2023-11-26T07:11:58Z) - Scalable Clustering: Large Scale Unsupervised Learning of Gaussian
Mixture Models with Outliers [5.478764356647437]
This paper introduces a provably robust clustering algorithm based on loss minimization.
It provides theoretical guarantees that the algorithm obtains high accuracy with high probability.
Experiments on real-world large-scale datasets demonstrate the effectiveness of the algorithm.
arXiv Detail & Related papers (2023-02-28T14:39:18Z) - Differentially-Private Clustering of Easy Instances [67.04951703461657]
In differentially private clustering, the goal is to identify $k$ cluster centers without disclosing information on individual data points.
We provide implementable differentially private clustering algorithms that provide utility when the data is "easy"
We propose a framework that allows us to apply non-private clustering algorithms to the easy instances and privately combine the results.
arXiv Detail & Related papers (2021-12-29T08:13:56Z) - Lattice-Based Methods Surpass Sum-of-Squares in Clustering [98.46302040220395]
Clustering is a fundamental primitive in unsupervised learning.
Recent work has established lower bounds against the class of low-degree methods.
We show that, perhaps surprisingly, this particular clustering model textitdoes not exhibit a statistical-to-computational gap.
arXiv Detail & Related papers (2021-12-07T18:50:17Z) - Fuzzy Clustering with Similarity Queries [56.96625809888241]
The fuzzy or soft objective is a popular generalization of the well-known $k$-means problem.
We show that by making few queries, the problem becomes easier to solve.
arXiv Detail & Related papers (2021-06-04T02:32:26Z) - Clustering of Big Data with Mixed Features [3.3504365823045044]
We develop a new clustering algorithm for large data of mixed type.
The algorithm is capable of detecting outliers and clusters of relatively lower density values.
We present experimental results to verify that our algorithm works well in practice.
arXiv Detail & Related papers (2020-11-11T19:54:38Z) - Differentially Private Clustering: Tight Approximation Ratios [57.89473217052714]
We give efficient differentially private algorithms for basic clustering problems.
Our results imply an improved algorithm for the Sample and Aggregate privacy framework.
One of the tools used in our 1-Cluster algorithm can be employed to get a faster quantum algorithm for ClosestPair in a moderate number of dimensions.
arXiv Detail & Related papers (2020-08-18T16:22:06Z) - Second-Order Guarantees in Centralized, Federated and Decentralized
Nonconvex Optimization [64.26238893241322]
Simple algorithms have been shown to lead to good empirical results in many contexts.
Several works have pursued rigorous analytical justification for studying non optimization problems.
A key insight in these analyses is that perturbations play a critical role in allowing local descent algorithms.
arXiv Detail & Related papers (2020-03-31T16:54:22Z) - Simple and Scalable Sparse k-means Clustering via Feature Ranking [14.839931533868176]
We propose a novel framework for sparse k-means clustering that is intuitive, simple to implement, and competitive with state-of-the-art algorithms.
Our core method readily generalizes to several task-specific algorithms such as clustering on subsets of attributes and in partially observed data settings.
arXiv Detail & Related papers (2020-02-20T02:41:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.