Too Much Information Kills Information: A Clustering Perspective
- URL: http://arxiv.org/abs/2009.07417v1
- Date: Wed, 16 Sep 2020 01:54:26 GMT
- Title: Too Much Information Kills Information: A Clustering Perspective
- Authors: Yicheng Xu, Vincent Chau, Chenchen Wu, Yong Zhang, Vassilis
Zissimopoulos, Yifei Zou
- Abstract summary: We propose a simple but novel approach for variance-based k-clustering tasks, which include the widely known k-means clustering.
The proposed approach picks a sampling subset from the given dataset and makes decisions based on the data information in the subset only.
Under certain assumptions, the resulting clustering provably gives a good estimate of the optimum of the variance-based objective with high probability.
- Score: 6.375668163098171
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Clustering is one of the most fundamental tools in artificial
intelligence, particularly in pattern recognition and learning theory.
In this paper, we propose a simple but novel approach for variance-based
k-clustering tasks, which include the widely known k-means clustering.
The proposed approach picks a sampling subset from the given dataset and makes
decisions based on the data information in the subset only. Under certain
assumptions, the resulting clustering provably gives a good estimate of the
optimum of the variance-based objective with high probability. Extensive
experiments on synthetic datasets and real-world datasets show that, to obtain
results competitive with the k-means method (Lloyd 1982) and the k-means++
method (Arthur and Vassilvitskii 2007), we need only 7% of the information in
the dataset. If we have up to 15% of the information in the dataset, then our
algorithm outperforms both the k-means method and the k-means++ method in at
least 80% of the clustering tasks, in terms of the quality of clustering. Also,
an extended algorithm based on the same idea guarantees a balanced k-clustering
result.
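The abstract's core idea, clustering from a sampled subset only, can be illustrated with a minimal sketch. This is not the authors' algorithm: the paper's sampling scheme, assumptions, and guarantees are not reproduced here. As a stand-in, the sketch draws a uniform random sample, runs plain Lloyd iterations on that sample alone, and then assigns every point to its nearest resulting center. The names `cluster_from_sample`, `lloyd`, `nearest`, and `sq_dist`, the default `fraction=0.07`, and the synthetic data are all assumptions made for illustration.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(p, centers):
    """Index of the center closest to point p."""
    return min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))


def lloyd(points, k, iters, rng):
    """Plain Lloyd iterations starting from k randomly chosen points."""
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # New center = mean of its group; keep the old center if the group is empty.
        centers = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else centers[j]
            for j, g in enumerate(groups)
        ]
    return centers


def cluster_from_sample(points, k, fraction=0.07, iters=20, seed=0):
    """Hypothetical helper: fit centers on a random subset, then label all points."""
    rng = random.Random(seed)
    m = min(len(points), max(k, int(len(points) * fraction)))
    sample = rng.sample(points, m)          # only this subset informs the centers
    centers = lloyd(sample, k, iters, rng)
    labels = [nearest(p, centers) for p in points]
    # Variance-based (k-means) objective evaluated on the full dataset.
    cost = sum(sq_dist(p, centers[l]) for p, l in zip(points, labels))
    return centers, labels, cost


# Synthetic data (an assumption for the demo): three Gaussian blobs in the plane.
data_rng = random.Random(1)
data = [
    (data_rng.gauss(cx, 0.5), data_rng.gauss(cy, 0.5))
    for cx, cy in [(0, 0), (5, 5), (0, 5)]
    for _ in range(200)
]
centers, labels, cost = cluster_from_sample(data, k=3)
```

The returned `cost` is the sum of squared distances over the whole dataset, so it can be compared directly against the cost of running the same `lloyd` routine on all points to see how much quality the subsampling gives up on a particular dataset.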
Related papers
- Self-Supervised Graph Embedding Clustering [70.36328717683297]
K-means one-step dimensionality reduction clustering method has made some progress in addressing the curse of dimensionality in clustering tasks.
We propose a unified framework that integrates manifold learning with K-means, resulting in the self-supervised graph embedding framework.
arXiv Detail & Related papers (2024-09-24T08:59:51Z) - A Weighted K-Center Algorithm for Data Subset Selection [70.49696246526199]
Subset selection is a fundamental problem that can play a key role in identifying smaller portions of the training data.
We develop a novel factor 3-approximation algorithm to compute subsets based on the weighted sum of both k-center and uncertainty sampling objective functions.
arXiv Detail & Related papers (2023-12-17T04:41:07Z) - Rethinking k-means from manifold learning perspective [122.38667613245151]
We present a new clustering algorithm which directly detects clusters of data without mean estimation.
Specifically, we construct a distance matrix between data points using a Butterworth filter.
To better exploit the complementary information embedded in different views, we leverage the tensor Schatten p-norm regularization.
arXiv Detail & Related papers (2023-05-12T03:01:41Z) - Influence of Swarm Intelligence in Data Clustering Mechanisms [0.0]
Nature-inspired, swarm-based algorithms are used for data clustering to cope with larger datasets that have missing or inconsistent data.
This paper reviews the performance of these new approaches and compares which is best suited to particular problem situations.
arXiv Detail & Related papers (2023-05-07T08:40:50Z) - A One-shot Framework for Distributed Clustered Learning in Heterogeneous
Environments [54.172993875654015]
The paper proposes a family of communication-efficient methods for distributed learning in heterogeneous environments.
A one-shot approach, based on local computations at the users and a clustering-based aggregation step at the server, is shown to provide strong learning guarantees.
For strongly convex problems it is shown that, as long as the number of data points per user is above a threshold, the proposed approach achieves order-optimal mean-squared error rates in terms of the sample size.
arXiv Detail & Related papers (2022-09-22T09:04:10Z) - How to Use K-means for Big Data Clustering? [2.1165011830664677]
K-means is the simplest and most widely used algorithm under the Euclidean Minimum Sum-of-Squares Clustering (MSSC) model.
We propose a new parallel scheme of using K-means and K-means++ algorithms for big data clustering.
arXiv Detail & Related papers (2022-04-14T08:18:01Z) - A sampling-based approach for efficient clustering in large datasets [0.8952229340927184]
We propose a simple and efficient clustering method for high-dimensional data with a large number of clusters.
Our contribution is substantially more efficient than k-means as it does not require an all-to-all comparison of data points and clusters.
arXiv Detail & Related papers (2021-12-29T19:15:20Z) - Differentially-Private Clustering of Easy Instances [67.04951703461657]
In differentially private clustering, the goal is to identify $k$ cluster centers without disclosing information on individual data points.
We provide implementable differentially private clustering algorithms that provide utility when the data is "easy".
We propose a framework that allows us to apply non-private clustering algorithms to the easy instances and privately combine the results.
arXiv Detail & Related papers (2021-12-29T08:13:56Z) - Robust Trimmed k-means [70.88503833248159]
We propose Robust Trimmed k-means (RTKM) that simultaneously identifies outliers and clusters points.
We show RTKM performs competitively with other methods on single membership data with outliers and multi-membership data without outliers.
arXiv Detail & Related papers (2021-08-16T15:49:40Z) - A Multi-disciplinary Ensemble Algorithm for Clustering Heterogeneous
Datasets [0.76146285961466]
We propose a new evolutionary clustering algorithm (ECAStar) based on social class ranking and meta-heuristic algorithms.
ECAStar is integrated with recombinational evolutionary operators, Levy flight optimisation, and some statistical techniques.
Experiments are conducted to evaluate the ECAStar against five conventional approaches.
arXiv Detail & Related papers (2021-01-01T07:20:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.