Is Simple Uniform Sampling Effective for Center-Based Clustering with
Outliers: When and Why?
- URL: http://arxiv.org/abs/2103.00558v5
- Date: Tue, 3 Oct 2023 14:59:09 GMT
- Title: Is Simple Uniform Sampling Effective for Center-Based Clustering with
Outliers: When and Why?
- Authors: Jiawei Huang, Wenjie Liu and Hu Ding
- Abstract summary: We propose a simple uniform sampling framework for solving three representative center-based clustering with outliers problems.
Our analysis is fundamentally different from previous (uniform and non-uniform) sampling-based approaches.
- Score: 14.757827466271209
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Real-world datasets often contain outliers, and their presence can make
clustering problems much more challenging. In this paper, we
propose a simple uniform sampling framework for solving three representative
center-based clustering with outliers problems: $k$-center/median/means
clustering with outliers. Our analysis is fundamentally different from the
previous (uniform and non-uniform) sampling-based ideas. To explain the
effectiveness of uniform sampling in theory, we introduce a measure of
"significance" and prove that the performance of our framework depends on the
significance degree of the given instance. In particular, the sample size can
be independent of the input data size $n$ and the dimensionality $d$, if we
assume the given instance is "significant", which is in fact a fairly
reasonable assumption in practice. Due to its simplicity, the uniform sampling
approach also enjoys several significant advantages over the non-uniform
sampling approaches in practice. To the best of our knowledge, this is the
first work that systematically studies the effectiveness of uniform sampling
from both theoretical and experimental aspects.
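To make the high-level framework concrete, below is a minimal Python sketch of the "sample uniformly, then solve on the sample" idea for $k$-center clustering with outliers. This is a hypothetical illustration, not the authors' algorithm: the Gonzalez farthest-point heuristic as base solver, the quantile-based trimming step, and parameter names such as outlier_frac and sample_size are all assumptions introduced here for illustration.

```python
import numpy as np


def greedy_k_center(points, k, rng):
    """Gonzalez-style farthest-point heuristic for plain k-center (no outlier handling)."""
    centers = [points[rng.integers(len(points))]]
    for _ in range(k - 1):
        # Distance from every point to its nearest chosen center.
        dists = np.min(
            np.linalg.norm(points[:, None, :] - np.asarray(centers)[None, :, :], axis=2),
            axis=1,
        )
        centers.append(points[np.argmax(dists)])
    return np.asarray(centers)


def k_center_with_outliers_uniform_sampling(points, k, outlier_frac, sample_size, seed=0):
    """Sketch of a uniform-sampling framework for k-center with outliers.

    Illustrative only, not the paper's method: (1) draw a uniform sample,
    (2) alternate between fitting centers on the sample and trimming the
    farthest outlier_frac of sampled points, (3) return the centers.
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(points), size=min(sample_size, len(points)), replace=False)
    sample = points[idx]

    kept = sample
    for _ in range(3):  # a few trim-and-refit rounds; purely heuristic
        centers = greedy_k_center(kept, k, rng)
        dists = np.min(
            np.linalg.norm(sample[:, None, :] - centers[None, :, :], axis=2), axis=1
        )
        kept = sample[dists <= np.quantile(dists, 1.0 - outlier_frac)]
    return centers
```

With points given as an (n, d) NumPy array, a call such as k_center_with_outliers_uniform_sampling(points, k=5, outlier_frac=0.05, sample_size=2000) computes candidate centers entirely on a 2000-point sample; whether a sample size independent of $n$ and $d$ suffices is exactly what the paper's notion of a "significant" instance is meant to characterize.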
Related papers
- Mini-batch Submodular Maximization [5.439020425819001]
We present the first mini-batch algorithm for maximizing a monotone decomposable submodular function, $F=\sum_{i=1}^{N} f_i$, under a set of constraints.
We consider two sampling approaches: uniform and weighted.
Surprisingly, our experimental results show that uniform sampling is superior to weighted sampling.
arXiv Detail & Related papers (2024-01-23T04:16:58Z) - Sample Complexity of Robust Learning against Evasion Attacks [3.8888996044605855]
We study the feasibility of adversarially robust learning from the perspective of learning theory, considering sample complexity.
We show that, under the uniform distribution, the exponential dependence on the adversary's budget to robustly learn conjunctions remains inevitable.
We show that if the query radius is equal to the adversary's budget, we can develop robust empirical risk minimization algorithms in the distribution-free setting.
arXiv Detail & Related papers (2023-08-23T10:51:33Z) - Hard Sample Aware Network for Contrastive Deep Graph Clustering [38.44763843990694]
We propose a novel contrastive deep graph clustering method dubbed Hard Sample Aware Network (HSAN)
In our algorithm, the similarities between samples are calculated by considering both the attribute embeddings and the structure embeddings.
Guided by carefully collected high-confidence clustering information, our proposed weight-modulating function first recognizes positive and negative samples.
arXiv Detail & Related papers (2022-12-16T16:57:37Z) - Rethinking Collaborative Metric Learning: Toward an Efficient
Alternative without Negative Sampling [156.7248383178991]
The Collaborative Metric Learning (CML) paradigm has attracted wide interest in the area of recommendation systems (RS).
We find that negative sampling leads to a biased estimate of the generalization error.
Motivated by this, we propose an efficient alternative without negative sampling for CML, named Sampling-Free Collaborative Metric Learning (SFCML).
arXiv Detail & Related papers (2022-06-23T08:50:22Z) - Differentially-Private Clustering of Easy Instances [67.04951703461657]
In differentially private clustering, the goal is to identify $k$ cluster centers without disclosing information on individual data points.
We give implementable differentially private clustering algorithms that provide utility when the data is "easy".
We propose a framework that allows us to apply non-private clustering algorithms to the easy instances and privately combine the results.
arXiv Detail & Related papers (2021-12-29T08:13:56Z) - Scalable Personalised Item Ranking through Parametric Density Estimation [53.44830012414444]
Learning from implicit feedback is challenging because of the one-class nature of the problem.
Most conventional methods use a pairwise ranking approach and negative samplers to cope with the one-class problem.
We propose a learning-to-rank approach, which achieves convergence speed comparable to the pointwise counterpart.
arXiv Detail & Related papers (2021-05-11T03:38:16Z) - Optimal Importance Sampling for Federated Learning [57.14673504239551]
Federated learning involves a mixture of centralized and decentralized processing tasks.
The sampling of both agents and data is generally uniform; however, in this work we consider non-uniform sampling.
We derive optimal importance sampling strategies for both agent and data selection and show that non-uniform sampling without replacement improves the performance of the original FedAvg algorithm.
arXiv Detail & Related papers (2020-10-26T14:15:33Z) - Understanding Negative Sampling in Graph Representation Learning [87.35038268508414]
We show that negative sampling is as important as positive sampling in determining the optimization objective and the resulted variance.
We propose MCNS, which approximates the positive distribution with self-contrast approximation and accelerates negative sampling via Metropolis-Hastings.
We evaluate our method on 5 datasets that cover extensive downstream graph learning tasks, including link prediction, node classification and personalized recommendation.
arXiv Detail & Related papers (2020-05-20T06:25:21Z) - Compressing Large Sample Data for Discriminant Analysis [78.12073412066698]
We consider the computational issues due to large sample size within the discriminant analysis framework.
We propose a new compression approach for reducing the number of training samples for linear and quadratic discriminant analysis.
arXiv Detail & Related papers (2020-05-08T05:09:08Z)