Clustering Large Data Sets with Incremental Estimation of Low-density Separating Hyperplanes
- URL: http://arxiv.org/abs/2108.03442v1
- Date: Sat, 7 Aug 2021 12:45:05 GMT
- Title: Clustering Large Data Sets with Incremental Estimation of Low-density Separating Hyperplanes
- Authors: David P. Hofmeyr
- Abstract summary: An efficient method for obtaining low-density hyperplane separators in the unsupervised context is proposed.
Experiments with the proposed approach show that it is highly competitive in terms of both speed and accuracy when compared with relevant benchmarks.
- Score: 16.3460693863947
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: An efficient method for obtaining low-density hyperplane separators in the
unsupervised context is proposed. Low-density separators can be used to obtain
a partition of a set of data based on their allocations to the different sides
of the separators. The proposed method is based on applying stochastic gradient
descent to the integrated density on the hyperplane with respect to a
convolution of the underlying distribution and a smoothing kernel. In the case
where the bandwidth of the smoothing kernel is decreased towards zero, the bias
of these updates with respect to the true underlying density tends to zero, and
convergence to a minimiser of the density on the hyperplane can be obtained. A
post-processing of the partition induced by a collection of low-density
hyperplanes yields an efficient and accurate clustering method which is capable
of automatically selecting an appropriate number of clusters. Experiments with
the proposed approach show that it is highly competitive in terms of both speed
and accuracy when compared with relevant benchmarks. Code to implement the
proposed approach is available in the form of an R package from
https://github.com/DavidHofmeyr/iMDH.
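The core idea in the abstract can be illustrated with a minimal sketch. This is a hypothetical simplification, not the paper's iMDH implementation: the hyperplane normal is fixed to the leading principal direction (the paper also updates the orientation), the data are synthetic two-blob Gaussians, and the Gaussian kernel with a 1/t^(1/4) bandwidth schedule is an assumed choice. Only the hyperplane offset is refined by stochastic gradient descent on the smoothed density of the projections.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: two well-separated Gaussian blobs (illustrative only).
X = np.vstack([rng.normal(-3, 1, (200, 2)), rng.normal(3, 1, (200, 2))])

# Simplification: fix the hyperplane normal v to the leading principal
# direction; the paper's method also optimises the orientation.
X_c = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(X_c, full_matrices=False)
v = Vt[0]                       # unit normal of the hyperplane
proj = X @ v                    # 1-d projections onto the normal

b = np.median(proj)             # initial offset: median of projections
eta = 1.0                       # SGD step size
for t in range(1, 5001):
    h = t ** -0.25              # bandwidth decreased towards zero
    u = proj[rng.integers(len(proj))] - b
    # Gaussian-kernel contribution K_h(u); the derivative of the smoothed
    # density at b w.r.t. b is (u / h**2) * K_h(u), so stepping against it
    # moves b towards a local minimiser of the projected density.
    K = np.exp(-0.5 * (u / h) ** 2) / (h * np.sqrt(2 * np.pi))
    b -= eta * (u / h ** 2) * K

side = proj > b                 # partition induced by the hyperplane
```

Because the kernel contributions of points far from b vanish as the bandwidth shrinks, the offset settles in the low-density valley between the two blobs, and `side` recovers the two-cluster partition.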
Related papers
- Clustering Based on Density Propagation and Subcluster Merging [92.15924057172195]
We propose a density-based node clustering approach that automatically determines the number of clusters and can be applied in both data space and graph space.
Unlike traditional density-based clustering methods, which necessitate calculating the distance between any two nodes, our proposed technique determines density through a propagation process.
arXiv Detail & Related papers (2024-11-04T04:09:36Z)
- Spectral Clustering for Discrete Distributions [22.450518079181542]
Traditionally, clustering of discrete distributions (D2C) has been approached using Wasserstein barycenter methods.
We show that spectral clustering combined with distribution affinity measures can be more accurate and efficient than barycenter methods.
We provide theoretical guarantees for the success of our methods in clustering distributions.
arXiv Detail & Related papers (2024-01-25T03:17:03Z)
- DECWA: Density-Based Clustering using Wasserstein Distance [1.4132765964347058]
We propose a new clustering algorithm based on spatial density and probabilistic approach.
We show that our approach outperforms other state-of-the-art density-based clustering methods on a wide variety of datasets.
arXiv Detail & Related papers (2023-10-25T11:10:08Z)
- Improved Distribution Matching for Dataset Condensation [91.55972945798531]
We propose a novel dataset condensation method based on distribution matching.
Our simple yet effective method outperforms most previous optimization-oriented methods with much fewer computational resources.
arXiv Detail & Related papers (2023-07-19T04:07:33Z)
- Density Ratio Estimation via Infinitesimal Classification [85.08255198145304]
We propose DRE-infty, a divide-and-conquer approach that reduces density ratio estimation (DRE) to a series of easier subproblems.
Inspired by Monte Carlo methods, we smoothly interpolate between the two distributions via an infinite continuum of intermediate bridge distributions.
We show that our approach performs well on downstream tasks such as mutual information estimation and energy-based modeling on complex, high-dimensional datasets.
arXiv Detail & Related papers (2021-11-22T06:26:29Z)
- Density-Based Clustering with Kernel Diffusion [59.4179549482505]
A naive density corresponding to the indicator function of a unit $d$-dimensional Euclidean ball is commonly used in density-based clustering algorithms.
We propose a new kernel diffusion density function, which is adaptive to data of varying local distributional characteristics and smoothness.
arXiv Detail & Related papers (2021-10-11T09:00:33Z)
- A Note on Optimizing Distributions using Kernel Mean Embeddings [94.96262888797257]
Kernel mean embeddings represent probability measures by their infinite-dimensional mean embeddings in a reproducing kernel Hilbert space.
We show that when the kernel is characteristic, distributions with a kernel sum-of-squares density are dense.
We provide algorithms to optimize such distributions in the finite-sample setting.
arXiv Detail & Related papers (2021-06-18T08:33:45Z)
- Tensor Laplacian Regularized Low-Rank Representation for Non-uniformly Distributed Data Subspace Clustering [2.578242050187029]
Low-Rank Representation (LRR) suffers from discarding the locality information of data points in subspace clustering.
We propose a hypergraph model to facilitate having a variable number of adjacent nodes and incorporating the locality information of the data.
Experiments on artificial and real datasets demonstrate the higher accuracy and precision of the proposed method.
arXiv Detail & Related papers (2021-03-06T08:22:24Z)
- Local Graph Clustering with Network Lasso [90.66817876491052]
We study the statistical and computational properties of a network Lasso method for local graph clustering.
The clusters delivered by nLasso can be characterized elegantly via network flows between cluster boundary and seed nodes.
arXiv Detail & Related papers (2020-04-25T17:52:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.