Shape complexity in cluster analysis
- URL: http://arxiv.org/abs/2205.08046v2
- Date: Wed, 18 May 2022 10:59:59 GMT
- Title: Shape complexity in cluster analysis
- Authors: Eduardo J. Aguilar, Valmir C. Barbosa
- Abstract summary: In cluster analysis, a common first step is to scale the data aiming to better partition them into clusters.
Here we explore the use of multidimensional shapes of data, aiming to obtain scaling factors for use prior to clustering.
We give results on some iconic data sets, highlighting the strengths and potential weaknesses of the new approach.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In cluster analysis, a common first step is to scale the data aiming to
better partition them into clusters. Even though many different techniques have
been introduced to this end over the years, it is probably fair to say
that the workhorse in this preprocessing phase has been to divide the data by
the standard deviation along each dimension. Like division by the standard
deviation, the great majority of scaling techniques can be said to have roots
in some sort of statistical take on the data. Here we explore the use of
multidimensional shapes of data, aiming to obtain scaling factors for use prior
to clustering by some method, like k-means, that makes explicit use of
distances between samples. We borrow from the field of cosmology and related
areas the recently introduced notion of shape complexity, which in the variant
we use is a relatively simple, data-dependent nonlinear function that we show
can be used to help with the determination of appropriate scaling factors.
Focusing on what might be called "midrange" distances, we formulate a
constrained nonlinear programming problem and use it to produce candidate
scaling-factor sets that can be sifted on the basis of further considerations
of the data, say via expert knowledge. We give results on some iconic data
sets, highlighting the strengths and potential weaknesses of the new approach.
These results are generally positive across all the data sets used.
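The abstract's "workhorse" preprocessing, dividing each dimension by its standard deviation before a distance-based method such as k-means, can be sketched as follows. This is only the conventional baseline the abstract contrasts against, not the paper's shape-complexity scaling; the helper names (`scale_by_std`, `kmeans`) and the toy data are illustrative assumptions.

```python
import numpy as np

def scale_by_std(X):
    """Divide each dimension by its standard deviation -- the common
    baseline scaling the abstract refers to (illustrative helper)."""
    std = X.std(axis=0)
    std[std == 0] = 1.0  # leave constant dimensions unscaled
    return X / std

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal Lloyd's k-means; it uses Euclidean distances explicitly,
    which is why per-dimension scaling matters."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each sample to its nearest centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

# Two blobs separated along the first dimension, while the second dimension
# has a much larger spread; without scaling it dominates the distances.
rng = np.random.default_rng(1)
a = rng.normal([0, 0], [0.1, 50], size=(50, 2))
b = rng.normal([5, 0], [0.1, 50], size=(50, 2))
X = np.vstack([a, b])
labels, _ = kmeans(scale_by_std(X), k=2)
```

The paper's proposal can be read as replacing the `X.std(axis=0)` factors above with factors derived from shape complexity via a constrained nonlinear program.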
Related papers
- Scaling Laws for the Value of Individual Data Points in Machine Learning
We introduce a new perspective by investigating scaling behavior for the value of individual data points.
We provide learning theory to support our scaling law, and we observe empirically that it holds across diverse model classes.
Our work represents a first step towards understanding and utilizing scaling properties for the value of individual data points.
arXiv Detail & Related papers (2024-05-30T20:10:24Z) - Learning-Augmented K-Means Clustering Using Dimensional Reduction
We propose a solution that reduces the dimensionality of the dataset using Principal Component Analysis (PCA).
PCA is well-established in the literature and has become one of the most useful tools for data modeling, compression, and visualization.
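A generic sketch of the PCA step this entry describes, projecting data onto its top principal components before clustering; the function name `pca_reduce` and the random data are assumptions for illustration, and this does not reproduce the cited paper's learning-augmented pipeline.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project X onto its top principal components via SVD
    (a standard PCA sketch, not the cited paper's method)."""
    Xc = X - X.mean(axis=0)          # center each dimension
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T  # coordinates in the reduced space

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
Z = pca_reduce(X, 2)  # 10-D samples compressed to 2-D before clustering
```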
arXiv Detail & Related papers (2024-01-06T12:02:33Z) - Spatio-Temporal Surrogates for Interaction of a Jet with High
Explosives: Part II -- Clustering Extremely High-Dimensional Grid-Based Data [0.0]
In this report, we consider output data from simulations of a jet interacting with high explosives.
We show how we can use the randomness of both the random projections, and the choice of initial centroids in k-means clustering, to determine the number of clusters in our data set.
arXiv Detail & Related papers (2023-07-03T23:36:43Z) - Rethinking k-means from manifold learning perspective [122.38667613245151]
We present a new clustering algorithm which directly detects clusters of data without mean estimation.
Specifically, we construct a distance matrix between data points using a Butterworth filter.
To well exploit the complementary information embedded in different views, we leverage the tensor Schatten p-norm regularization.
arXiv Detail & Related papers (2023-05-12T03:01:41Z) - Influence of Swarm Intelligence in Data Clustering Mechanisms [0.0]
Nature-inspired, swarm-based algorithms are used for data clustering to cope with larger datasets that suffer from missing and inconsistent data.
This paper reviews the performance of these new approaches and compares which is best suited to particular problem situations.
arXiv Detail & Related papers (2023-05-07T08:40:50Z) - Research on Efficient Fuzzy Clustering Method Based on Local Fuzzy
Granular balls [67.33923111887933]
In this paper, the data is iterated fuzzily using granular-balls, and the membership degree of each data point considers only the two granular-balls in which it is located.
The formed fuzzy granular-balls set can use more processing methods in the face of different data scenarios.
arXiv Detail & Related papers (2023-03-07T01:52:55Z) - Transferable Deep Metric Learning for Clustering [1.2762298148425795]
Clustering in high-dimensional spaces is a difficult task; the usual distance metrics may no longer be appropriate under the curse of dimensionality.
We show that we can learn a metric on a labelled dataset, then apply it to cluster a different dataset.
We achieve results competitive with the state-of-the-art while using only a small number of labelled training datasets and shallow networks.
arXiv Detail & Related papers (2023-02-13T17:09:59Z) - Differentially-Private Clustering of Easy Instances [67.04951703461657]
In differentially private clustering, the goal is to identify $k$ cluster centers without disclosing information on individual data points.
We give implementable differentially private clustering algorithms that provide utility when the data is "easy".
We propose a framework that allows us to apply non-private clustering algorithms to the easy instances and privately combine the results.
arXiv Detail & Related papers (2021-12-29T08:13:56Z) - Local versions of sum-of-norms clustering [77.34726150561087]
We show that our method can separate arbitrarily close balls in the ball model.
We prove a quantitative bound on the error incurred in the clustering of disjoint connected sets.
arXiv Detail & Related papers (2021-09-20T14:45:29Z) - Improved guarantees and a multiple-descent curve for Column Subset
Selection and the Nystr\"om method [76.73096213472897]
We develop techniques which exploit spectral properties of the data matrix to obtain improved approximation guarantees.
Our approach leads to significantly better bounds for datasets with known rates of singular value decay.
We show that both our improved bounds and the multiple-descent curve can be observed on real datasets simply by varying the RBF parameter.
arXiv Detail & Related papers (2020-02-21T00:43:06Z) - Adaptive Discrete Smoothing for High-Dimensional and Nonlinear Panel
Data [4.550919471480445]
We develop a data-driven smoothing technique for high-dimensional and non-linear panel data models.
The weights are determined in a data-driven way and depend on the similarity between the corresponding functions.
We conduct a simulation study which shows that the prediction can be greatly improved by using our estimator.
arXiv Detail & Related papers (2019-12-30T09:50:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.