Mixed-type Distance Shrinkage and Selection for Clustering via Kernel Metric Learning
- URL: http://arxiv.org/abs/2306.01890v3
- Date: Fri, 11 Oct 2024 16:16:15 GMT
- Title: Mixed-type Distance Shrinkage and Selection for Clustering via Kernel Metric Learning
- Authors: Jesse S. Ghashti, John R. J. Thompson
- Abstract summary: We propose a metric called KDSUM that uses mixed kernels to measure dissimilarity.
We demonstrate that KDSUM is a shrinkage method from existing mixed-type metrics to a uniform dissimilarity metric.
- Abstract: Distance-based clustering and classification are widely used in various fields to group mixed numeric and categorical data. In many algorithms, a predefined distance measurement is used to cluster data points based on their dissimilarity. While there exist numerous distance-based measures for data with purely numerical attributes, and several metrics for ordered and unordered categorical data, an efficient and accurate distance for mixed-type data that utilizes the continuous and discrete properties simultaneously is an open problem. Many metrics convert numerical attributes to categorical ones or vice versa, handling all data points as a single attribute type, or compute a distance for each attribute separately and sum them. We propose a metric called KDSUM that uses mixed kernels to measure dissimilarity, with cross-validated optimal bandwidth selection. We demonstrate that KDSUM is a shrinkage method from existing mixed-type metrics to a uniform dissimilarity metric, and that it improves clustering accuracy when utilized in existing distance-based clustering algorithms on simulated and real-world datasets containing continuous-only, categorical-only, and mixed-type data.
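The abstract describes a per-attribute, kernel-based dissimilarity with separate bandwidths for continuous and categorical attributes. The sketch below illustrates that general idea; the kernel choices (Gaussian for continuous, an Aitchison-Aitken-style match/mismatch kernel for categorical) and the fixed bandwidths are illustrative assumptions, not the paper's exact KDSUM definition, which selects bandwidths by cross-validation.

```python
import numpy as np

def mixed_kernel_dissimilarity(x, y, num_idx, cat_idx, h_num=1.0, h_cat=0.5):
    """Illustrative mixed-type dissimilarity: sum one kernel-based term per
    attribute. Gaussian kernel for continuous attributes, Aitchison-Aitken-style
    kernel for categorical ones. Bandwidths h_num and h_cat are placeholders;
    the paper selects bandwidths by cross-validation."""
    d = 0.0
    for j in num_idx:
        # Gaussian kernel similarity in (0, 1]; dissimilarity = 1 - similarity
        k = np.exp(-0.5 * ((x[j] - y[j]) / h_num) ** 2)
        d += 1.0 - k
    for j in cat_idx:
        # match/mismatch kernel: weight 1 on a match, h_cat on a mismatch
        k = 1.0 if x[j] == y[j] else h_cat
        d += 1.0 - k
    return d
```

Identical points get dissimilarity 0, and each categorical mismatch contributes 1 - h_cat, so the bandwidths control how strongly each attribute type influences the total.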
Related papers
- Instance-Optimal Cluster Recovery in the Labeled Stochastic Block Model [79.46465138631592]
We devise an efficient algorithm that recovers clusters using the observed labels.
We present Instance-Adaptive Clustering (IAC), the first algorithm whose performance matches these lower bounds both in expectation and with high probability.
arXiv Detail & Related papers (2023-06-18T08:46:06Z)
- Rethinking k-means from manifold learning perspective [122.38667613245151]
We present a new clustering algorithm which directly detects clusters of data without mean estimation.
Specifically, we construct a distance matrix between data points using a Butterworth filter.
To well exploit the complementary information embedded in different views, we leverage the tensor Schatten p-norm regularization.
arXiv Detail & Related papers (2023-05-12T03:01:41Z)
- Transferable Deep Metric Learning for Clustering [1.2762298148425795]
Clustering in high-dimensional spaces is a difficult task; the usual distance metrics may no longer be appropriate under the curse of dimensionality.
We show that we can learn a metric on a labelled dataset, then apply it to cluster a different dataset.
We achieve results competitive with the state-of-the-art while using only a small number of labelled training datasets and shallow networks.
arXiv Detail & Related papers (2023-02-13T17:09:59Z)
- A new nonparametric interpoint distance-based measure for assessment of clustering [0.0]
A new interpoint distance-based measure is proposed to identify the optimal number of clusters present in a data set.
Our proposed criterion is compatible with any clustering algorithm, and can be used to determine the unknown number of clusters.
arXiv Detail & Related papers (2022-10-01T04:27:54Z)
- Anomaly Clustering: Grouping Images into Coherent Clusters of Anomaly Types [60.45942774425782]
We introduce anomaly clustering, whose goal is to group data into coherent clusters of anomaly types.
This is different from anomaly detection, whose goal is to divide anomalies from normal data.
We present a simple yet effective clustering framework using patch-based pretrained deep embeddings and off-the-shelf clustering methods.
arXiv Detail & Related papers (2021-12-21T23:11:33Z)
- Clustering Plotted Data by Image Segmentation [12.443102864446223]
Clustering algorithms are one of the main analytical methods to detect patterns in unlabeled data.
In this paper, we present a wholly different way of clustering points in 2-dimensional space, inspired by how humans cluster data.
Our approach, Visual Clustering, has several advantages over traditional clustering algorithms.
arXiv Detail & Related papers (2021-10-06T06:19:30Z)
- Kernel distance measures for time series, random fields and other structured data [71.61147615789537]
kdiff is a novel kernel-based measure for estimating distances between instances of structured data.
It accounts for both self and cross similarities across the instances and is defined using a lower quantile of the distance distribution.
Some theoretical results are provided for separability conditions using kdiff as a distance measure for clustering and classification problems.
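The kdiff summary describes a distance built from a lower quantile of the distance distribution, accounting for both self and cross similarities. The sketch below illustrates that idea for instances represented as point samples; the Euclidean base distance, the quantile level, and the exact combination of terms are illustrative assumptions, not the paper's estimator.

```python
import numpy as np

def kdiff_sketch(A, B, q=0.1):
    """Rough sketch of a kdiff-style distance between two structured instances
    A and B, each given as an (n, d) array of sample points. It compares a
    lower quantile of the cross-distance distribution against the same
    quantile of each instance's self-distances. Illustrative only."""
    def pairwise(X, Y):
        # Euclidean distances between all rows of X and all rows of Y
        diff = X[:, None, :] - Y[None, :, :]
        return np.sqrt((diff ** 2).sum(axis=-1)).ravel()

    cross = np.quantile(pairwise(A, B), q)   # cross-similarity term
    self_a = np.quantile(pairwise(A, A), q)  # self-similarity terms
    self_b = np.quantile(pairwise(B, B), q)
    return cross - 0.5 * (self_a + self_b)
```

Using a lower quantile rather than a mean makes the measure less sensitive to a few large pairwise distances within an instance.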
arXiv Detail & Related papers (2021-09-29T22:54:17Z)
- Robust Trimmed k-means [70.88503833248159]
We propose Robust Trimmed k-means (RTKM) that simultaneously identifies outliers and clusters points.
We show RTKM performs competitively with other methods on single membership data with outliers and multi-membership data without outliers.
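RTKM jointly identifies outliers and clusters. The sketch below shows the classic trimmed k-means idea that this line of work builds on: at each iteration the fraction of points farthest from their nearest centroid is excluded from the centroid update. The trimming rule, initialization, and stopping condition here are illustrative assumptions; RTKM's formulation (including multi-membership support) differs in its details.

```python
import numpy as np

def trimmed_kmeans(X, k, trim_frac=0.1, n_iter=50, seed=0):
    """Minimal sketch of classic trimmed k-means: the trim_frac fraction of
    points farthest from their nearest centroid is flagged as outliers and
    excluded when recomputing centroids. Illustrative only."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    n_trim = int(trim_frac * len(X))
    for _ in range(n_iter):
        # distance of every point to every centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        nearest = dists.argmin(axis=1)
        nearest_dist = dists.min(axis=1)
        # trim the n_trim points farthest from their nearest centroid
        keep = np.argsort(nearest_dist)[: len(X) - n_trim]
        outliers = np.setdiff1d(np.arange(len(X)), keep)
        for c in range(k):
            members = keep[nearest[keep] == c]
            if len(members):
                centroids[c] = X[members].mean(axis=0)
    return centroids, nearest, outliers
```

Because outliers are excluded from every centroid update, a single extreme point cannot drag a centroid away from its cluster, which is the robustness property the summary refers to.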
arXiv Detail & Related papers (2021-08-16T15:49:40Z)
- Similarity-based Distance for Categorical Clustering using Space Structure [5.543220407902113]
We propose a novel distance metric, similarity-based distance (SBD), to measure the distance between objects in categorical data.
Our proposed distance (SBD) significantly outperforms existing algorithms such as k-modes and other SBC-type algorithms on categorical datasets.
arXiv Detail & Related papers (2020-11-19T15:18:26Z)
- CAST: A Correlation-based Adaptive Spectral Clustering Algorithm on Multi-scale Data [34.89460002735166]
We study the problem of applying spectral clustering to cluster multi-scale data.
For multi-scale data, distance-based similarity is not effective because objects of a sparse cluster could be far apart.
We propose the algorithm CAST that applies trace Lasso to regularize the coefficient matrix.
arXiv Detail & Related papers (2020-06-08T09:46:35Z)
- Stable and consistent density-based clustering via multiparameter persistence [77.34726150561087]
We consider the degree-Rips construction from topological data analysis.
We analyze its stability to perturbations of the input data using the correspondence-interleaving distance.
We integrate these methods into a pipeline for density-based clustering, which we call Persistable.
arXiv Detail & Related papers (2020-05-18T19:45:04Z)