Spectral Clustering of Categorical and Mixed-type Data via Extra Graph Nodes
- URL: http://arxiv.org/abs/2403.05669v1
- Date: Fri, 8 Mar 2024 20:49:49 GMT
- Title: Spectral Clustering of Categorical and Mixed-type Data via Extra Graph Nodes
- Authors: Dylan Soemitro, Jeova Farias Sales Rocha Neto
- Abstract summary: This paper explores a more natural way to incorporate both numerical and categorical information into the spectral clustering algorithm.
We propose adding extra nodes corresponding to the different categories the data may belong to and show that it leads to an interpretable clustering objective function.
We demonstrate that this simple framework leads to a linear-time spectral clustering algorithm for categorical-only data.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Clustering data objects into homogeneous groups is one of the most important
tasks in data mining. Spectral clustering is arguably one of the most important
algorithms for clustering, as it is appealing for its theoretical soundness and
is adaptable to many real-world data settings. For example, mixed data, where
the data is composed of numerical and categorical features, is typically
handled via numerical discretization, dummy coding, or similarity computation
that takes into account both data types. This paper explores a more natural way
to incorporate both numerical and categorical information into the spectral
clustering algorithm, avoiding the need for data preprocessing or the use of
sophisticated similarity functions. We propose adding extra nodes corresponding
to the different categories the data may belong to and show that it leads to an
interpretable clustering objective function. Furthermore, we demonstrate that
this simple framework leads to a linear-time spectral clustering algorithm for
categorical-only data. Finally, we compare our algorithms against other
related methods and show that they provide a competitive alternative in terms
of performance and runtime.
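The extra-node idea in the abstract can be pictured with a small sketch. It is not the authors' algorithm: it assumes categorical-only data, unit edge weights between each object and its category nodes, and a two-way split taken from the sign of the second singular vector of the degree-normalized object-by-category matrix, found by power iteration. The function name `bipartition_with_category_nodes` and its parameters are hypothetical.

```python
import math
import random


def bipartition_with_category_nodes(rows, iters=200, seed=0):
    """Split categorical rows into two groups via extra category nodes.

    rows: list of equal-length, non-empty tuples of hashable values.
    Returns one 0/1 label per row (which group gets 0 is arbitrary).
    """
    # One extra node per (feature index, value) pair; each object is linked
    # to the category nodes it belongs to. Unit weights are an assumption.
    cat_index = {}
    edges = []
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            c = cat_index.setdefault((j, value), len(cat_index))
            edges.append((i, c))

    n, m = len(rows), len(cat_index)
    deg_obj = [0.0] * n
    deg_cat = [0.0] * m
    for i, c in edges:
        deg_obj[i] += 1.0
        deg_cat[c] += 1.0

    # Degree-normalized edge weights: 1 / sqrt(deg_obj[i] * deg_cat[c]).
    weights = [1.0 / math.sqrt(deg_obj[i] * deg_cat[c]) for i, c in edges]

    # The leading left singular vector is known in closed form
    # (proportional to sqrt(deg_obj)), so it can be projected out.
    total = sum(deg_obj)
    top = [math.sqrt(d / total) for d in deg_obj]

    rng = random.Random(seed)
    u = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    for _ in range(iters):
        # Remove the trivial component, then apply B_n^T followed by B_n.
        dot = sum(a * b for a, b in zip(u, top))
        u = [a - dot * b for a, b in zip(u, top)]
        v = [0.0] * m
        for (i, c), w in zip(edges, weights):
            v[c] += w * u[i]
        u_new = [0.0] * n
        for (i, c), w in zip(edges, weights):
            u_new[i] += w * v[c]
        norm = math.sqrt(sum(a * a for a in u_new))
        if norm == 0.0:
            break
        u = [a / norm for a in u_new]

    # The sign of the second singular vector gives the two groups.
    return [1 if a > 0 else 0 for a in u]


rows = [
    ("red", "small"), ("red", "small"), ("red", "medium"),
    ("blue", "large"), ("blue", "large"), ("blue", "medium"),
]
print(bipartition_with_category_nodes(rows))
```

Each pass of the loop touches every object-category edge once, so the cost per iteration grows linearly with the number of objects when the number of features is fixed. This is one way to see why a linear-time method is plausible for categorical-only data, though the paper's own construction and weighting may differ.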
Related papers
- Order Is All You Need for Categorical Data Clustering [29.264630563297466]
Categorical data composed of nominal-valued attributes are ubiquitous in knowledge discovery and data mining tasks.
Due to the lack of a well-defined metric space, categorical data distributions are difficult to understand intuitively.
This paper introduces the new finding that the order relation among attribute values is the decisive factor in clustering accuracy.
arXiv Detail & Related papers (2024-11-19T08:23:25Z)
- Instance-Optimal Cluster Recovery in the Labeled Stochastic Block Model [79.46465138631592]
We devise an efficient algorithm that recovers clusters using the observed labels.
We present Instance-Adaptive Clustering (IAC), the first algorithm whose performance matches these lower bounds both in expectation and with high probability.
arXiv Detail & Related papers (2023-06-18T08:46:06Z)
- ClusterNet: A Perception-Based Clustering Model for Scattered Data [16.326062082938215]
Cluster separation in scatterplots is a task that is typically tackled by widely used clustering techniques.
We propose a learning strategy which directly operates on scattered data.
We train ClusterNet, a point-based deep learning model, to reflect human perception of cluster separability.
arXiv Detail & Related papers (2023-04-27T13:41:12Z)
- Hard Regularization to Prevent Deep Online Clustering Collapse without Data Augmentation [65.268245109828]
Online deep clustering refers to the joint use of a feature extraction network and a clustering model to assign cluster labels to each new data point or batch as it is processed.
While faster and more versatile than offline methods, online clustering can easily reach the collapsed solution where the encoder maps all inputs to the same point and all are put into a single cluster.
We propose a method that does not require data augmentation, and that, differently from existing methods, regularizes the hard assignments.
arXiv Detail & Related papers (2023-03-29T08:23:26Z)
- Detection and Evaluation of Clusters within Sequential Data [58.720142291102135]
Clustering algorithms for Block Markov Chains possess theoretical optimality guarantees.
In particular, our sequential data is derived from human DNA, written text, animal movement data and financial markets.
It is found that the Block Markov Chain model assumption can indeed produce meaningful insights in exploratory data analyses.
arXiv Detail & Related papers (2022-10-04T15:22:39Z)
- Differentially-Private Clustering of Easy Instances [67.04951703461657]
In differentially private clustering, the goal is to identify $k$ cluster centers without disclosing information on individual data points.
We provide implementable differentially private clustering algorithms that provide utility when the data is "easy".
We propose a framework that allows us to apply non-private clustering algorithms to the easy instances and privately combine the results.
arXiv Detail & Related papers (2021-12-29T08:13:56Z)
- Clustering Plotted Data by Image Segmentation [12.443102864446223]
Clustering algorithms are one of the main analytical methods to detect patterns in unlabeled data.
In this paper, we present a wholly different way of clustering points in 2-dimensional space, inspired by how humans cluster data.
Our approach, Visual Clustering, has several advantages over traditional clustering algorithms.
arXiv Detail & Related papers (2021-10-06T06:19:30Z)
- Fuzzy clustering algorithms with distance metric learning and entropy regularization [0.0]
This paper proposes fuzzy clustering algorithms based on Euclidean, City-block and Mahalanobis distances and entropy regularization.
Several experiments on synthetic and real datasets, including its application to noisy image texture segmentation, demonstrate the usefulness of these adaptive clustering methods.
arXiv Detail & Related papers (2021-02-18T18:19:04Z)
- Scalable Hierarchical Agglomerative Clustering [65.66407726145619]
Existing scalable hierarchical clustering methods sacrifice quality for speed.
We present a scalable, agglomerative method for hierarchical clustering that does not sacrifice quality and scales to billions of data points.
arXiv Detail & Related papers (2020-10-22T15:58:35Z)
- New advances in enumerative biclustering algorithms with online partitioning [80.22629846165306]
This paper further extends RIn-Close_CVC, a biclustering algorithm capable of performing an efficient, complete, correct and non-redundant enumeration of maximal biclusters with constant values on columns in numerical datasets.
The improved algorithm, called RIn-Close_CVC3, keeps those attractive properties of RIn-Close_CVC and is characterized by a drastic reduction in memory usage and a consistent gain in runtime.
arXiv Detail & Related papers (2020-03-07T14:54:26Z)
- Point-Set Kernel Clustering [11.093960688450602]
This paper introduces a new similarity measure called point-set kernel which computes the similarity between an object and a set of objects.
We show that the new clustering procedure is both effective and efficient, which enables it to deal with large-scale datasets.
arXiv Detail & Related papers (2020-02-14T00:00:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.