Related papers: Spectral Clustering of Categorical and Mixed-type Data via Extra Graph Nodes

URL: http://arxiv.org/abs/2403.05669v1
Date: Fri, 8 Mar 2024 20:49:49 GMT
Title: Spectral Clustering of Categorical and Mixed-type Data via Extra Graph Nodes
Authors: Dylan Soemitro, Jeova Farias Sales Rocha Neto
Abstract summary: This paper explores a more natural way to incorporate both numerical and categorical information into the spectral clustering algorithm. We propose adding extra nodes corresponding to the different categories the data may belong to and show that it leads to an interpretable clustering objective function. We demonstrate that this simple framework leads to a linear-time spectral clustering algorithm for categorical-only data.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Clustering data objects into homogeneous groups is one of the most important tasks in data mining. Spectral clustering is arguably one of the most important algorithms for clustering, as it is appealing for its theoretical soundness and is adaptable to many real-world data settings. For example, mixed data, where the data is composed of numerical and categorical features, is typically handled via numerical discretization, dummy coding, or similarity computation that takes into account both data types. This paper explores a more natural way to incorporate both numerical and categorical information into the spectral clustering algorithm, avoiding the need for data preprocessing or the use of sophisticated similarity functions. We propose adding extra nodes corresponding to the different categories the data may belong to and show that it leads to an interpretable clustering objective function. Furthermore, we demonstrate that this simple framework leads to a linear-time spectral clustering algorithm for categorical-only data. Finally, we compare the performance of our algorithms against other related methods and show that they provide a competitive alternative in terms of performance and runtime.
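
The extra-node idea in the abstract can be illustrated with a small sketch for categorical-only data. It is not the authors' implementation. It assumes unit edge weights between each object and its category nodes, and it relaxes a two-way normalized cut through the SVD of the normalized biadjacency matrix, a standard bipartite spectral technique swapped in here for illustration. The function names `category_node_biadjacency` and `two_way_spectral_split` are hypothetical, and the dense SVD is not the linear-time algorithm the abstract refers to.

```python
import numpy as np


def category_node_biadjacency(rows):
    """Build the object-by-category biadjacency matrix B.

    rows: list of equal-length tuples of categorical values.
    Each distinct (feature index, value) pair becomes one extra graph node,
    and B[i, j] = 1 when object i takes category j (unit weights assumed).
    """
    columns = {}
    for row in rows:
        for feature, value in enumerate(row):
            columns.setdefault((feature, value), len(columns))
    B = np.zeros((len(rows), len(columns)))
    for i, row in enumerate(rows):
        for feature, value in enumerate(row):
            B[i, columns[(feature, value)]] = 1.0
    return B, columns


def two_way_spectral_split(B):
    """Split the objects into two groups via the second singular vector."""
    d_obj = B.sum(axis=1)  # degree of each object node
    d_cat = B.sum(axis=0)  # degree of each category node
    # Symmetrically normalized biadjacency matrix D_obj^{-1/2} B D_cat^{-1/2}.
    Bn = B / np.sqrt(d_obj)[:, None] / np.sqrt(d_cat)[None, :]
    U, _, _ = np.linalg.svd(Bn, full_matrices=False)
    # The second left singular vector, rescaled by D_obj^{-1/2}, is the
    # relaxed cut indicator for the object nodes; threshold it at zero.
    indicator = U[:, 1] / np.sqrt(d_obj)
    return (indicator > 0).astype(int)


rows = [
    ("red", "small"), ("red", "small"), ("red", "medium"),
    ("blue", "large"), ("blue", "large"), ("blue", "medium"),
]
B, columns = category_node_biadjacency(rows)
labels = two_way_spectral_split(B)
print(labels)  # the two groups get different labels; which is 0 or 1 depends on the SVD sign
```

For more than two clusters, one would typically keep several singular vectors and run k-means on the resulting embedding; numerical features would need additional object-to-object edges, which this sketch leaves out.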

Related papers

Order Is All You Need for Categorical Data Clustering [29.264630563297466]
Categorical data composed of nominal-valued attributes are ubiquitous in knowledge discovery and data mining tasks. Due to the lack of a well-defined metric space, categorical data distributions are difficult to understand intuitively. This paper introduces the new finding that the order relation among attribute values is the decisive factor in clustering accuracy.
arXiv Detail & Related papers (2024-11-19T08:23:25Z)
Instance-Optimal Cluster Recovery in the Labeled Stochastic Block Model [79.46465138631592]
We devise an efficient algorithm that recovers clusters using the observed labels. We present Instance-Adaptive Clustering (IAC), the first algorithm whose performance matches these lower bounds both in expectation and with high probability.
arXiv Detail & Related papers (2023-06-18T08:46:06Z)
ClusterNet: A Perception-Based Clustering Model for Scattered Data [16.326062082938215]
Cluster separation in scatterplots is a task that is typically tackled by widely used clustering techniques. We propose a learning strategy which directly operates on scattered data. We train ClusterNet, a point-based deep learning model, to reflect human perception of cluster separability.
arXiv Detail & Related papers (2023-04-27T13:41:12Z)
Hard Regularization to Prevent Deep Online Clustering Collapse without Data Augmentation [65.268245109828]
Online deep clustering refers to the joint use of a feature extraction network and a clustering model to assign cluster labels to each new data point or batch as it is processed. While faster and more versatile than offline methods, online clustering can easily reach the collapsed solution where the encoder maps all inputs to the same point and all are put into a single cluster. We propose a method that does not require data augmentation and that, unlike existing methods, regularizes the hard assignments.
arXiv Detail & Related papers (2023-03-29T08:23:26Z)
Detection and Evaluation of Clusters within Sequential Data [58.720142291102135]
Clustering algorithms for Block Markov Chains possess theoretical optimality guarantees. In particular, our sequential data is derived from human DNA, written text, animal movement data and financial markets. It is found that the Block Markov Chain model assumption can indeed produce meaningful insights in exploratory data analyses.
arXiv Detail & Related papers (2022-10-04T15:22:39Z)
Differentially-Private Clustering of Easy Instances [67.04951703461657]
In differentially private clustering, the goal is to identify $k$ cluster centers without disclosing information on individual data points. We provide implementable differentially private clustering algorithms that provide utility when the data is "easy". We propose a framework that allows us to apply non-private clustering algorithms to the easy instances and privately combine the results.
arXiv Detail & Related papers (2021-12-29T08:13:56Z)
Clustering Plotted Data by Image Segmentation [12.443102864446223]
Clustering algorithms are one of the main analytical methods to detect patterns in unlabeled data. In this paper, we present a wholly different way of clustering points in 2-dimensional space, inspired by how humans cluster data. Our approach, Visual Clustering, has several advantages over traditional clustering algorithms.
arXiv Detail & Related papers (2021-10-06T06:19:30Z)
A New Parallel Adaptive Clustering and its Application to Streaming Data [0.0]
This paper presents a parallel adaptive clustering (PAC) algorithm to automatically classify data while simultaneously choosing a suitable number of classes. We develop regularized set $k$-means to efficiently cluster the results from the parallel threads. We provide theoretical analysis and numerical experiments to characterize the performance of the method.
arXiv Detail & Related papers (2021-04-06T17:18:56Z)
Graph Contrastive Clustering [131.67881457114316]
We propose a novel graph contrastive learning framework and apply it to the clustering task, yielding the Graph Contrastive Clustering (GCC) method. Specifically, on the one hand, the graph Laplacian based contrastive loss is proposed to learn more discriminative and clustering-friendly features. On the other hand, a novel graph-based contrastive learning strategy is proposed to learn more compact clustering assignments.
arXiv Detail & Related papers (2021-04-03T15:32:49Z)
Fuzzy clustering algorithms with distance metric learning and entropy regularization [0.0]
This paper proposes fuzzy clustering algorithms based on Euclidean, City-block and Mahalanobis distances and entropy regularization. Several experiments on synthetic and real datasets, including its application to noisy image texture segmentation, demonstrate the usefulness of these adaptive clustering methods.
arXiv Detail & Related papers (2021-02-18T18:19:04Z)
Scalable Hierarchical Agglomerative Clustering [65.66407726145619]
Existing scalable hierarchical clustering methods sacrifice quality for speed. We present a scalable, agglomerative method for hierarchical clustering that does not sacrifice quality and scales to billions of data points.
arXiv Detail & Related papers (2020-10-22T15:58:35Z)
New advances in enumerative biclustering algorithms with online partitioning [80.22629846165306]
This paper further extends RIn-Close_CVC, a biclustering algorithm capable of performing an efficient, complete, correct and non-redundant enumeration of maximal biclusters with constant values on columns in numerical datasets. The improved algorithm, called RIn-Close_CVC3, keeps those attractive properties of RIn-Close_CVC and is characterized by a drastic reduction in memory usage and a consistent gain in runtime.
arXiv Detail & Related papers (2020-03-07T14:54:26Z)
Point-Set Kernel Clustering [11.093960688450602]
This paper introduces a new similarity measure called point-set kernel, which computes the similarity between an object and a set of objects. We show that the new clustering procedure is both effective and efficient, which enables it to deal with large-scale datasets.
arXiv Detail & Related papers (2020-02-14T00:00:03Z)

This list is automatically generated from the titles and abstracts of the papers on this site.