Related papers: Categorical data clustering: 25 years beyond K-modes

Categorical data clustering: 25 years beyond K-modes

URL: http://arxiv.org/abs/2408.17244v3
Date: Sat, 25 Jan 2025 02:41:46 GMT
Title: Categorical data clustering: 25 years beyond K-modes
Authors: Tai Dinh, Wong Hauchi, Philippe Fournier-Viger, Daniil Lisik, Minh-Quyet Ha, Hieu-Chi Dam, Van-Nam Huynh,
Abstract summary: categorical data clustering is a common and important task in computer science.<n>This review provides a comprehensive synthesis of categorical data clustering in the past twenty-five years.<n>It elucidates the pivotal role of categorical data clustering in diverse fields such as health sciences, natural sciences, social sciences, education, engineering and economics.
Score: 1.545264698293902
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The clustering of categorical data is a common and important task in computer science, offering profound implications across a spectrum of applications. Unlike purely numerical data, categorical data often lack inherent ordering as in nominal data, or have varying levels of order as in ordinal data, thus requiring specialized methodologies for efficient organization and analysis. This review provides a comprehensive synthesis of categorical data clustering in the past twenty-five years, starting from the introduction of K-modes. It elucidates the pivotal role of categorical data clustering in diverse fields such as health sciences, natural sciences, social sciences, education, engineering and economics. Practical comparisons are conducted for algorithms having public implementations, highlighting distinguishing clustering methodologies and revealing the performance of recent algorithms on several benchmark categorical datasets. Finally, challenges and opportunities in the field are discussed.

Related papers

Robust Categorical Data Clustering Guided by Multi-Granular Competitive Learning [47.32771052588132]
The nested granular cluster effect is prevalent in the implicit discrete distance space of categorical data.<n>We propose a Multi-Granular Competitiveization Learning algorithm to allow potential clusters to interactively tune themselves.<n>It is shown that the proposed MGCPL-guided Categorical Data Clustering approach is competent in exploring the nested distribution of multi-granular clusters.
arXiv Detail & Related papers (2026-01-23T06:33:08Z)
Mixed Data Clustering Survey and Challenges [0.0]
This paper introduces a clustering method grounded in pretopological spaces.<n> benchmarking against classical numerical clustering algorithms yields insights into the performance and effectiveness of the proposed method.
arXiv Detail & Related papers (2025-11-27T08:20:05Z)
Break the Tie: Learning Cluster-Customized Category Relationships for Categorical Data Clustering [51.11677202873771]
Categorical attributes with qualitative values are ubiquitous in cluster analysis of real datasets.<n>Unlike the Euclidean distance of numerical attributes, the categorical attributes lack well-defined relationships of their possible values.<n>This paper breaks the intrinsic relationship tie of attribute categories and learns customized distance metrics suitable for flexibly revealing various cluster distributions.
arXiv Detail & Related papers (2025-11-12T06:57:24Z)
Data clustering: an essential technique in data science [28.124442353352183]
The paper highlights key principles underpinning clustering, outlines widely used tools and frameworks, and introduces the workflow of clustering in data science. The paper concludes with insights into future research directions, emphasizing clustering's role in driving innovation and enabling data-driven decision-making.
arXiv Detail & Related papers (2024-12-25T03:14:18Z)
Order Is All You Need for Categorical Data Clustering [29.264630563297466]
Categorical data composed of nominal valued attributes are ubiquitous in knowledge discovery and data mining tasks. Due to the lack of well-defined metric space, categorical data distributions are difficult to intuitively understand. This paper introduces the new finding that the order relation among attribute values is the decisive factor in clustering accuracy.
arXiv Detail & Related papers (2024-11-19T08:23:25Z)
Spectral Clustering of Categorical and Mixed-type Data via Extra Graph Nodes [0.0]
This paper explores a more natural way to incorporate both numerical and categorical information into the spectral clustering algorithm. We propose adding extra nodes corresponding to the different categories the data may belong to and show that it leads to an interpretable clustering objective function. We demonstrate that this simple framework leads to a linear-time spectral clustering algorithm for categorical-only data.
arXiv Detail & Related papers (2024-03-08T20:49:49Z)
Incremental hierarchical text clustering methods: a review [49.32130498861987]
This study aims to analyze various hierarchical and incremental clustering techniques. The main contribution of this research is the organization and comparison of the techniques used by studies published between 2010 and 2018 that aimed to texts documents clustering.
arXiv Detail & Related papers (2023-12-12T22:27:29Z)
Dynamic Conceptional Contrastive Learning for Generalized Category Discovery [76.82327473338734]
Generalized category discovery (GCD) aims to automatically cluster partially labeled data. Unlabeled data contain instances that are not only from known categories of the labeled data but also from novel categories. One effective way for GCD is applying self-supervised learning to learn discriminate representation for unlabeled data. We propose a Dynamic Conceptional Contrastive Learning framework, which can effectively improve clustering accuracy.
arXiv Detail & Related papers (2023-03-30T14:04:39Z)
Deep Clustering: A Comprehensive Survey [53.387957674512585]
Clustering analysis plays an indispensable role in machine learning and data mining. Deep clustering, which can learn clustering-friendly representations using deep neural networks, has been broadly applied in a wide range of clustering tasks. Existing surveys for deep clustering mainly focus on the single-view fields and the network architectures, ignoring the complex application scenarios of clustering.
arXiv Detail & Related papers (2022-10-09T02:31:32Z)
Detection and Evaluation of Clusters within Sequential Data [58.720142291102135]
Clustering algorithms for Block Markov Chains possess theoretical optimality guarantees. In particular, our sequential data is derived from human DNA, written text, animal movement data and financial markets. It is found that the Block Markov Chain model assumption can indeed produce meaningful insights in exploratory data analyses.
arXiv Detail & Related papers (2022-10-04T15:22:39Z)
Hybrid Ensemble optimized algorithm based on Genetic Programming for imbalanced data classification [0.0]
We propose a hybrid ensemble algorithm based on Genetic Programming (GP) for two classes of imbalanced data classification. Experimental results show the performance of the proposed method on the specified data sets in the size of the training set shows 40% and 50% better accuracy than other dimensions of the minority class prediction.
arXiv Detail & Related papers (2021-06-02T14:14:38Z)
Through the Data Management Lens: Experimental Analysis and Evaluation of Fair Classification [75.49600684537117]
Data management research is showing an increasing presence and interest in topics related to data and algorithmic fairness. We contribute a broad analysis of 13 fair classification approaches and additional variants, over their correctness, fairness, efficiency, scalability, and stability. Our analysis highlights novel insights on the impact of different metrics and high-level approach characteristics on different aspects of performance.
arXiv Detail & Related papers (2021-01-18T22:55:40Z)
Scalable Hierarchical Agglomerative Clustering [65.66407726145619]
Existing scalable hierarchical clustering methods sacrifice quality for speed. We present a scalable, agglomerative method for hierarchical clustering that does not sacrifice quality and scales to billions of data points.
arXiv Detail & Related papers (2020-10-22T15:58:35Z)
Neural Networks as Functional Classifiers [0.0]
We extend notable deep learning methodologies to the domain of functional data for the purpose of classification problems. We highlight the effectiveness of our method in a number of classification applications such as classification of spectrographic data.
arXiv Detail & Related papers (2020-10-09T00:11:01Z)

This list is automatically generated from the titles and abstracts of the papers in this site.