Break the Tie: Learning Cluster-Customized Category Relationships for Categorical Data Clustering
- URL: http://arxiv.org/abs/2511.09049v1
- Date: Thu, 13 Nov 2025 01:28:21 GMT
- Title: Break the Tie: Learning Cluster-Customized Category Relationships for Categorical Data Clustering
- Authors: Mingjie Zhao, Zhanpei Huang, Yang Lu, Mengke Li, Yiqun Zhang, Weifeng Su, Yiu-ming Cheung,
- Abstract summary: Categorical attributes with qualitative values are ubiquitous in cluster analysis of real datasets. Unlike the Euclidean distance of numerical attributes, categorical attributes lack well-defined relationships among their possible values. This paper breaks the intrinsic relationship tie of attribute categories and learns customized distance metrics suitable for flexibly revealing various cluster distributions.
- Score: 51.11677202873771
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Categorical attributes with qualitative values are ubiquitous in cluster analysis of real datasets. Unlike numerical attributes with their Euclidean distance, categorical attributes lack well-defined relationships among their possible values (also called categories interchangeably), which hampers the exploration of compact categorical data clusters. Although most existing attempts focus on developing appropriate distance metrics, they typically assume a fixed topological relationship between categories when learning those metrics, which limits their adaptability to varying cluster structures and often leads to suboptimal clustering performance. This paper, therefore, breaks the intrinsic relationship tie of attribute categories and learns customized distance metrics suitable for flexibly and accurately revealing various cluster distributions. As a result, the fitting ability of the clustering algorithm is significantly enhanced, benefiting from the learnable category relationships. Moreover, the learned category relationships are proven to be compatible with the Euclidean distance metric, enabling a seamless extension to mixed datasets that include both numerical and categorical attributes. Comparative experiments on 12 real benchmark datasets with significance tests show the superior clustering accuracy of the proposed method, with an average ranking of 1.25, significantly better than the 5.21 average ranking of the current best-performing method.
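The core idea in the abstract, a distance metric that each cluster customizes to its own category distribution rather than a fixed Hamming-style metric, can be illustrated with a toy sketch. This is not the paper's actual algorithm; the function name and the specific distance (one minus the within-cluster frequency of a category) are assumptions chosen for a minimal, runnable k-modes-style example:

```python
import numpy as np

def fit_kmodes_custom(X, k, n_iter=20):
    """Toy k-modes-style clustering with cluster-customized category
    distances (a sketch, not the paper's method).

    Distance from object x to cluster c sums, over attributes j,
    1 - freq_c[j](x_j): the complement of how often category x_j
    appears on attribute j inside cluster c. Each cluster thereby
    induces its own view of which categories are "close", instead of
    one fixed metric shared by all clusters.
    """
    n, d = X.shape
    labels = np.arange(n) % k            # simple deterministic init
    for _ in range(n_iter):
        dist = np.full((n, k), np.inf)   # empty clusters stay at inf
        for c in range(k):
            members = X[labels == c]
            if len(members) == 0:
                continue
            cost = np.zeros(n)
            for j in range(d):
                # within-cluster frequency of each category on attribute j
                vals, counts = np.unique(members[:, j], return_counts=True)
                freq = dict(zip(vals, counts / len(members)))
                cost += np.array([1.0 - freq.get(x, 0.0) for x in X[:, j]])
            dist[:, c] = cost
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                        # converged
        labels = new_labels
    return labels

# Usage: two groups of identical categorical objects separate cleanly.
X = np.array([[0, 0], [0, 0], [0, 0], [2, 2], [2, 2], [2, 2]])
labels = fit_kmodes_custom(X, k=2)  # → [0, 0, 0, 1, 1, 1]
```

Because the per-attribute distances lie in [0, 1] and are recomputed from each cluster's own frequencies at every iteration, the metric adapts to the current partition; the paper's contribution goes further by making the category relationships themselves learnable and Euclidean-compatible.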
Related papers
- Learning Unified Distance Metric for Heterogeneous Attribute Data Clustering [60.05209293008078]
Heterogeneous Attribute Reconstruction and Representation (HARR) learning paradigm for cluster analysis. HARR is parameter-free, convergence-guaranteed, and can more effectively self-adapt to different sought numbers of clusters $k$.
arXiv Detail & Related papers (2026-03-03T08:13:16Z) - Bridging the Semantic Gap for Categorical Data Clustering via Large Language Models [64.58262227709842]
ARISE (Attention-weighted Representation with Integrated Semantic Embeddings) is presented. It builds semantic-aware representations that complement the metric space of categorical data for accurate clustering. Experiments on eight benchmark datasets demonstrate consistent improvements over seven representative counterparts.
arXiv Detail & Related papers (2026-01-03T11:37:46Z) - CADM: Cluster-customized Adaptive Distance Metric for Categorical Data Clustering [54.20010572648918]
An appropriate distance metric is crucial for categorical data clustering, as the distance between categorical data cannot be directly calculated. We propose a cluster-customized distance metric for categorical data clustering, which can competitively update distances based on the different distributions of attributes in each cluster.
arXiv Detail & Related papers (2025-11-08T03:24:22Z) - Manifold Clustering with Schatten p-norm Maximization [16.90743611125625]
We develop a new clustering framework based on manifold clustering. Specifically, the algorithm uses labels to guide the manifold structure and perform clustering on it. In order to naturally maintain the class balance in the clustering process, we maximize the Schatten p-norm of labels.
arXiv Detail & Related papers (2025-04-29T03:23:06Z) - Categorical Data Clustering via Value Order Estimated Distance Metric Learning [53.28598689867732]
This paper introduces a novel order distance metric learning approach to intuitively represent categorical attribute values. A new joint learning paradigm is developed to alternately perform clustering and order distance metric learning. The proposed method achieves superior clustering accuracy on categorical and mixed datasets.
arXiv Detail & Related papers (2024-11-19T08:23:25Z) - Mixed-type Distance Shrinkage and Selection for Clustering via Kernel Metric Learning [0.0]
We propose a metric called KDSUM that uses mixed kernels to measure dissimilarity.
We demonstrate that KDSUM is a shrinkage method from existing mixed-type metrics to a uniform dissimilarity metric.
arXiv Detail & Related papers (2023-06-02T19:51:48Z) - Scalable Hierarchical Agglomerative Clustering [65.66407726145619]
Existing scalable hierarchical clustering methods sacrifice quality for speed.
We present a scalable, agglomerative method for hierarchical clustering that does not sacrifice quality and scales to billions of data points.
arXiv Detail & Related papers (2020-10-22T15:58:35Z) - Unsupervised Heterogeneous Coupling Learning for Categorical Representation [50.1603042640492]
This work introduces a UNsupervised heTerogeneous couplIng lEarning (UNTIE) approach for representing coupled categorical data by untying the interactions between couplings.
UNTIE is efficiently optimized w.r.t. a kernel k-means objective function for unsupervised representation learning of heterogeneous and hierarchical value-to-object couplings.
The UNTIE-learned representations achieve significant performance improvements over state-of-the-art categorical representations and deep representation models.
arXiv Detail & Related papers (2020-07-21T11:23:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.