Related papers: Categorical Data Clustering via Value Order Estimated Distance Metric Learning

URL: http://arxiv.org/abs/2411.15189v2
Date: Sun, 16 Feb 2025 12:03:08 GMT
Title: Categorical Data Clustering via Value Order Estimated Distance Metric Learning
Authors: Yiqun Zhang, Mingjie Zhao, Hong Jia, Yang Lu, Mengke Li, Yiu-ming Cheung
Abstract summary: This paper introduces a new finding that the order relation among attribute values is the decisive factor in clustering accuracy. We propose a new learning paradigm that allows joint learning of clusters and the orders. The algorithm achieves superior clustering accuracy with a convergence guarantee.
Score: 31.851890008893847
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Categorical data composed of qualitative-valued attributes are ubiquitous in machine learning tasks. Because they lack a well-defined metric space, categorical data distributions are difficult to understand intuitively. Clustering is a popular data analysis technique suited to understanding data distributions. However, the success of clustering often relies on reasonable distance metrics, which is exactly what categorical data naturally lack. This paper therefore introduces a new finding that the order relation among attribute values is the decisive factor in clustering accuracy, and is also the key to understanding categorical data clusters, because the essence of clustering is to order the clusters in terms of their admission to samples. To obtain the orders, we propose a new learning paradigm that allows joint learning of clusters and the orders. It alternately partitions the data into clusters based on the distance metric built upon the orders and estimates the most likely orders according to the clusters. The algorithm achieves superior clustering accuracy with a convergence guarantee, and the learned orders facilitate the understanding of the non-intuitive cluster distribution of categorical data. Extensive experiments with ablation studies, statistical evidence, and case studies have validated the new insight into the importance of value order and the proposed method. The source code is temporarily available at https://anonymous.4open.science/r/OCL-demo.
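The alternating scheme described in the abstract (partition with an order-based distance, then re-estimate the orders from the partition) can be illustrated with a toy sketch. This is not the authors' algorithm: the rank-gap distance, the brute-force search over value orders, the random initialization, and all function names (`rank_table`, `assign`, `estimate_order`, `order_clustering_sketch`) are assumptions made for illustration. It also assumes integer-coded attributes that all share the same small number of values `v`, and it makes no convergence claim.

```python
from itertools import permutations
import numpy as np

def rank_table(rank):
    """Assumed distance: normalized gap between the ranks of two values."""
    rank = np.asarray(rank)
    return np.abs(rank[:, None] - rank[None, :]) / max(len(rank) - 1, 1)

def value_freq(column, v):
    """Relative frequency of each value 0..v-1 in a non-empty column."""
    return np.bincount(column, minlength=v) / column.size

def assign(X, labels, k, ranks, v):
    """Assign each sample to the cluster with the smallest mean order distance."""
    cost = np.zeros((X.shape[0], k))
    for r, rank in enumerate(ranks):
        table = rank_table(rank)
        for c in range(k):
            members = X[labels == c, r]
            if members.size == 0:
                cost[:, c] = np.inf  # an empty cluster attracts no samples
                continue
            cost[:, c] += table[X[:, r]] @ value_freq(members, v)
    return cost.argmin(axis=1)

def estimate_order(column, labels, k, v):
    """Brute-force the value order that minimizes within-cluster distance."""
    best_rank, best_cost = None, np.inf
    for perm in permutations(range(v)):
        table = rank_table(perm)
        cost = 0.0
        for c in range(k):
            members = column[labels == c]
            if members.size:
                cost += (table[members] @ value_freq(members, v)).sum()
        if cost < best_cost:
            best_rank, best_cost = np.array(perm), cost
    return best_rank

def order_clustering_sketch(X, k, v, n_iter=20, seed=0):
    """Alternate between partitioning and order estimation (illustrative only)."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=X.shape[0])
    ranks = [np.arange(v) for _ in range(X.shape[1])]
    for _ in range(n_iter):
        new_labels = assign(X, labels, k, ranks, v)
        ranks = [estimate_order(X[:, r], new_labels, k, v)
                 for r in range(X.shape[1])]
        if np.array_equal(new_labels, labels):
            break  # partition stopped changing
        labels = new_labels
    return labels, ranks

# Hypothetical data: 6 samples, 3 attributes, values coded 0..2.
X = np.array([[0, 0, 1], [0, 1, 0], [0, 0, 0],
              [2, 2, 1], [2, 1, 2], [2, 2, 2]])
labels, ranks = order_clustering_sketch(X, k=2, v=3)
print(labels, ranks)
```

The exhaustive search over permutations is only practical for attributes with a handful of values; it stands in for whatever order-estimation step the paper actually uses.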

Related papers

Clustering by Attention: Leveraging Prior Fitted Transformers for Data Partitioning [3.4530027457862005]
We introduce a novel clustering approach based on meta-learning. We employ a pre-trained Prior-Data Fitted Transformer Network (PFN) to perform clustering. We show that our approach is superior to the state-of-the-art clustering techniques.
arXiv Detail & Related papers (2025-07-27T17:53:19Z)
Personalized Clustering via Targeted Representation Learning [12.685373069492448]
Clustering traditionally aims to reveal a natural grouping structure within unlabeled data. We propose a personalized clustering method that explicitly performs targeted representation learning.
arXiv Detail & Related papers (2024-12-18T10:28:51Z)
ABCDE: Application-Based Cluster Diff Evals [49.1574468325115]
It aims to be practical: it allows items to have associated importance values that are application-specific, it is frugal in its use of human judgements when determining which clustering is better, and it can report metrics for arbitrary slices of items. The approach to measuring the delta in the clustering quality is novel: instead of trying to construct an expensive ground truth up front and evaluating each clustering with respect to that, ABCDE samples questions for judgement on the basis of the actual diffs between the clusterings.
arXiv Detail & Related papers (2024-07-31T08:29:35Z)
Spectral Clustering of Categorical and Mixed-type Data via Extra Graph Nodes [0.0]
This paper explores a more natural way to incorporate both numerical and categorical information into the spectral clustering algorithm. We propose adding extra nodes corresponding to the different categories the data may belong to and show that it leads to an interpretable clustering objective function. We demonstrate that this simple framework leads to a linear-time spectral clustering algorithm for categorical-only data.
arXiv Detail & Related papers (2024-03-08T20:49:49Z)
Reinforcement Graph Clustering with Unknown Cluster Number [91.4861135742095]
We propose a new deep graph clustering method termed Reinforcement Graph Clustering. In our proposed method, cluster number determination and unsupervised representation learning are unified into a uniform framework. In order to conduct feedback actions, the clustering-oriented reward function is proposed to enhance the cohesion of the same clusters and separate the different clusters.
arXiv Detail & Related papers (2023-08-13T18:12:28Z)
Using Decision Trees for Interpretable Supervised Clustering [0.0]
Supervised clustering aims at forming clusters of labelled data with high probability densities. We are particularly interested in finding clusters of data of a given class and describing the clusters with a set of comprehensive rules.
arXiv Detail & Related papers (2023-07-16T17:12:45Z)
A testing-based approach to assess the clusterability of categorical data [6.7937877930001775]
TestCat is a testing-based approach to assess the clusterability of categorical data in terms of an analytical $p$-value. We apply our method to a set of benchmark categorical data sets, showing that TestCat outperforms those solutions for numeric data.
arXiv Detail & Related papers (2023-07-14T13:50:00Z)
Actively Supervised Clustering for Open Relation Extraction [42.114747195195655]
We present a novel setting, named actively supervised clustering for OpenRE. The key to the setting is selecting which instances to label. We propose a new strategy, which is applicable to dynamically discover clusters of unknown relations.
arXiv Detail & Related papers (2023-06-08T06:55:02Z)
Dynamic Conceptional Contrastive Learning for Generalized Category Discovery [76.82327473338734]
Generalized category discovery (GCD) aims to automatically cluster partially labeled data. Unlabeled data contain instances that are not only from known categories of the labeled data but also from novel categories. One effective way to approach GCD is to apply self-supervised learning to learn discriminative representations for unlabeled data. We propose a Dynamic Conceptional Contrastive Learning framework, which can effectively improve clustering accuracy.
arXiv Detail & Related papers (2023-03-30T14:04:39Z)
Hard Regularization to Prevent Deep Online Clustering Collapse without Data Augmentation [65.268245109828]
Online deep clustering refers to the joint use of a feature extraction network and a clustering model to assign cluster labels to each new data point or batch as it is processed. While faster and more versatile than offline methods, online clustering can easily reach the collapsed solution where the encoder maps all inputs to the same point and all are put into a single cluster. We propose a method that does not require data augmentation, and that, differently from existing methods, regularizes the hard assignments.
arXiv Detail & Related papers (2023-03-29T08:23:26Z)
Clustering Optimisation Method for Highly Connected Biological Data [0.0]
We show how a simple metric for connectivity clustering evaluation leads to an optimised segmentation of biological data. The novelty of the work resides in the creation of a simple optimisation method for clustering crowded data.
arXiv Detail & Related papers (2022-08-08T17:33:32Z)
Seeking the Truth Beyond the Data. An Unsupervised Machine Learning Approach [0.0]
Clustering is an unsupervised machine learning methodology where unlabeled elements/objects are grouped together. This article provides a deep description of the most widely used clustering methodologies. It emphasizes the comparison of these algorithms' clustering efficiency based on 3 datasets.
arXiv Detail & Related papers (2022-07-14T14:22:36Z)
You Never Cluster Alone [150.94921340034688]
We extend the mainstream contrastive learning paradigm to a cluster-level scheme, where all the data subjected to the same cluster contribute to a unified representation. We define a set of categorical variables as clustering assignment confidence, which links the instance-level learning track with the cluster-level one. By reparametrizing the assignment variables, TCC is trained end-to-end, requiring no alternating steps.
arXiv Detail & Related papers (2021-06-03T14:59:59Z)
Predictive K-means with local models [0.028675177318965035]
Predictive clustering seeks to obtain the best of the two worlds. We present two new algorithms using this technique and show on a variety of data sets that they are competitive for prediction performance.
arXiv Detail & Related papers (2020-12-16T10:49:36Z)
Scalable Hierarchical Agglomerative Clustering [65.66407726145619]
Existing scalable hierarchical clustering methods sacrifice quality for speed. We present a scalable, agglomerative method for hierarchical clustering that does not sacrifice quality and scales to billions of data points.
arXiv Detail & Related papers (2020-10-22T15:58:35Z)
Structured Graph Learning for Clustering and Semi-supervised Classification [74.35376212789132]
We propose a graph learning framework to preserve both the local and global structure of data. Our method uses the self-expressiveness of samples to capture the global structure and an adaptive neighbor approach to respect the local structure. Our model is equivalent to a combination of kernel k-means and k-means methods under certain conditions.
arXiv Detail & Related papers (2020-08-31T08:41:20Z)

This list is automatically generated from the titles and abstracts of the papers on this site.