Related papers: Similarity-based Distance for Categorical Clustering using Space Structure

Similarity-based Distance for Categorical Clustering using Space Structure

URL: http://arxiv.org/abs/2011.09887v1
Date: Thu, 19 Nov 2020 15:18:26 GMT
Title: Similarity-based Distance for Categorical Clustering using Space Structure
Authors: Utkarsh Nath, Shikha Asrani, Rahul Katarya
Abstract summary: We have proposed a novel distance metric, similarity-based distance (SBD) to find the distance between objects of categorical data. Our proposed distance (SBD) significantly outperforms the existing algorithms like k-modes or other SBC type algorithms when used on categorical datasets.
Score: 5.543220407902113
License: http://creativecommons.org/publicdomain/zero/1.0/
Abstract: Clustering is spotting pattern in a group of objects and resultantly grouping the similar objects together. Objects have attributes which are not always numerical, sometimes attributes have domain or categories to which they could belong to. Such data is called categorical data. To group categorical data many clustering algorithms are used, among which k- modes algorithm has so far given the most significant results. Nevertheless, there is still a lot which could be improved. Algorithms like k-means, fuzzy-c-means or hierarchical have given far better accuracies with numerical data. In this paper, we have proposed a novel distance metric, similarity-based distance (SBD) to find the distance between objects of categorical data. Experiments have shown that our proposed distance (SBD), when used with the SBC (space structure based clustering) type algorithm significantly outperforms the existing algorithms like k-modes or other SBC type algorithms when used on categorical datasets.

Related papers

CADM: Cluster-customized Adaptive Distance Metric for Categorical Data Clustering [54.20010572648918]
An appropriate distance metric is crucial for categorical data clustering, as the distance between categorical data cannot be directly calculated.<n>We propose a cluster-customized distance metric for categorical data clustering, which can competitively update distances based on different distributions of attributes in each cluster.
arXiv Detail & Related papers (2025-11-08T03:24:22Z)
K*-Means: A Parameter-free Clustering Algorithm [55.20132267309382]
k*-means is a novel clustering algorithm that eliminates the need to set k or any other parameters.<n>It uses the minimum description length principle to automatically determine the optimal number of clusters, k*, by splitting and merging clusters.<n>We prove that k*-means is guaranteed to converge and demonstrate experimentally that it significantly outperforms existing methods in scenarios where k is unknown.
arXiv Detail & Related papers (2025-05-17T08:41:07Z)
Categorical Data Clustering via Value Order Estimated Distance Metric Learning [53.28598689867732]
This paper introduces a novel order distance metric learning approach to intuitively represent categorical attribute values.<n>A new joint learning paradigm is developed to alternatively perform clustering and order distance metric learning.<n>The proposed method achieves superior clustering accuracy on categorical and mixed datasets.
arXiv Detail & Related papers (2024-11-19T08:23:25Z)
A general framework for distributed approximate similarity search with arbitrary distances [0.5030361857850012]
Similarity search is a central problem in domains such as information management and retrieval or data analysis. Many similarity search algorithms are designed or specifically adapted to metric distances. This paper presents GDASC, a general framework for distributed approximate similarity search that accepts arbitrary distances.
arXiv Detail & Related papers (2024-05-22T16:19:52Z)
Data Aggregation for Hierarchical Clustering [0.3626013617212666]
BETULA is a numerically stable version of the well known BIRCH data aggregation algorithm. It can be used to make HAC viable on systems with constrained resources with only small losses on clustering quality.
arXiv Detail & Related papers (2023-09-05T19:39:43Z)
Instance-Optimal Cluster Recovery in the Labeled Stochastic Block Model [79.46465138631592]
We devise an efficient algorithm that recovers clusters using the observed labels. We present Instance-Adaptive Clustering (IAC), the first algorithm whose performance matches these lower bounds both in expectation and with high probability.
arXiv Detail & Related papers (2023-06-18T08:46:06Z)
Mixed-type Distance Shrinkage and Selection for Clustering via Kernel Metric Learning [0.0]
We propose a metric called KDSUM that uses mixed kernels to measure dissimilarity. We demonstrate that KDSUM is a shrinkage method from existing mixed-type metrics to a uniform dissimilarity metric.
arXiv Detail & Related papers (2023-06-02T19:51:48Z)
Rethinking k-means from manifold learning perspective [122.38667613245151]
We present a new clustering algorithm which directly detects clusters of data without mean estimation. Specifically, we construct distance matrix between data points by Butterworth filter. To well exploit the complementary information embedded in different views, we leverage the tensor Schatten p-norm regularization.
arXiv Detail & Related papers (2023-05-12T03:01:41Z)
ck-means, a novel unsupervised learning method that combines fuzzy and crispy clustering methods to extract intersecting data [1.827510863075184]
This paper proposes a method to cluster data that share the same intersections between two features or more. The main idea of this novel method is to generate fuzzy clusters of data using a Fuzzy C-Means (FCM) algorithm. The algorithm is also able to find the optimal number of clusters for the FCM and the k-means algorithm, according to the consistency of the clusters given by the Silhouette Index (SI)
arXiv Detail & Related papers (2022-06-17T19:29:50Z)
Robust Trimmed k-means [70.88503833248159]
We propose Robust Trimmed k-means (RTKM) that simultaneously identifies outliers and clusters points. We show RTKM performs competitively with other methods on single membership data with outliers and multi-membership data without outliers.
arXiv Detail & Related papers (2021-08-16T15:49:40Z)
Fuzzy Clustering with Similarity Queries [56.96625809888241]
The fuzzy or soft objective is a popular generalization of the well-known $k$-means problem. We show that by making few queries, the problem becomes easier to solve.
arXiv Detail & Related papers (2021-06-04T02:32:26Z)
Exact and Approximate Hierarchical Clustering Using A* [51.187990314731344]
We introduce a new approach based on A* search for clustering. We overcome the prohibitively large search space by combining A* with a novel emphtrellis data structure. We empirically demonstrate that our method achieves substantially higher quality results than baselines for a particle physics use case and other clustering benchmarks.
arXiv Detail & Related papers (2021-04-14T18:15:27Z)
Determinantal consensus clustering [77.34726150561087]
We propose the use of determinantal point processes or DPP for the random restart of clustering algorithms. DPPs favor diversity of the center points within subsets. We show through simulations that, contrary to DPP, this technique fails both to ensure diversity, and to obtain a good coverage of all data facets.
arXiv Detail & Related papers (2021-02-07T23:48:24Z)
Scalable Hierarchical Agglomerative Clustering [65.66407726145619]
Existing scalable hierarchical clustering methods sacrifice quality for speed. We present a scalable, agglomerative method for hierarchical clustering that does not sacrifice quality and scales to billions of data points.
arXiv Detail & Related papers (2020-10-22T15:58:35Z)
Data Structures & Algorithms for Exact Inference in Hierarchical Clustering [41.24805506595378]
We present novel dynamic-programming algorithms for emphexact inference in hierarchical clustering based on a novel trellis data structure. Our algorithms scale in time and space proportional to the powerset of $N$ elements which is super-exponentially more efficient than explicitly considering each of the (2N-3)!! possible hierarchies.
arXiv Detail & Related papers (2020-02-26T17:43:53Z)
Point-Set Kernel Clustering [11.093960688450602]
This paper introduces a new similarity measure called point-set kernel which computes the similarity between an object and a set of objects. We show that the new clustering procedure is both effective and efficient that enables it to deal with large scale datasets.
arXiv Detail & Related papers (2020-02-14T00:00:03Z)

This list is automatically generated from the titles and abstracts of the papers in this site.