Similarity-based Distance for Categorical Clustering using Space
Structure
- URL: http://arxiv.org/abs/2011.09887v1
- Date: Thu, 19 Nov 2020 15:18:26 GMT
- Title: Similarity-based Distance for Categorical Clustering using Space
Structure
- Authors: Utkarsh Nath, Shikha Asrani, Rahul Katarya
- Abstract summary: We have proposed a novel distance metric, similarity-based distance (SBD), to find the distance between objects of categorical data.
Our proposed distance (SBD) significantly outperforms the existing algorithms like k-modes or other SBC type algorithms when used on categorical datasets.
- Score: 5.543220407902113
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Clustering is spotting patterns in a group of objects and grouping
similar objects together. Objects have attributes that are not always
numerical; sometimes an attribute has a domain of categories to which its
values belong. Such data is called categorical data. Many clustering
algorithms are used to group categorical data, among which the k-modes
algorithm has so far given the most significant results. Nevertheless, there
is still much that could be improved. Algorithms like k-means, fuzzy c-means
or hierarchical clustering have given far better accuracies with numerical
data. In this paper, we propose a novel distance metric, similarity-based
distance (SBD), to find the distance between objects of categorical data.
Experiments have shown that our proposed distance (SBD), when used with an
SBC (space structure based clustering) type algorithm, significantly
outperforms existing algorithms like k-modes or other SBC type algorithms
on categorical datasets.
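For orientation, the k-modes baseline named in the abstract can be sketched as follows. This is a minimal illustration of the standard simple-matching dissimilarity and per-attribute mode update that k-modes is generally described with; it is not the SBD metric proposed in the paper, whose definition is not given on this page. The function names and the toy records are hypothetical and chosen only for illustration.

```python
from collections import Counter


def simple_matching_distance(a, b):
    """Count the attributes on which two categorical records differ."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of attributes")
    return sum(x != y for x, y in zip(a, b))


def cluster_mode(records):
    """Return the per-attribute most frequent category of a non-empty cluster."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*records))


def assign(records, modes):
    """Return, for each record, the index of its nearest mode."""
    return [
        min(range(len(modes)), key=lambda k: simple_matching_distance(r, modes[k]))
        for r in records
    ]


# Hypothetical toy data: (color, size, shape)
records = [
    ("red", "small", "round"),
    ("red", "small", "square"),
    ("blue", "large", "round"),
    ("blue", "large", "square"),
    ("red", "large", "round"),
]
modes = [("red", "small", "round"), ("blue", "large", "square")]
labels = assign(records, modes)
```

A full k-modes run would alternate `assign` and `cluster_mode` until the labels stop changing. Any categorical distance could be substituted for `simple_matching_distance` in this loop, which is the role a metric such as SBD would play.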
Related papers
- Data Aggregation for Hierarchical Clustering [0.3626013617212666]
BETULA is a numerically stable version of the well known BIRCH data aggregation algorithm.
It can be used to make HAC viable on systems with constrained resources, with only small losses in clustering quality.
arXiv Detail & Related papers (2023-09-05T19:39:43Z)
- Instance-Optimal Cluster Recovery in the Labeled Stochastic Block Model [79.46465138631592]
We devise an efficient algorithm that recovers clusters using the observed labels.
We present Instance-Adaptive Clustering (IAC), the first algorithm whose performance matches these lower bounds both in expectation and with high probability.
arXiv Detail & Related papers (2023-06-18T08:46:06Z)
- Mixed-type Distance Shrinkage and Selection for Clustering via Kernel Metric Learning [0.0]
We propose a metric called KDSUM that uses mixed kernels to measure dissimilarity.
We demonstrate that KDSUM is a shrinkage method from existing mixed-type metrics to a uniform dissimilarity metric.
arXiv Detail & Related papers (2023-06-02T19:51:48Z)
- Rethinking k-means from manifold learning perspective [122.38667613245151]
We present a new clustering algorithm which directly detects clusters of data without mean estimation.
Specifically, we construct a distance matrix between data points using a Butterworth filter.
To better exploit the complementary information embedded in different views, we leverage the tensor Schatten p-norm regularization.
arXiv Detail & Related papers (2023-05-12T03:01:41Z)
- ck-means, a novel unsupervised learning method that combines fuzzy and crispy clustering methods to extract intersecting data [1.827510863075184]
This paper proposes a method to cluster data that share the same intersections between two features or more.
The main idea of this novel method is to generate fuzzy clusters of data using a Fuzzy C-Means (FCM) algorithm.
The algorithm is also able to find the optimal number of clusters for the FCM and the k-means algorithm, according to the consistency of the clusters given by the Silhouette Index (SI).
arXiv Detail & Related papers (2022-06-17T19:29:50Z)
- Robust Trimmed k-means [70.88503833248159]
We propose Robust Trimmed k-means (RTKM) that simultaneously identifies outliers and clusters points.
We show RTKM performs competitively with other methods on single membership data with outliers and multi-membership data without outliers.
arXiv Detail & Related papers (2021-08-16T15:49:40Z)
- Fuzzy Clustering with Similarity Queries [56.96625809888241]
The fuzzy or soft objective is a popular generalization of the well-known $k$-means problem.
We show that by making few queries, the problem becomes easier to solve.
arXiv Detail & Related papers (2021-06-04T02:32:26Z)
- Exact and Approximate Hierarchical Clustering Using A* [51.187990314731344]
We introduce a new approach based on A* search for clustering.
We overcome the prohibitively large search space by combining A* with a novel trellis data structure.
We empirically demonstrate that our method achieves substantially higher quality results than baselines for a particle physics use case and other clustering benchmarks.
arXiv Detail & Related papers (2021-04-14T18:15:27Z)
- Determinantal consensus clustering [77.34726150561087]
We propose the use of determinantal point processes or DPP for the random restart of clustering algorithms.
DPPs favor diversity of the center points within subsets.
We show through simulations that, contrary to DPP, this technique fails both to ensure diversity, and to obtain a good coverage of all data facets.
arXiv Detail & Related papers (2021-02-07T23:48:24Z)
- Data Structures & Algorithms for Exact Inference in Hierarchical Clustering [41.24805506595378]
We present novel dynamic-programming algorithms for exact inference in hierarchical clustering based on a novel trellis data structure.
Our algorithms scale in time and space proportional to the powerset of $N$ elements, which is super-exponentially more efficient than explicitly considering each of the $(2N-3)!!$ possible hierarchies.
arXiv Detail & Related papers (2020-02-26T17:43:53Z)
- Point-Set Kernel Clustering [11.093960688450602]
This paper introduces a new similarity measure called point-set kernel which computes the similarity between an object and a set of objects.
We show that the new clustering procedure is both effective and efficient, which enables it to deal with large-scale datasets.
arXiv Detail & Related papers (2020-02-14T00:00:03Z)