An Efficient $k$-modes Algorithm for Clustering Categorical Datasets
- URL: http://arxiv.org/abs/2006.03936v3
- Date: Wed, 23 Jun 2021 20:18:20 GMT
- Title: An Efficient $k$-modes Algorithm for Clustering Categorical Datasets
- Authors: Karin S. Dorman and Ranjan Maitra
- Abstract summary: We provide a novel, computationally efficient implementation of $k$-modes, called OTQT.
We prove that OTQT finds updates to improve the objective function that are undetectable to existing $k$-modes algorithms.
OTQT is always more accurate per iteration and almost always faster (and only barely slower on some datasets) to the final optimum.
- Score: 8.528384027684194
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Mining clusters from data is an important endeavor in many applications. The
$k$-means method is a popular, efficient, and distribution-free approach for
clustering numerical-valued data, but does not apply for categorical-valued
observations. The $k$-modes method addresses this lacuna by replacing the
Euclidean with the Hamming distance and the means with the modes in the
$k$-means objective function. We provide a novel, computationally efficient
implementation of $k$-modes, called OTQT. We prove that OTQT finds updates to
improve the objective function that are undetectable to existing $k$-modes
algorithms. Although slightly slower per iteration due to algorithmic
complexity, OTQT is always more accurate per iteration and almost always faster
(and only barely slower on some datasets) to the final optimum. Thus, we
recommend OTQT as the preferred, default algorithm for $k$-modes optimization.
Related papers
- Log-Time K-Means Clustering for 1D Data: Novel Approaches with Proof and Implementation [0.0]
This thesis bridges theory and practice for 1D $k$-means clustering, delivering efficient and sound algorithms.
Benchmarks demonstrate over a 4500x speedup compared to scikit-learn for large datasets.
arXiv Detail & Related papers (2024-12-19T09:03:39Z) - Almost-linear Time Approximation Algorithm to Euclidean $k$-median and $k$-means [4.271492285528115]
We focus on the Euclidean $k$-median and $k$-means problems, two of the standard ways to model the task of clustering.
In this paper, we almost answer this question by presenting an almost linear-time algorithm to compute a constant-factor approximation.
arXiv Detail & Related papers (2024-07-15T20:04:06Z) - A Specialized Semismooth Newton Method for Kernel-Based Optimal
Transport [92.96250725599958]
Kernel-based optimal transport (OT) estimators offer an alternative, functional estimation procedure to address OT problems from samples.
We show that our SSN method achieves a global convergence rate of $O (1/sqrtk)$, and a local quadratic convergence rate under standard regularity conditions.
arXiv Detail & Related papers (2023-10-21T18:48:45Z) - Multi-Swap $k$-Means++ [30.967186562175893]
The $k$-means++ algorithm of Arthur and Vassilvitskii (SODA 2007) is often the practitioners' choice algorithm for optimizing the popular $k$-means clustering objective.
Lattanzi and Sohler (ICML) proposed augmenting $k$-means++ with $O(k log log k)$ local search steps to yield a $c$-approximation to the $k$-means clustering problem.
arXiv Detail & Related papers (2023-09-28T12:31:35Z) - Do you know what q-means? [50.045011844765185]
Clustering is one of the most important tools for analysis of large datasets.
We present an improved version of the "$q$-means" algorithm for clustering.
We also present a "dequantized" algorithm for $varepsilon which runs in $Obig(frack2varepsilon2(sqrtkd + log(Nd))big.
arXiv Detail & Related papers (2023-08-18T17:52:12Z) - Communication-Efficient Adam-Type Algorithms for Distributed Data Mining [93.50424502011626]
We propose a class of novel distributed Adam-type algorithms (emphi.e., SketchedAMSGrad) utilizing sketching.
Our new algorithm achieves a fast convergence rate of $O(frac1sqrtnT + frac1(k/d)2 T)$ with the communication cost of $O(k log(d))$ at each iteration.
arXiv Detail & Related papers (2022-10-14T01:42:05Z) - Systematically improving existing k-means initialization algorithms at
nearly no cost, by pairwise-nearest-neighbor smoothing [1.2570180539670577]
We present a meta-method for initializing the $k$-means clustering algorithm called PNN-smoothing.
It consists in splitting a given dataset into $J$ random subsets, clustering each of them individually, and merging the resulting clusterings with the pairwise-nearest-neighbor method.
arXiv Detail & Related papers (2022-02-08T15:56:30Z) - Fuzzy Clustering with Similarity Queries [56.96625809888241]
The fuzzy or soft objective is a popular generalization of the well-known $k$-means problem.
We show that by making few queries, the problem becomes easier to solve.
arXiv Detail & Related papers (2021-06-04T02:32:26Z) - List-Decodable Mean Estimation in Nearly-PCA Time [50.79691056481693]
We study the fundamental task of list-decodable mean estimation in high dimensions.
Our algorithm runs in time $widetildeO(ndk)$ for all $k = O(sqrtd) cup Omega(d)$, where $n$ is the size of the dataset.
A variant of our algorithm has runtime $widetildeO(ndk)$ for all $k$, at the expense of an $O(sqrtlog k)$ factor in the recovery guarantee
arXiv Detail & Related papers (2020-11-19T17:21:37Z) - Streaming Complexity of SVMs [110.63976030971106]
We study the space complexity of solving the bias-regularized SVM problem in the streaming model.
We show that for both problems, for dimensions of $frac1lambdaepsilon$, one can obtain streaming algorithms with spacely smaller than $frac1lambdaepsilon$.
arXiv Detail & Related papers (2020-07-07T17:10:00Z) - Learning Sparse Classifiers: Continuous and Mixed Integer Optimization
Perspectives [10.291482850329892]
Mixed integer programming (MIP) can be used to solve (to optimality) $ell_0$-regularized regression problems.
We propose two classes of scalable algorithms: an exact algorithm that can handlepapprox 50,000$ features in a few minutes, and approximate algorithms that can address instances with $papprox6$.
In addition, we present new estimation error bounds for $ell$-regularizeds.
arXiv Detail & Related papers (2020-01-17T18:47:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.