Related papers: A Scalable k-Medoids Clustering via Whale Optimization Algorithm

URL: http://arxiv.org/abs/2408.16993v1
Date: Fri, 30 Aug 2024 03:43:37 GMT
Title: A Scalable k-Medoids Clustering via Whale Optimization Algorithm
Authors: Huang Chenan, Narumasa Tsutsumida
Abstract summary: We introduce WOA-kMedoids, a novel unsupervised clustering method that incorporates the Whale Optimization Algorithm (WOA). By optimizing centroid selection, WOA-kMedoids reduces computational complexity of the k-medoids algorithm from quadratic to near-linear with respect to the number of observations. We evaluated the performance of WOA-kMedoids on 25 diverse time series datasets from the UCR archive.
Score: 0.0
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Unsupervised clustering has emerged as a critical tool for uncovering hidden patterns and insights from vast, unlabeled datasets. However, traditional methods like Partitioning Around Medoids (PAM) struggle with scalability due to their quadratic computational complexity. To address this limitation, we introduce WOA-kMedoids, a novel unsupervised clustering method that incorporates the Whale Optimization Algorithm (WOA), a nature-inspired metaheuristic inspired by the hunting strategies of humpback whales. By optimizing centroid selection, WOA-kMedoids reduces computational complexity of the k-medoids algorithm from quadratic to near-linear with respect to the number of observations. This improvement in efficiency enables WOA-kMedoids to be scalable to large datasets while maintaining high clustering accuracy. We evaluated the performance of WOA-kMedoids on 25 diverse time series datasets from the UCR archive. Our empirical results demonstrate that WOA-kMedoids maintains clustering accuracy similar to PAM. While WOA-kMedoids exhibited slightly higher runtime than PAM on small datasets (less than 300 observations), it outperformed PAM in computational efficiency on larger datasets. The scalability of WOA-kMedoids, combined with its consistently high accuracy, positions it as a promising and practical choice for unsupervised clustering in big data applications. WOA-kMedoids has implications for efficient knowledge discovery in massive, unlabeled datasets across various domains.
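
For readers unfamiliar with the objective that PAM and WOA-kMedoids both target, the following is a minimal sketch of the k-medoids cost and an exhaustive search over medoid sets. It is neither PAM nor the authors' WOA-based method; the function names (`kmedoids_cost`, `brute_force_kmedoids`) and the toy points are hypothetical, and Euclidean distance is assumed purely for illustration. The exhaustive search is only usable on tiny inputs, which is the scalability problem the abstract refers to.

```python
from itertools import combinations


def dist(a, b):
    """Euclidean distance between two equal-length tuples (assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def kmedoids_cost(points, medoid_idx):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(dist(p, points[m]) for m in medoid_idx) for p in points)


def brute_force_kmedoids(points, k):
    """Try every k-subset of indices as medoids; only viable for tiny n."""
    return min(
        combinations(range(len(points)), k),
        key=lambda idx: kmedoids_cost(points, idx),
    )


# Two well-separated toy groups (hypothetical data).
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
]
best = brute_force_kmedoids(points, 2)
best_cost = kmedoids_cost(points, best)
```

Unlike k-means centroids, the medoids returned here are always indices of actual observations, which is why the method works with any dissimilarity measure, including those commonly used for time series.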

Related papers

A Comparison of Deep Learning Methods for Cell Detection in Digital Cytology [1.607370483729741]
We evaluate the performance of several Deep Learning (DL) methods for cell detection in Papanicolaou-stained cytological Whole Slide Images (WSIs). We examine recent off-the-shelf algorithms as well as custom-designed detectors, applying them to two datasets. Results show that centroid-based methods, particularly the Improved Fully Convolutional Regression Network (IFCRN) method, outperform segmentation-based methods in terms of both detection accuracy and computational efficiency.
arXiv Detail & Related papers (2025-04-09T15:08:12Z)
OneBatchPAM: A Fast and Frugal K-Medoids Algorithm [6.69456225406097]
This paper proposes a novel k-medoids approximation algorithm to handle large-scale datasets with reasonable computational time and memory complexity. A single batch of size m ≪ n provides the estimation, which reduces the required memory size and the number of pairwise dissimilarity computations to O(mn), instead of the O(n^2) required by most k-medoids baselines. We obtain theoretical results highlighting that a batch of size m = O(log(n)) is sufficient to guarantee, with strong probability, the same performance as the original local-search algorithm.
arXiv Detail & Related papers (2025-01-31T16:48:16Z)
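
The idea of estimating the objective from a single batch can be sketched as follows. This is not the OneBatchPAM algorithm itself, only an illustration of how a batch of m points replaces the full sum over n points; the names `full_cost` and `batch_cost`, the uniform sampling, the rescaling by n/m, and the toy data are all assumptions made for this example.

```python
import random


def dist(a, b):
    """Euclidean distance between two equal-length tuples (assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def full_cost(points, medoid_idx):
    """Exact k-medoids cost: every point contributes one term."""
    return sum(min(dist(p, points[m]) for m in medoid_idx) for p in points)


def batch_cost(points, medoid_idx, batch_idx):
    """Estimate the cost from the batch only, rescaled by n / m."""
    scale = len(points) / len(batch_idx)
    return scale * sum(
        min(dist(points[i], points[m]) for m in medoid_idx) for i in batch_idx
    )


rng = random.Random(0)  # fixed seed so the sketch is reproducible
points = [(rng.random(), rng.random()) for _ in range(200)]
medoids = [0, 1, 2]  # arbitrary candidate medoids, for illustration only
batch = rng.sample(range(len(points)), 20)

estimate = batch_cost(points, medoids, batch)
exact = full_cost(points, medoids)
```

Evaluating one candidate medoid set on the batch touches m points rather than n, so comparing all n candidate medoids against the batch needs on the order of mn dissimilarities rather than n^2.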
K-Means Clustering With Incomplete Data with the Use of Mahalanobis Distances [0.0]
We develop a unified K-means algorithm that incorporates Mahalanobis distances, instead of the traditional Euclidean distances. We demonstrate that our algorithm consistently outperforms both standalone imputation followed by K-means. These results hold across both the IRIS dataset and randomly generated data with elliptical clusters.
arXiv Detail & Related papers (2024-10-31T00:05:09Z)
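
To make the contrast with Euclidean distance concrete, here is a minimal sketch of the Mahalanobis distance in two dimensions. It does not handle incomplete data and is not the unified algorithm described in the entry above; the function name `mahalanobis_2d` and the example covariance matrix are hypothetical.

```python
def mahalanobis_2d(x, mean, cov):
    """Mahalanobis distance of a 2-D point from `mean` under covariance `cov`.

    `cov` is a 2x2 matrix given as ((a, b), (c, d)); it is inverted in closed form.
    """
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx = (x[0] - mean[0], x[1] - mean[1])
    # Quadratic form dx^T * inv * dx.
    q = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
         + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    return q ** 0.5


# An elliptical cluster: variance 4 along the first axis, 1 along the second.
cov = ((4.0, 0.0), (0.0, 1.0))
d_long = mahalanobis_2d((2.0, 0.0), (0.0, 0.0), cov)   # along the wide axis
d_short = mahalanobis_2d((0.0, 2.0), (0.0, 0.0), cov)  # along the narrow axis
```

Both test points are at Euclidean distance 2 from the mean, but the one lying along the high-variance axis is closer in Mahalanobis terms, which is what makes the measure suitable for elliptical clusters.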
Boosting K-means for Big Data by Fusing Data Streaming with Global Optimization [0.3069335774032178]
K-means clustering is a cornerstone of data mining, but its efficiency deteriorates when confronted with massive datasets. We propose a novel algorithm that leverages the Variable Neighborhood Search (VNS) metaheuristic to optimize K-means clustering for big data.
arXiv Detail & Related papers (2024-10-18T15:43:34Z)
Artificial Data Point Generation in Clustered Latent Space for Small Medical Datasets [4.542616945567623]
This paper introduces a novel method, Artificial Data Point Generation in Clustered Latent Space (AGCL). AGCL is designed to enhance classification performance on small medical datasets through synthetic data generation. It was applied to Parkinson's disease screening, utilizing facial expression data.
arXiv Detail & Related papers (2024-09-26T09:51:08Z)
Minimally Supervised Learning using Topological Projections in Self-Organizing Maps [55.31182147885694]
We introduce a semi-supervised learning approach based on topological projections in self-organizing maps (SOMs). Our proposed method first trains SOMs on unlabeled data, and then a minimal number of available labeled data points are assigned to key best matching units (BMUs). Our results indicate that the proposed minimally supervised model significantly outperforms traditional regression techniques.
arXiv Detail & Related papers (2024-01-12T22:51:48Z)
A Weighted K-Center Algorithm for Data Subset Selection [70.49696246526199]
Subset selection is a fundamental problem that can play a key role in identifying smaller portions of the training data. We develop a novel factor 3-approximation algorithm to compute subsets based on the weighted sum of both k-center and uncertainty sampling objective functions.
arXiv Detail & Related papers (2023-12-17T04:41:07Z)
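
As background for the k-center objective mentioned in the entry above, the following sketch shows the classic greedy farthest-first heuristic for the plain, unweighted k-center problem, which is known to be a 2-approximation. It is not the weighted factor-3 algorithm that the paper proposes and it ignores the uncertainty-sampling term; the function name `farthest_first`, the `start` parameter, and the toy points are hypothetical.

```python
def dist(a, b):
    """Euclidean distance between two equal-length tuples (assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def farthest_first(points, k, start=0):
    """Greedy k-center: repeatedly add the point farthest from current centers.

    Returns the chosen center indices and the resulting covering radius.
    """
    centers = [start]
    # nearest[i] is the distance from point i to its closest chosen center.
    nearest = [dist(p, points[start]) for p in points]
    while len(centers) < k:
        nxt = max(range(len(points)), key=lambda i: nearest[i])
        centers.append(nxt)
        nearest = [min(d, dist(p, points[nxt])) for d, p in zip(nearest, points)]
    return centers, max(nearest)


points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (5, 5)]
centers, radius = farthest_first(points, 2)
```

The covering radius is the largest distance from any point to its nearest selected center, which is the quantity the k-center objective minimizes.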
The effect of data augmentation and 3D-CNN depth on Alzheimer's Disease detection [51.697248252191265]
This work summarizes and strictly observes best practices regarding data handling, experimental design, and model evaluation. We focus on Alzheimer's Disease (AD) detection, which serves as a paradigmatic example of a challenging problem in healthcare. Within this framework, we train 15 predictive models, considering three different data augmentation strategies and five distinct 3D CNN architectures.
arXiv Detail & Related papers (2023-09-13T10:40:41Z)
Learnable Weight Initialization for Volumetric Medical Image Segmentation [66.3030435676252]
We propose a learnable weight-based hybrid medical image segmentation approach. Our approach is easy to integrate into any hybrid model and requires no external training data. Experiments on multi-organ and lung cancer segmentation tasks demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2023-06-15T17:55:05Z)
Federated Deep AUC Maximization for Heterogeneous Data with a Constant Communication Complexity [77.78624443410216]
We propose improved FDAM algorithms for detecting heterogeneous chest data. A result of this paper is that the communication of the proposed algorithm is strongly independent of the number of machines and also independent of the accuracy level. Experiments have demonstrated the effectiveness of our FDAM algorithm on benchmark datasets and on medical chest Xray images from different organizations.
arXiv Detail & Related papers (2021-02-09T04:05:19Z)
SumGNN: Multi-typed Drug Interaction Prediction via Efficient Knowledge Graph Summarization [64.56399911605286]
We propose SumGNN: knowledge summarization graph neural network, which is enabled by a subgraph extraction module. SumGNN outperforms the best baseline by up to 5.54%, and the performance gain is particularly significant in low data relation types.
arXiv Detail & Related papers (2020-10-04T00:14:57Z)
BanditPAM: Almost Linear Time $k$-Medoids Clustering via Multi-Armed Bandits [16.1767275655842]
Current $k$-medoids clustering algorithms, such as Partitioning Around Medoids (PAM), are iterative and are quadratic in the dataset size $n$ for each iteration, being prohibitively expensive for large datasets. We propose BanditPAM, a randomized algorithm inspired by techniques from multi-armed bandits, that reduces the complexity of each PAM iteration from $O(n^2)$ to $O(n \log n)$ and returns the same results with high probability, under assumptions on the data that often hold in practice. We empirically validate our results on several large real-world datasets, including a coding
arXiv Detail & Related papers (2020-06-11T22:17:16Z)
ROAM: Random Layer Mixup for Semi-Supervised Learning in Medical Imaging [43.26668942258135]
Medical image segmentation is one of the major challenges addressed by machine learning methods. We propose ROAM, a RandOm lAyer Mixup, which generates more data points that have never been seen before. ROAM achieves state-of-the-art (SOTA) results in fully supervised (89.5%) and semi-supervised (87.0%) settings, with relative improvements of up to 2.40% and 16.50%, respectively, for whole-brain segmentation.
arXiv Detail & Related papers (2020-03-20T18:07:12Z)

This list is automatically generated from the titles and abstracts of the papers on this site.