A Scalable k-Medoids Clustering via Whale Optimization Algorithm
- URL: http://arxiv.org/abs/2408.16993v1
- Date: Fri, 30 Aug 2024 03:43:37 GMT
- Title: A Scalable k-Medoids Clustering via Whale Optimization Algorithm
- Authors: Huang Chenan, Narumasa Tsutsumida,
- Abstract summary: We introduce WOA-kMedoids, a novel unsupervised clustering method that incorporates the Whale Optimization Algorithm (WOA)
By optimizing centroid selection, WOA-kMedoids reduces computational complexity of the k-medoids algorithm from quadratic to near-linear with respect to the number of observations.
We evaluated the performance of WOA-kMedoids on 25 diverse time series datasets from the UCR archive.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Unsupervised clustering has emerged as a critical tool for uncovering hidden patterns and insights from vast, unlabeled datasets. However, traditional methods like Partitioning Around Medoids (PAM) struggle with scalability due to their quadratic computational complexity. To address this limitation, we introduce WOA-kMedoids, a novel unsupervised clustering method that incorporates the Whale Optimization Algorithm (WOA), a nature-inspired metaheuristic inspired by the hunting strategies of humpback whales. By optimizing centroid selection, WOA-kMedoids reduces computational complexity of the k-medoids algorithm from quadratic to near-linear with respect to the number of observations. This improvement in efficiency enables WOA-kMedoids to be scalable to large datasets while maintaining high clustering accuracy. We evaluated the performance of WOA-kMedoids on 25 diverse time series datasets from the UCR archive. Our empirical results demonstrate that WOA-kMedoids maintains clustering accuracy similar to PAM. While WOA-kMedoids exhibited slightly higher runtime than PAM on small datasets (less than 300 observations), it outperformed PAM in computational efficiency on larger datasets. The scalability of WOA-kMedoids, combined with its consistently high accuracy, positions it as a promising and practical choice for unsupervised clustering in big data applications. WOA-kMedoids has implications for efficient knowledge discovery in massive, unlabeled datasets across various domains.
Related papers
- Differential-UMamba: Rethinking Tumor Segmentation Under Limited Data Scenarios [3.1231963031043786]
We introduce Diff-UMamba, a novel architecture that combines the UNet framework with the mamba mechanism to model long-range dependencies.<n>At the heart of Diff-UMamba is a noise reduction module, which employs a signal differencing strategy to suppress noisy or irrelevant activations.<n>The architecture achieves improved segmentation accuracy and robustness, particularly in low-data settings.
arXiv Detail & Related papers (2025-07-24T08:23:11Z) - A Comparison of Deep Learning Methods for Cell Detection in Digital Cytology [1.607370483729741]
We evaluate the performance of several Deep Learning (DL) methods for cell detection in Papanicolaou-stained cytological Whole Slide Images (WSIs)
We examine recentoff-the-shelf algorithms as well as custom-designed detectors, applying them to two datasets.
Results show that centroid-based methods, particularly the Improved Fully Convolutional Regression Network (IFCRN) method, outperform segmentation-based methods in terms of both detection accuracy and computational efficiency.
arXiv Detail & Related papers (2025-04-09T15:08:12Z) - Enhanced ECG Arrhythmia Detection Accuracy by Optimizing Divergence-Based Data Fusion [5.575308369829893]
We propose a feature-based fusion algorithm utilizing Kernel Density Estimation (KDE) and Kullback-Leibler (KL) divergence.<n>Using our in-house datasets consisting of ECG signals collected from 2000 healthy and 2000 diseased individuals, we verify our method by using the publicly available PTB-XL dataset.<n>The results demonstrate that the proposed fusion method significantly enhances feature-based classification accuracy for abnormal ECG cases in the merged datasets.
arXiv Detail & Related papers (2025-03-19T12:16:48Z) - Towards Learnable Anchor for Deep Multi-View Clustering [49.767879678193005]
In this paper, we propose the Deep Multi-view Anchor Clustering (DMAC) model that performs clustering in linear time.<n>With the optimal anchors, the full sample graph is calculated to derive a discriminative embedding for clustering.<n>Experiments on several datasets demonstrate superior performance and efficiency of DMAC compared to state-of-the-art competitors.
arXiv Detail & Related papers (2025-03-16T09:38:11Z) - OneBatchPAM: A Fast and Frugal K-Medoids Algorithm [6.69456225406097]
This paper proposes a novel k-medoids approximation algorithm to handle large-scale datasets with reasonable computational time and memory complexity.
A single batch of size m n provides the estimation, which reduces the required memory size and the number of pairwise dissimilarities computations to O(mn), instead of O(n2) compared to most k-medoids baselines.
We obtain theoretical results highlighting that a batch of size m = O(log(n)) is sufficient to guarantee, with strong probability, the same performance as the original local-search algorithm.
arXiv Detail & Related papers (2025-01-31T16:48:16Z) - SMILE-UHURA Challenge -- Small Vessel Segmentation at Mesoscopic Scale from Ultra-High Resolution 7T Magnetic Resonance Angiograms [60.35639972035727]
The lack of publicly available annotated datasets has impeded the development of robust, machine learning-driven segmentation algorithms.
The SMILE-UHURA challenge addresses the gap in publicly available annotated datasets by providing an annotated dataset of Time-of-Flight angiography acquired with 7T MRI.
Dice scores reached up to 0.838 $pm$ 0.066 and 0.716 $pm$ 0.125 on the respective datasets, with an average performance of up to 0.804 $pm$ 0.15.
arXiv Detail & Related papers (2024-11-14T17:06:00Z) - K-Means Clustering With Incomplete Data with the Use of Mahalanobis Distances [0.0]
We develop a unified K-means algorithm that incorporates Mahalanobis distances, instead of the traditional Euclidean distances.
We demonstrate that our algorithm consistently outperforms both standalone imputation followed by K-means.
These results hold across both the IRIS dataset and randomly generated data with elliptical clusters.
arXiv Detail & Related papers (2024-10-31T00:05:09Z) - Boosting K-means for Big Data by Fusing Data Streaming with Global Optimization [0.3069335774032178]
K-means clustering is a cornerstone of data mining, but its efficiency deteriorates when confronted with massive datasets.
We propose a novel algorithm that leverages the Variable Neighborhood Search (VNS) metaheuristic to optimize K-means clustering for big data.
arXiv Detail & Related papers (2024-10-18T15:43:34Z) - Artificial Data Point Generation in Clustered Latent Space for Small
Medical Datasets [4.542616945567623]
This paper introduces a novel method, Artificial Data Point Generation in Clustered Latent Space (AGCL)
AGCL is designed to enhance classification performance on small medical datasets through synthetic data generation.
It was applied to Parkinson's disease screening, utilizing facial expression data.
arXiv Detail & Related papers (2024-09-26T09:51:08Z) - Minimally Supervised Learning using Topological Projections in
Self-Organizing Maps [55.31182147885694]
We introduce a semi-supervised learning approach based on topological projections in self-organizing maps (SOMs)
Our proposed method first trains SOMs on unlabeled data and then a minimal number of available labeled data points are assigned to key best matching units (BMU)
Our results indicate that the proposed minimally supervised model significantly outperforms traditional regression techniques.
arXiv Detail & Related papers (2024-01-12T22:51:48Z) - A Weighted K-Center Algorithm for Data Subset Selection [70.49696246526199]
Subset selection is a fundamental problem that can play a key role in identifying smaller portions of the training data.
We develop a novel factor 3-approximation algorithm to compute subsets based on the weighted sum of both k-center and uncertainty sampling objective functions.
arXiv Detail & Related papers (2023-12-17T04:41:07Z) - The effect of data augmentation and 3D-CNN depth on Alzheimer's Disease
detection [51.697248252191265]
This work summarizes and strictly observes best practices regarding data handling, experimental design, and model evaluation.
We focus on Alzheimer's Disease (AD) detection, which serves as a paradigmatic example of challenging problem in healthcare.
Within this framework, we train predictive 15 models, considering three different data augmentation strategies and five distinct 3D CNN architectures.
arXiv Detail & Related papers (2023-09-13T10:40:41Z) - Learnable Weight Initialization for Volumetric Medical Image Segmentation [66.3030435676252]
We propose a learnable weight-based hybrid medical image segmentation approach.
Our approach is easy to integrate into any hybrid model and requires no external training data.
Experiments on multi-organ and lung cancer segmentation tasks demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2023-06-15T17:55:05Z) - Bootstrapping Your Own Positive Sample: Contrastive Learning With
Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model.
We introduce two unique positive sampling strategies specifically tailored for EHR data.
Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
arXiv Detail & Related papers (2021-04-07T06:02:04Z) - Federated Deep AUC Maximization for Heterogeneous Data with a Constant
Communication Complexity [77.78624443410216]
We propose improved FDAM algorithms for detecting heterogeneous chest data.
A result of this paper is that the communication of the proposed algorithm is strongly independent of the number of machines and also independent of the accuracy level.
Experiments have demonstrated the effectiveness of our FDAM algorithm on benchmark datasets and on medical chest Xray images from different organizations.
arXiv Detail & Related papers (2021-02-09T04:05:19Z) - SumGNN: Multi-typed Drug Interaction Prediction via Efficient Knowledge
Graph Summarization [64.56399911605286]
We propose SumGNN: knowledge summarization graph neural network, which is enabled by a subgraph extraction module.
SumGNN outperforms the best baseline by up to 5.54%, and the performance gain is particularly significant in low data relation types.
arXiv Detail & Related papers (2020-10-04T00:14:57Z) - BanditPAM: Almost Linear Time $k$-Medoids Clustering via Multi-Armed
Bandits [16.1767275655842]
Current $k$-medoids clustering algorithms, such as Partitioning Around Medoids (PAM), are iterative and are in the dataset size $n$ for each iteration, being prohibitively expensive for large datasets.
We propose BanditPAM, a randomized algorithm inspired by techniques from multi-armed bandits, that reduces the complexity of each PAM iteration from $O(n2)$ to $O(n log n)$ and returns the same results with high probability, under assumptions on the data that often hold in practice.
We empirically validate our results on several large real-world datasets, including a coding
arXiv Detail & Related papers (2020-06-11T22:17:16Z) - ROAM: Random Layer Mixup for Semi-Supervised Learning in Medical Imaging [43.26668942258135]
Medical image segmentation is one of the major challenges addressed by machine learning methods.
We propose ROAM, a RandOm lAyer Mixup, which generates more data points that have never seen before.
ROAM achieves state-of-the-art (SOTA) results in fully supervised (89.5%) and semi-supervised (87.0%) settings with a relative improvement of up to 2.40% and 16.50%, respectively for the whole-brain segmentation.
arXiv Detail & Related papers (2020-03-20T18:07:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.