Related papers: OneBatchPAM: A Fast and Frugal K-Medoids Algorithm

OneBatchPAM: A Fast and Frugal K-Medoids Algorithm

URL: http://arxiv.org/abs/2501.19285v1
Date: Fri, 31 Jan 2025 16:48:16 GMT
Title: OneBatchPAM: A Fast and Frugal K-Medoids Algorithm
Authors: Antoine de Mathelin, Nicolas Enrique Cecchi, François Deheeger, Mathilde Mougeot, Nicolas Vayatis,
Abstract summary: This paper proposes a novel k-medoids approximation algorithm to handle large-scale datasets with reasonable computational time and memory complexity.<n>A single batch of size m n provides the estimation, which reduces the required memory size and the number of pairwise dissimilarities computations to O(mn), instead of O(n2) compared to most k-medoids baselines.<n>We obtain theoretical results highlighting that a batch of size m = O(log(n)) is sufficient to guarantee, with strong probability, the same performance as the original local-search algorithm.
Score: 6.69456225406097
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This paper proposes a novel k-medoids approximation algorithm to handle large-scale datasets with reasonable computational time and memory complexity. We develop a local-search algorithm that iteratively improves the medoid selection based on the estimation of the k-medoids objective. A single batch of size m << n provides the estimation, which reduces the required memory size and the number of pairwise dissimilarities computations to O(mn), instead of O(n^2) compared to most k-medoids baselines. We obtain theoretical results highlighting that a batch of size m = O(log(n)) is sufficient to guarantee, with strong probability, the same performance as the original local-search algorithm. Multiple experiments conducted on real datasets of various sizes and dimensions show that our algorithm provides similar performances as state-of-the-art methods such as FasterPAM and BanditPAM++ with a drastically reduced running time.

Related papers

Robust Clustering on High-Dimensional Data with Stochastic Quantization [0.0]
This paper addresses the limitations of conventional vector quantization algorithms. It investigates the Quantization (SQ) as an alternative for high-dimensionality computation.
arXiv Detail & Related papers (2024-09-03T17:13:55Z)
Quantum-Based Feature Selection for Multi-classification Problem in Complex Systems with Edge Computing [15.894122816099133]
A quantum-based feature selection algorithm for the multi-classification problem, namely, QReliefF, is proposed. Our algorithm is superior in finding the nearest neighbor, reducing the complexity from O(M) to O(sqrt(M)).
arXiv Detail & Related papers (2023-10-01T03:57:13Z)
Algorithme EM r\'egularis\'e [0.0]
This paper presents a regularized version of the EM algorithm that efficiently uses prior knowledge to cope with a small sample size. Experiments on real data highlight the good performance of the proposed algorithm for clustering purposes.
arXiv Detail & Related papers (2023-07-04T23:19:25Z)
Fast Optimal Locally Private Mean Estimation via Random Projections [58.603579803010796]
We study the problem of locally private mean estimation of high-dimensional vectors in the Euclidean ball. We propose a new algorithmic framework, ProjUnit, for private mean estimation. Our framework is deceptively simple: each randomizer projects its input to a random low-dimensional subspace, normalizes the result, and then runs an optimal algorithm.
arXiv Detail & Related papers (2023-06-07T14:07:35Z)
Accelerated Doubly Stochastic Gradient Algorithm for Large-scale Empirical Risk Minimization [23.271890743506514]
We propose a doubly algorithm with a novel accelerating multimomentum technique to solve large scale empirical risk minimization problem for learning tasks. While enjoying a provably superior convergence rate, in each iteration, such algorithm only accesses a mini batch of samples and updates a small block of variable coordinates.
arXiv Detail & Related papers (2023-04-23T14:21:29Z)
Linearized Wasserstein dimensionality reduction with approximation guarantees [65.16758672591365]
LOT Wassmap is a computationally feasible algorithm to uncover low-dimensional structures in the Wasserstein space. We show that LOT Wassmap attains correct embeddings and that the quality improves with increased sample size. We also show how LOT Wassmap significantly reduces the computational cost when compared to algorithms that depend on pairwise distance computations.
arXiv Detail & Related papers (2023-02-14T22:12:16Z)
Careful Seeding for k-Medois Clustering with Incremental k-Means++ Initialization [17.4921582710817]
K-medoids clustering is a popular variant of k-means clustering and widely used in pattern recognition and machine learning.<n>An improved k-medoids clustering algorithm, called INCKM algorithm, was recently proposed to overcome this drawback.<n>We propose a novel k-medoids clustering algorithm, called incremental k-means++ (INCKPP) algorithm, which initializes with a novel incremental manner.
arXiv Detail & Related papers (2022-07-06T02:25:35Z)
A Metaheuristic Algorithm for Large Maximum Weight Independent Set Problems [58.348679046591265]
Given a node-weighted graph, find a set of independent (mutually nonadjacent) nodes whose node-weight sum is maximum. Some of the graphs airsing in this application are large, having hundreds of thousands of nodes and hundreds of millions of edges. We develop a new local search algorithm, which is a metaheuristic in the greedy randomized adaptive search framework.
arXiv Detail & Related papers (2022-03-28T21:34:16Z)
Estimating leverage scores via rank revealing methods and randomization [50.591267188664666]
We study algorithms for estimating the statistical leverage scores of rectangular dense or sparse matrices of arbitrary rank. Our approach is based on combining rank revealing methods with compositions of dense and sparse randomized dimensionality reduction transforms.
arXiv Detail & Related papers (2021-05-23T19:21:55Z)
Towards Optimally Efficient Tree Search with Deep Learning [76.64632985696237]
This paper investigates the classical integer least-squares problem which estimates signals integer from linear models. The problem is NP-hard and often arises in diverse applications such as signal processing, bioinformatics, communications and machine learning. We propose a general hyper-accelerated tree search (HATS) algorithm by employing a deep neural network to estimate the optimal estimation for the underlying simplified memory-bounded A* algorithm.
arXiv Detail & Related papers (2021-01-07T08:00:02Z)
Differentially Private Clustering: Tight Approximation Ratios [57.89473217052714]
We give efficient differentially private algorithms for basic clustering problems. Our results imply an improved algorithm for the Sample and Aggregate privacy framework. One of the tools used in our 1-Cluster algorithm can be employed to get a faster quantum algorithm for ClosestPair in a moderate number of dimensions.
arXiv Detail & Related papers (2020-08-18T16:22:06Z)

This list is automatically generated from the titles and abstracts of the papers in this site.