Structured Inverted-File k-Means Clustering for High-Dimensional Sparse
Data
- URL: http://arxiv.org/abs/2103.16141v1
- Date: Tue, 30 Mar 2021 07:54:02 GMT
- Title: Structured Inverted-File k-Means Clustering for High-Dimensional Sparse
Data
- Authors: Kazuo Aoyama and Kazumi Saito
- Abstract summary: This paper presents an architecture-friendly k-means clustering algorithm called SIVF for a large-scale and high-dimensional sparse data set.
Our performance analysis reveals that SIVF achieves the higher speed by suppressing performance degradation factors of the number of cache misses and branch mispredictions.
- Score: 2.487445341407889
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents an architecture-friendly k-means clustering algorithm
called SIVF for a large-scale and high-dimensional sparse data set. Algorithm
efficiency on time is often measured by the number of costly operations such as
similarity calculations. In practice, however, it depends greatly on how the
algorithm adapts to an architecture of the computer system which it is executed
on. Our proposed SIVF employs invariant centroid-pair based filter (ICP) to
decrease the number of similarity calculations between a data object and
centroids of all the clusters. To maximize the ICP performance, SIVF exploits
for a centroid set an inverted-file that is structured so as to reduce pipeline
hazards. We demonstrate in our experiments on real large-scale document data
sets that SIVF operates at higher speed and with lower memory consumption than
existing algorithms. Our performance analysis reveals that SIVF achieves the
higher speed by suppressing performance degradation factors of the number of
cache misses and branch mispredictions rather than less similarity
calculations.
Related papers
- Accelerating spherical K-means clustering for large-scale sparse document data [0.7366405857677226]
This paper presents an accelerated spherical K-means clustering algorithm for large-scale and high-dimensional sparse document data sets.
We experimentally demonstrate that our algorithm efficiently achieves superior speed performance in large-scale documents compared with algorithms using the state-of-the-art techniques.
arXiv Detail & Related papers (2024-11-18T05:50:58Z) - Robust Clustering on High-Dimensional Data with Stochastic Quantization [0.0]
This paper addresses the limitations of conventional vector quantization algorithms.
It investigates the Quantization (SQ) as an alternative for high-dimensionality computation.
arXiv Detail & Related papers (2024-09-03T17:13:55Z) - A Mirror Descent-Based Algorithm for Corruption-Tolerant Distributed Gradient Descent [57.64826450787237]
We show how to analyze the behavior of distributed gradient descent algorithms in the presence of adversarial corruptions.
We show how to use ideas from (lazy) mirror descent to design a corruption-tolerant distributed optimization algorithm.
Experiments based on linear regression, support vector classification, and softmax classification on the MNIST dataset corroborate our theoretical findings.
arXiv Detail & Related papers (2024-07-19T08:29:12Z) - Adaptive Retrieval and Scalable Indexing for k-NN Search with Cross-Encoders [77.84801537608651]
Cross-encoder (CE) models which compute similarity by jointly encoding a query-item pair perform better than embedding-based models (dual-encoders) at estimating query-item relevance.
We propose a sparse-matrix factorization based method that efficiently computes latent query and item embeddings to approximate CE scores and performs k-NN search with the approximate CE similarity.
arXiv Detail & Related papers (2024-05-06T17:14:34Z) - Randomized Dimension Reduction with Statistical Guarantees [0.27195102129095]
This thesis explores some of such algorithms for fast execution and efficient data utilization.
We focus on learning algorithms with various incorporations of data augmentation that improve generalization and distributional provably.
Specifically, Chapter 4 presents a sample complexity analysis for data augmentation consistency regularization.
arXiv Detail & Related papers (2023-10-03T02:01:39Z) - Performance Embeddings: A Similarity-based Approach to Automatic
Performance Optimization [71.69092462147292]
Performance embeddings enable knowledge transfer of performance tuning between applications.
We demonstrate this transfer tuning approach on case studies in deep neural networks, dense and sparse linear algebra compositions, and numerical weather prediction stencils.
arXiv Detail & Related papers (2023-03-14T15:51:35Z) - Fast conformational clustering of extensive molecular dynamics
simulation data [19.444636864515726]
We present an unsupervised data processing workflow that is specifically designed to obtain a fast conformational clustering of long trajectories.
We combine two dimensionality reduction algorithms (cc_analysis and encodermap) with a density-based spatial clustering algorithm (HDBSCAN)
With the help of four test systems we illustrate the capability and performance of this clustering workflow.
arXiv Detail & Related papers (2023-01-11T14:36:43Z) - Large-scale Optimization of Partial AUC in a Range of False Positive
Rates [51.12047280149546]
The area under the ROC curve (AUC) is one of the most widely used performance measures for classification models in machine learning.
We develop an efficient approximated gradient descent method based on recent practical envelope smoothing technique.
Our proposed algorithm can also be used to minimize the sum of some ranked range loss, which also lacks efficient solvers.
arXiv Detail & Related papers (2022-03-03T03:46:18Z) - AsySQN: Faster Vertical Federated Learning Algorithms with Better
Computation Resource Utilization [159.75564904944707]
We propose an asynchronous quasi-Newton (AsySQN) framework for vertical federated learning (VFL)
The proposed algorithms make descent steps scaled by approximate without calculating the inverse Hessian matrix explicitly.
We show that the adopted asynchronous computation can make better use of the computation resource.
arXiv Detail & Related papers (2021-09-26T07:56:10Z) - New advances in enumerative biclustering algorithms with online
partitioning [80.22629846165306]
This paper further extends RIn-Close_CVC, a biclustering algorithm capable of performing an efficient, complete, correct and non-redundant enumeration of maximal biclusters with constant values on columns in numerical datasets.
The improved algorithm is called RIn-Close_CVC3, keeps those attractive properties of RIn-Close_CVC, and is characterized by: a drastic reduction in memory usage; a consistent gain in runtime.
arXiv Detail & Related papers (2020-03-07T14:54:26Z) - Inverted-File k-Means Clustering: Performance Analysis [1.3955252961896318]
inverted-file k-means clustering algorithm (IVF) suitable for a large-scale sparse data set with potentially numerous classes.
We experimentally demonstrate that IVF achieves better performance than the designed algorithms.
arXiv Detail & Related papers (2020-02-21T02:20:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.