Structured Inverted-File k-Means Clustering for High-Dimensional Sparse
Data
- URL: http://arxiv.org/abs/2103.16141v1
- Date: Tue, 30 Mar 2021 07:54:02 GMT
- Title: Structured Inverted-File k-Means Clustering for High-Dimensional Sparse
Data
- Authors: Kazuo Aoyama and Kazumi Saito
- Abstract summary: This paper presents an architecture-friendly k-means clustering algorithm called SIVF for a large-scale and high-dimensional sparse data set.
Our performance analysis reveals that SIVF achieves the higher speed by suppressing performance degradation factors of the number of cache misses and branch mispredictions.
- Score: 2.487445341407889
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents an architecture-friendly k-means clustering algorithm
called SIVF for a large-scale and high-dimensional sparse data set. Algorithm
efficiency on time is often measured by the number of costly operations such as
similarity calculations. In practice, however, it depends greatly on how the
algorithm adapts to an architecture of the computer system which it is executed
on. Our proposed SIVF employs invariant centroid-pair based filter (ICP) to
decrease the number of similarity calculations between a data object and
centroids of all the clusters. To maximize the ICP performance, SIVF exploits
for a centroid set an inverted-file that is structured so as to reduce pipeline
hazards. We demonstrate in our experiments on real large-scale document data
sets that SIVF operates at higher speed and with lower memory consumption than
existing algorithms. Our performance analysis reveals that SIVF achieves the
higher speed by suppressing performance degradation factors of the number of
cache misses and branch mispredictions rather than less similarity
calculations.
Related papers
- A Mirror Descent-Based Algorithm for Corruption-Tolerant Distributed Gradient Descent [57.64826450787237]
We show how to analyze the behavior of distributed gradient descent algorithms in the presence of adversarial corruptions.
We show how to use ideas from (lazy) mirror descent to design a corruption-tolerant distributed optimization algorithm.
Experiments based on linear regression, support vector classification, and softmax classification on the MNIST dataset corroborate our theoretical findings.
arXiv Detail & Related papers (2024-07-19T08:29:12Z) - Adaptive Retrieval and Scalable Indexing for k-NN Search with Cross-Encoders [77.84801537608651]
Cross-encoder (CE) models which compute similarity by jointly encoding a query-item pair perform better than embedding-based models (dual-encoders) at estimating query-item relevance.
We propose a sparse-matrix factorization based method that efficiently computes latent query and item embeddings to approximate CE scores and performs k-NN search with the approximate CE similarity.
arXiv Detail & Related papers (2024-05-06T17:14:34Z) - Randomized Dimension Reduction with Statistical Guarantees [0.27195102129095]
This thesis explores some of such algorithms for fast execution and efficient data utilization.
We focus on learning algorithms with various incorporations of data augmentation that improve generalization and distributional provably.
Specifically, Chapter 4 presents a sample complexity analysis for data augmentation consistency regularization.
arXiv Detail & Related papers (2023-10-03T02:01:39Z) - Performance Embeddings: A Similarity-based Approach to Automatic
Performance Optimization [71.69092462147292]
Performance embeddings enable knowledge transfer of performance tuning between applications.
We demonstrate this transfer tuning approach on case studies in deep neural networks, dense and sparse linear algebra compositions, and numerical weather prediction stencils.
arXiv Detail & Related papers (2023-03-14T15:51:35Z) - Fast conformational clustering of extensive molecular dynamics
simulation data [19.444636864515726]
We present an unsupervised data processing workflow that is specifically designed to obtain a fast conformational clustering of long trajectories.
We combine two dimensionality reduction algorithms (cc_analysis and encodermap) with a density-based spatial clustering algorithm (HDBSCAN)
With the help of four test systems we illustrate the capability and performance of this clustering workflow.
arXiv Detail & Related papers (2023-01-11T14:36:43Z) - Random projections and Kernelised Leave One Cluster Out
Cross-Validation: Universal baselines and evaluation tools for supervised
machine learning for materials properties [10.962094053749093]
leave one cluster out cross validation (LOCO-CV) has been introduced as a way of measuring the performance of an algorithm in predicting previously unseen groups of materials.
We present a thorough comparison between composition-based representations, and investigate how kernel approximation functions can be used to enhance LOCO-CV applications.
We find that domain knowledge does not improve machine learning performance in most tasks tested, with band gap prediction being the notable exception.
arXiv Detail & Related papers (2022-06-17T15:39:39Z) - Large-scale Optimization of Partial AUC in a Range of False Positive
Rates [51.12047280149546]
The area under the ROC curve (AUC) is one of the most widely used performance measures for classification models in machine learning.
We develop an efficient approximated gradient descent method based on recent practical envelope smoothing technique.
Our proposed algorithm can also be used to minimize the sum of some ranked range loss, which also lacks efficient solvers.
arXiv Detail & Related papers (2022-03-03T03:46:18Z) - AsySQN: Faster Vertical Federated Learning Algorithms with Better
Computation Resource Utilization [159.75564904944707]
We propose an asynchronous quasi-Newton (AsySQN) framework for vertical federated learning (VFL)
The proposed algorithms make descent steps scaled by approximate without calculating the inverse Hessian matrix explicitly.
We show that the adopted asynchronous computation can make better use of the computation resource.
arXiv Detail & Related papers (2021-09-26T07:56:10Z) - Fast and Robust Iterative Closest Point [32.42799285301607]
Iterative Closest Point (ICP) is a fundamental technique for rigid registration between two point sets.
Recent work such as Sparse ICP achieves robustness via sparsity optimization at the cost of computational speed.
We show that the classical point-to-point ICP can be treated as a majorization-minimization (MM) algorithm, and propose an Anderson acceleration approach to speed up its convergence.
arXiv Detail & Related papers (2020-07-15T11:32:53Z) - New advances in enumerative biclustering algorithms with online
partitioning [80.22629846165306]
This paper further extends RIn-Close_CVC, a biclustering algorithm capable of performing an efficient, complete, correct and non-redundant enumeration of maximal biclusters with constant values on columns in numerical datasets.
The improved algorithm is called RIn-Close_CVC3, keeps those attractive properties of RIn-Close_CVC, and is characterized by: a drastic reduction in memory usage; a consistent gain in runtime.
arXiv Detail & Related papers (2020-03-07T14:54:26Z) - Inverted-File k-Means Clustering: Performance Analysis [1.3955252961896318]
inverted-file k-means clustering algorithm (IVF) suitable for a large-scale sparse data set with potentially numerous classes.
We experimentally demonstrate that IVF achieves better performance than the designed algorithms.
arXiv Detail & Related papers (2020-02-21T02:20:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.