Related papers: Structured Inverted-File k-Means Clustering for High-Dimensional Sparse Data

Structured Inverted-File k-Means Clustering for High-Dimensional Sparse Data

URL: http://arxiv.org/abs/2103.16141v1
Date: Tue, 30 Mar 2021 07:54:02 GMT
Title: Structured Inverted-File k-Means Clustering for High-Dimensional Sparse Data
Authors: Kazuo Aoyama and Kazumi Saito
Abstract summary: This paper presents an architecture-friendly k-means clustering algorithm called SIVF for a large-scale and high-dimensional sparse data set. Our performance analysis reveals that SIVF achieves the higher speed by suppressing performance degradation factors of the number of cache misses and branch mispredictions.
Score: 2.487445341407889
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This paper presents an architecture-friendly k-means clustering algorithm called SIVF for a large-scale and high-dimensional sparse data set. Algorithm efficiency on time is often measured by the number of costly operations such as similarity calculations. In practice, however, it depends greatly on how the algorithm adapts to an architecture of the computer system which it is executed on. Our proposed SIVF employs invariant centroid-pair based filter (ICP) to decrease the number of similarity calculations between a data object and centroids of all the clusters. To maximize the ICP performance, SIVF exploits for a centroid set an inverted-file that is structured so as to reduce pipeline hazards. We demonstrate in our experiments on real large-scale document data sets that SIVF operates at higher speed and with lower memory consumption than existing algorithms. Our performance analysis reveals that SIVF achieves the higher speed by suppressing performance degradation factors of the number of cache misses and branch mispredictions rather than less similarity calculations.

Related papers

An Efficient Outlier Detection Algorithm for Data Streaming [51.56874851156008]
Traditional outlier detection methods, such as the Local Outlier Factor (LOF) algorithm, struggle with real-time data. We propose a novel approach to enhance the efficiency of LOF algorithms for online anomaly detection, named the Efficient Incremental LOF (EILOF) algorithm. The EILOF algorithm not only significantly reduces computational costs, but also systematically improves detection accuracy when the number of additional points increases.
arXiv Detail & Related papers (2025-01-02T05:12:43Z)
Accelerating spherical K-means clustering for large-scale sparse document data [0.7366405857677226]
This paper presents an accelerated spherical K-means clustering algorithm for large-scale and high-dimensional sparse document data sets. We experimentally demonstrate that our algorithm efficiently achieves superior speed performance in large-scale documents compared with algorithms using the state-of-the-art techniques.
arXiv Detail & Related papers (2024-11-18T05:50:58Z)
Robust Clustering on High-Dimensional Data with Stochastic Quantization [0.0]
This paper addresses the limitations of conventional vector quantization algorithms. It investigates the Quantization (SQ) as an alternative for high-dimensionality computation.
arXiv Detail & Related papers (2024-09-03T17:13:55Z)
A Mirror Descent-Based Algorithm for Corruption-Tolerant Distributed Gradient Descent [57.64826450787237]
We show how to analyze the behavior of distributed gradient descent algorithms in the presence of adversarial corruptions. We show how to use ideas from (lazy) mirror descent to design a corruption-tolerant distributed optimization algorithm. Experiments based on linear regression, support vector classification, and softmax classification on the MNIST dataset corroborate our theoretical findings.
arXiv Detail & Related papers (2024-07-19T08:29:12Z)
Adaptive Retrieval and Scalable Indexing for k-NN Search with Cross-Encoders [77.84801537608651]
Cross-encoder (CE) models which compute similarity by jointly encoding a query-item pair perform better than embedding-based models (dual-encoders) at estimating query-item relevance. We propose a sparse-matrix factorization based method that efficiently computes latent query and item embeddings to approximate CE scores and performs k-NN search with the approximate CE similarity.
arXiv Detail & Related papers (2024-05-06T17:14:34Z)
Randomized Dimension Reduction with Statistical Guarantees [0.27195102129095]
This thesis explores some of such algorithms for fast execution and efficient data utilization. We focus on learning algorithms with various incorporations of data augmentation that improve generalization and distributional provably. Specifically, Chapter 4 presents a sample complexity analysis for data augmentation consistency regularization.
arXiv Detail & Related papers (2023-10-03T02:01:39Z)
Performance Embeddings: A Similarity-based Approach to Automatic Performance Optimization [71.69092462147292]
Performance embeddings enable knowledge transfer of performance tuning between applications. We demonstrate this transfer tuning approach on case studies in deep neural networks, dense and sparse linear algebra compositions, and numerical weather prediction stencils.
arXiv Detail & Related papers (2023-03-14T15:51:35Z)
Fast conformational clustering of extensive molecular dynamics simulation data [19.444636864515726]
We present an unsupervised data processing workflow that is specifically designed to obtain a fast conformational clustering of long trajectories. We combine two dimensionality reduction algorithms (cc_analysis and encodermap) with a density-based spatial clustering algorithm (HDBSCAN) With the help of four test systems we illustrate the capability and performance of this clustering workflow.
arXiv Detail & Related papers (2023-01-11T14:36:43Z)
Large-scale Optimization of Partial AUC in a Range of False Positive Rates [51.12047280149546]
The area under the ROC curve (AUC) is one of the most widely used performance measures for classification models in machine learning. We develop an efficient approximated gradient descent method based on recent practical envelope smoothing technique. Our proposed algorithm can also be used to minimize the sum of some ranked range loss, which also lacks efficient solvers.
arXiv Detail & Related papers (2022-03-03T03:46:18Z)
AsySQN: Faster Vertical Federated Learning Algorithms with Better Computation Resource Utilization [159.75564904944707]
We propose an asynchronous quasi-Newton (AsySQN) framework for vertical federated learning (VFL) The proposed algorithms make descent steps scaled by approximate without calculating the inverse Hessian matrix explicitly. We show that the adopted asynchronous computation can make better use of the computation resource.
arXiv Detail & Related papers (2021-09-26T07:56:10Z)
New advances in enumerative biclustering algorithms with online partitioning [80.22629846165306]
This paper further extends RIn-Close_CVC, a biclustering algorithm capable of performing an efficient, complete, correct and non-redundant enumeration of maximal biclusters with constant values on columns in numerical datasets. The improved algorithm is called RIn-Close_CVC3, keeps those attractive properties of RIn-Close_CVC, and is characterized by: a drastic reduction in memory usage; a consistent gain in runtime.
arXiv Detail & Related papers (2020-03-07T14:54:26Z)
Inverted-File k-Means Clustering: Performance Analysis [1.3955252961896318]
inverted-file k-means clustering algorithm (IVF) suitable for a large-scale sparse data set with potentially numerous classes. We experimentally demonstrate that IVF achieves better performance than the designed algorithms.
arXiv Detail & Related papers (2020-02-21T02:20:33Z)

This list is automatically generated from the titles and abstracts of the papers in this site.