Accelerating Barnes-Hut t-SNE Algorithm by Efficient Parallelization on
Multi-Core CPUs
- URL: http://arxiv.org/abs/2212.11506v1
- Date: Thu, 22 Dec 2022 06:38:40 GMT
- Title: Accelerating Barnes-Hut t-SNE Algorithm by Efficient Parallelization on
Multi-Core CPUs
- Authors: Narendra Chaudhary, Alexander Pivovar, Pavel Yakovlev, Andrey
Gorshkov, Sanchit Misra
- Abstract summary: t-SNE remains one of the most popular embedding techniques for visualizing high-dimensional data.
BH t-SNE algorithm is inefficient on existing CPU implementations.
Acc-t-SNE is up to 261x and 4x faster than scikit-learn and the state-of-the-art BH t-SNE implementation from daal4py.
- Score: 59.18990342943095
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: t-SNE remains one of the most popular embedding techniques for visualizing
high-dimensional data. Most standard packages of t-SNE, such as scikit-learn,
use the Barnes-Hut t-SNE (BH t-SNE) algorithm for large datasets. However,
existing CPU implementations of this algorithm are inefficient. In this work,
we accelerate the BH t-SNE on CPUs via cache optimizations, SIMD, parallelizing
sequential steps, and improving parallelization of multithreaded steps. Our
implementation (Acc-t-SNE) is up to 261x and 4x faster than scikit-learn and
the state-of-the-art BH t-SNE implementation from daal4py, respectively, on a
32-core Intel(R) Icelake cloud instance.
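To make the bottleneck concrete: the dense repulsive term of the t-SNE gradient is the part that the Barnes-Hut quadtree approximates in O(N log N) instead of O(N^2). The sketch below computes the exact O(N^2) gradient in NumPy for reference; it is illustrative only (not the authors' Acc-t-SNE code), and the function name and interface are hypothetical:

```python
import numpy as np

def tsne_gradient_exact(Y, P):
    """Exact O(N^2) gradient of the t-SNE KL objective.

    Y : (N, 2) low-dimensional embedding
    P : (N, N) symmetric high-dimensional affinities, zero diagonal

    BH t-SNE keeps the attractive term (sparse in practice) exact and
    approximates the dense repulsive term with a quadtree traversal.
    """
    diff = Y[:, None, :] - Y[None, :, :]             # (N, N, 2) pairwise differences
    W = 1.0 / (1.0 + np.sum(diff ** 2, axis=-1))     # Student-t kernel weights
    np.fill_diagonal(W, 0.0)
    Z = W.sum()                                      # normalization constant
    # dC/dy_i = 4 * sum_j p_ij w_ij (y_i - y_j) - 4/Z * sum_j w_ij^2 (y_i - y_j)
    attract = 4.0 * np.einsum('ij,ijk->ik', P * W, diff)
    repulse = -4.0 * np.einsum('ij,ijk->ik', W * W / Z, diff)
    return attract + repulse
```

Because both kernels are symmetric and the pairwise differences are antisymmetric, the per-point gradients sum to zero; the paper's optimizations (SIMD, cache blocking, better multithreading) target exactly this kind of pairwise-force loop.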
Related papers
- Fast, Scalable, Warm-Start Semidefinite Programming with Spectral
Bundling and Sketching [53.91395791840179]
We present Unified Spectral Bundling with Sketching (USBS), a provably correct, fast and scalable algorithm for solving massive SDPs.
USBS provides a 500x speed-up over the state-of-the-art scalable SDP solver on an instance with over 2 billion decision variables.
arXiv Detail & Related papers (2023-12-19T02:27:22Z)
- PECANN: Parallel Efficient Clustering with Graph-Based Approximate
Nearest Neighbor Search [8.15681999722805]
This paper studies density-based clustering of point sets.
It unifies the different variants of density peaks clustering into a single framework, PECANN.
We implement five clustering algorithms with PECANN and evaluate them on synthetic and real-world datasets with up to 1.28 million points and up to 1024 dimensions on a 30-core machine with two-way hyper-threading.
arXiv Detail & Related papers (2023-12-06T22:43:50Z)
- A Computationally Efficient Sparsified Online Newton Method [48.78646010774149]
Sparsified Online Newton (SONew) is a memory-efficient second-order algorithm that yields a sparsified yet effective preconditioner.
We achieve up to 30% faster convergence, 3.4% relative improvement in validation, and 80% relative improvement in training loss.
arXiv Detail & Related papers (2023-11-16T18:44:22Z)
- Massively Parallel Continuous Local Search for Hybrid SAT Solving on
GPUs [5.245714076090567]
We propose FastFourierSAT, a highly parallel hybrid SAT solver based on gradient-driven continuous local search.
Our results show that FastFourierSAT computes the gradient 100+ times faster than previous prototypes implemented on CPU.
FastFourierSAT solves most instances and demonstrates promising performance on larger-size instances.
arXiv Detail & Related papers (2023-08-29T04:50:07Z)
- INR-Arch: A Dataflow Architecture and Compiler for Arbitrary-Order
Gradient Computations in Implicit Neural Representation Processing [66.00729477511219]
Given a function represented as a computation graph, traditional architectures face challenges in efficiently computing its nth-order gradient.
We introduce INR-Arch, a framework that transforms the computation graph of an nth-order gradient into a hardware-optimized dataflow architecture.
We present results that demonstrate 1.8-4.8x and 1.5-3.6x speedup compared to CPU and GPU baselines respectively.
arXiv Detail & Related papers (2023-08-11T04:24:39Z)
- Tensor Slicing and Optimization for Multicore NPUs [2.670309629218727]
This paper proposes Tensor Slicing Optimization (TSO), a compiler optimization pass for Multicore NPUs.
Results show that TSO identifies the best tensor slicing that minimizes execution time for a set of CNN models.
arXiv Detail & Related papers (2023-04-06T12:03:03Z)
- Fast Parallel Bayesian Network Structure Learning [37.46185698921754]
We propose a fast solution named Fast-BNS on multi-core CPUs to enhance the efficiency of BN structure learning.
Fast-BNS is powered by a series of efficiency optimizations, including grouping the conditional independence (CI) tests of edges that share endpoints to reduce the number of unnecessary CI tests.
A comprehensive experimental study shows that the sequential version of Fast-BNS is up to 50 times faster than its counterpart.
arXiv Detail & Related papers (2022-12-08T13:17:02Z)
- Asymmetric Scalable Cross-modal Hashing [51.309905690367835]
Cross-modal hashing is a successful method for solving the large-scale multimedia retrieval problem.
We propose a novel Asymmetric Scalable Cross-Modal Hashing (ASCMH) method to address the limitations of existing approaches.
Our ASCMH outperforms the state-of-the-art cross-modal hashing methods in terms of accuracy and efficiency.
arXiv Detail & Related papers (2022-07-26T04:38:47Z)
- Distributed Out-of-Memory NMF on CPU/GPU Architectures [1.0051474951635875]
We propose an efficient out-of-memory implementation of the Non-negative Matrix Factorization (NMF) algorithm for HPC systems.
Benchmark results show a significant 32x to 76x speedup with the new GPU implementation over the CPU-based NMFk.
arXiv Detail & Related papers (2022-02-19T03:49:21Z)
- Kernel methods through the roof: handling billions of points efficiently [94.31450736250918]
Kernel methods provide an elegant and principled approach to nonparametric learning, but so far they could hardly be used in large-scale problems.
Recent advances have shown the benefits of a number of algorithmic ideas, for example combining optimization, numerical linear algebra and random projections.
Here, we push these efforts further to develop and test a solver that takes full advantage of GPU hardware.
arXiv Detail & Related papers (2020-06-18T08:16:25Z)
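One standard random-projection idea in this line of work can be sketched with random Fourier features (Rahimi and Recht), which replace the dense N-by-N kernel matrix with explicit finite-dimensional features. This is an illustrative sketch of the general technique, not necessarily the solver described in the paper above; the function name and parameters are hypothetical:

```python
import numpy as np

def rff_features(X, num_features=256, gamma=1.0, seed=0):
    """Random Fourier features approximating the RBF kernel
    k(x, z) = exp(-gamma * ||x - z||^2).

    Returns Phi such that Phi @ Phi.T approximates the kernel matrix,
    so an O(N^2) kernel computation is replaced by linear algebra in
    num_features dimensions.
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Spectral sampling: w ~ N(0, 2*gamma*I) matches the RBF spectrum
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, num_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=num_features)
    return np.sqrt(2.0 / num_features) * np.cos(X @ W + b)
```

The approximation error decays roughly as 1/sqrt(num_features), which is why such projections can be combined with iterative solvers and GPU-friendly dense linear algebra to handle very large point sets.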
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.