Distributed Out-of-Memory NMF on CPU/GPU Architectures
- URL: http://arxiv.org/abs/2202.09518v4
- Date: Tue, 12 Sep 2023 23:16:07 GMT
- Title: Distributed Out-of-Memory NMF on CPU/GPU Architectures
- Authors: Ismael Boureima, Manish Bhattarai, Maksim Eren, Erik Skau, Philip
Romero, Stephan Eidenbenz, Boian Alexandrov
- Abstract summary: We propose an efficient out-of-memory implementation of the Non-negative Matrix Factorization (NMF) algorithm for HPC systems.
Benchmark results show a significant improvement of 32x to 76x speedup with the new implementation using GPUs over the CPU-based NMFk.
- Score: 1.0051474951635875
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose an efficient distributed out-of-memory implementation of the
Non-negative Matrix Factorization (NMF) algorithm for heterogeneous
high-performance-computing (HPC) systems. The proposed implementation is based
on prior work on NMFk, which can perform automatic model selection and extract
latent variables and patterns from data. In this work, we extend NMFk by adding
support for dense and sparse matrix operations on multi-node, multi-GPU systems.
The resulting algorithm is optimized for out-of-memory (OOM) problems where the
memory required to factorize a given matrix is greater than the available GPU
memory. Memory complexity is reduced by batching/tiling strategies, and sparse
and dense matrix operations are significantly accelerated with GPU cores (or
tensor cores when available). Input/Output (I/O) latency associated with batch
copies between host and device is hidden using CUDA streams to overlap data
transfers and compute asynchronously, and latency associated with collective
communications (both intra-node and inter-node) is reduced using optimized
NVIDIA Collective Communication Library (NCCL)-based communicators. Benchmark
results show significant improvement, from 32x to 76x speedup, with the new
implementation using GPUs over the CPU-based NMFk. Good weak scaling was
demonstrated on up to 4096 multi-GPU cluster nodes with approximately 25,000
GPUs when decomposing a dense 340 Terabyte-size matrix and an 11 Exabyte-size
sparse matrix of density 10e-6.
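The abstract's central memory-reduction idea is to tile the data matrix into row batches so that only one tile (plus the comparatively small factor matrices) needs to be resident at a time. The following is a minimal single-node NumPy sketch of such a row-batched multiplicative-update scheme; the batch size, update order, and the load_batch helper (a stand-in for streaming a tile from disk or host memory) are illustrative assumptions, not the authors' distributed multi-GPU NMFk implementation.

```python
import numpy as np

def load_batch(X, start, stop):
    # Stand-in for streaming a row tile from disk or host memory; in the
    # out-of-memory setting only rows [start, stop) would be materialized.
    return X[start:stop]

def batched_nmf(X, k, batch_rows=1024, iters=100, eps=1e-9, seed=0):
    """Row-batched multiplicative updates for X (m x n) ~= W (m x k) @ H (k x n)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, k))
    H = rng.random((k, n))
    for _ in range(iters):
        # H update: accumulate W^T X and W^T W one row tile at a time,
        # so the full X never has to be resident at once.
        WtX = np.zeros((k, n))
        WtW = np.zeros((k, k))
        for s in range(0, m, batch_rows):
            Xb = load_batch(X, s, s + batch_rows)
            Wb = W[s:s + batch_rows]
            WtX += Wb.T @ Xb
            WtW += Wb.T @ Wb
        H *= WtX / (WtW @ H + eps)
        # W update: each row tile of W depends only on its own tile of X
        # and the small k x k Gram matrix H H^T.
        HHt = H @ H.T
        for s in range(0, m, batch_rows):
            Xb = load_batch(X, s, s + batch_rows)
            Wb = W[s:s + batch_rows]
            W[s:s + batch_rows] = Wb * (Xb @ H.T) / (Wb @ HHt + eps)
    return W, H
```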
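The abstract also describes hiding host-to-device copy latency by overlapping transfers and compute on separate CUDA streams. Below is a minimal CuPy sketch of that double-buffering pattern applied to one NMF building block (accumulating W^T X over row tiles); the choice of CuPy, the two-buffer scheme, and all function names are assumptions for illustration only, and the sketch omits the NCCL-based multi-GPU collectives the paper relies on.

```python
import numpy as np
import cupy as cp

def pinned_like(shape, dtype):
    # Page-locked (pinned) host buffer so host->device copies can run
    # asynchronously on a dedicated CUDA stream.
    nbytes = int(np.prod(shape)) * np.dtype(dtype).itemsize
    mem = cp.cuda.alloc_pinned_memory(nbytes)
    return np.frombuffer(mem, dtype, int(np.prod(shape))).reshape(shape)

def streamed_wtx(X, W, batch_rows=4096):
    # Accumulates W^T X over row tiles of X, copying the next tile to the GPU
    # on one stream while the current tile is multiplied on another.
    m, n = X.shape
    k = W.shape[1]
    copy_stream = cp.cuda.Stream(non_blocking=True)
    comp_stream = cp.cuda.Stream(non_blocking=True)
    host = [pinned_like((batch_rows, n), X.dtype) for _ in range(2)]
    dev = [cp.empty((batch_rows, n), dtype=X.dtype) for _ in range(2)]
    copy_done = [cp.cuda.Event() for _ in range(2)]
    comp_done = [cp.cuda.Event() for _ in range(2)]
    W_d = cp.asarray(W)
    WtX = cp.zeros((k, n), dtype=X.dtype)
    starts = list(range(0, m, batch_rows))

    def issue_copy(i):
        b, s = i % 2, starts[i]
        rows = min(batch_rows, m - s)
        host[b][:rows] = X[s:s + rows]            # stage into pinned memory
        copy_stream.wait_event(comp_done[b])      # buffer may still be in use
        dev[b][:rows].set(host[b][:rows], stream=copy_stream)
        copy_done[b].record(copy_stream)

    issue_copy(0)
    for i, s in enumerate(starts):
        b = i % 2
        rows = min(batch_rows, m - s)
        if i + 1 < len(starts):
            issue_copy(i + 1)                     # overlap next copy with compute
        with comp_stream:
            comp_stream.wait_event(copy_done[b])
            WtX += W_d[s:s + rows].T @ dev[b][:rows]
            comp_done[b].record(comp_stream)
    comp_stream.synchronize()
    return cp.asnumpy(WtX)
```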
Related papers
- Efficient LLM Inference with I/O-Aware Partial KV Cache Recomputation [7.204881999658682]
Inference for Large Language Models (LLMs) is computationally demanding.
To reduce the cost of auto-regressive decoding, Key-Value (KV) caching is used to store intermediate activations.
The memory required for KV caching grows rapidly, often exceeding the capacity of GPU memory.
A cost-effective alternative is to offload KV cache to CPU memory, which alleviates GPU memory pressure but shifts the bottleneck to the limited bandwidth of the PCIe connection between the CPU and GPU.
arXiv Detail & Related papers (2024-11-26T04:03:14Z) - Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss [59.835032408496545]
We propose a tile-based strategy that partitions the contrastive loss calculation into arbitrary small blocks.
We also introduce a multi-level tiling strategy to leverage the hierarchical structure of distributed systems.
Compared to SOTA memory-efficient solutions, it achieves a two-order-of-magnitude reduction in memory while maintaining comparable speed.
arXiv Detail & Related papers (2024-10-22T17:59:30Z) - Boosting the effective performance of massively parallel tensor network
state algorithms on hybrid CPU-GPU based architectures via non-Abelian
symmetries [0.0]
Non-Abelian symmetry-related tensor algebra based on the Wigner-Eckart theorem is fully detached from the conventional tensor network layer.
We have achieved an order of magnitude increase in performance with respect to results reported in arXiv:2305.05581 in terms of computational complexity.
Our solution has an estimated effective performance of 250-500 TFLOPS.
arXiv Detail & Related papers (2023-09-23T07:49:53Z) - Towards Memory-Efficient Training for Extremely Large Output Spaces --
Learning with 500k Labels on a Single Commodity GPU [2.3224617218247134]
In classification problems with large output spaces (up to millions of labels), the last layer can require an enormous amount of memory.
Using sparse connectivity would drastically reduce the memory requirements, but it can substantially diminish the model's predictive performance.
We show that a proposed approach can scale to datasets with 670,000 labels on a single GPU with only 4GB memory.
arXiv Detail & Related papers (2023-06-06T14:44:52Z) - HEAT: A Highly Efficient and Affordable Training System for
Collaborative Filtering Based Recommendation on CPUs [11.007606356081435]
Collaborative filtering (CF) has been proven to be one of the most effective techniques for recommendation.
No prior work has optimized SimpleX on multi-core CPUs, leaving its performance limited.
We propose an efficient CF training system (called HEAT) that fully enables the multi-level caching and multi-threading capabilities of modern CPUs.
arXiv Detail & Related papers (2023-04-14T18:07:26Z) - Tensor Slicing and Optimization for Multicore NPUs [2.670309629218727]
This paper proposes a compiler optimization pass for Multicore NPUs, called Tensor Slicing Optimization (TSO).
Results show that TSO identifies the tensor slicing that minimizes execution time for a set of CNN models.
arXiv Detail & Related papers (2023-04-06T12:03:03Z) - Accelerating Barnes-Hut t-SNE Algorithm by Efficient Parallelization on
Multi-Core CPUs [59.18990342943095]
t-SNE remains one of the most popular embedding techniques for visualizing high-dimensional data.
The BH t-SNE algorithm is inefficient in existing CPU implementations.
Acc-t-SNE is up to 261x and 4x faster than scikit-learn and the state-of-the-art BH t-SNE implementation from daal4py.
arXiv Detail & Related papers (2022-12-22T06:38:40Z) - Instant Neural Graphics Primitives with a Multiresolution Hash Encoding [67.33850633281803]
We present a versatile new input encoding that permits the use of a smaller network without sacrificing quality.
A small neural network is augmented by a multiresolution hash table of trainable feature vectors whose values are optimized through gradient descent.
We achieve a combined speedup of several orders of magnitude, enabling training of high-quality neural graphics primitives in a matter of seconds.
arXiv Detail & Related papers (2022-01-16T07:22:47Z) - VersaGNN: a Versatile accelerator for Graph neural networks [81.1667080640009]
We propose VersaGNN, an ultra-efficient, systolic-array-based versatile hardware accelerator.
VersaGNN achieves on average a 3712x speedup with 1301.25x energy reduction over CPU, and a 35.4x speedup with 17.66x energy reduction over GPU.
arXiv Detail & Related papers (2021-05-04T04:10:48Z) - Efficient and Generic 1D Dilated Convolution Layer for Deep Learning [52.899995651639436]
We introduce our efficient implementation of a generic 1D convolution layer covering a wide range of parameters.
It is optimized for x86 CPU architectures, in particular, for architectures containing Intel AVX-512 and AVX-512 BFloat16 instructions.
We demonstrate the performance of our optimized 1D convolution layer by utilizing it in the end-to-end neural network training with real genomics datasets.
arXiv Detail & Related papers (2021-04-16T09:54:30Z) - DistGNN: Scalable Distributed Training for Large-Scale Graph Neural
Networks [58.48833325238537]
Full-batch training on Graph Neural Networks (GNN) to learn the structure of large graphs is a critical problem that needs to scale to hundreds of compute nodes to be feasible.
In this paper, we present DistGNN, which optimizes the well-known Deep Graph Library (DGL) for full-batch training on CPU clusters.
Our results on four common GNN benchmark datasets show up to 3.7x speed-up using a single CPU socket and up to 97x speed-up using 128 CPU sockets.
arXiv Detail & Related papers (2021-04-14T08:46:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.