Boosting the effective performance of massively parallel tensor network
state algorithms on hybrid CPU-GPU based architectures via non-Abelian
symmetries
- URL: http://arxiv.org/abs/2309.16724v1
- Date: Sat, 23 Sep 2023 07:49:53 GMT
- Title: Boosting the effective performance of massively parallel tensor network
state algorithms on hybrid CPU-GPU based architectures via non-Abelian
symmetries
- Authors: Andor Menczer and Örs Legeza
- Abstract summary: Non-Abelian symmetry related tensor algebra based on the Wigner-Eckart theorem is fully detached from the conventional tensor network layer.
We have achieved an order of magnitude increase in performance with respect to results reported in arXiv:2305.05581 in terms of computational complexity.
Our solution has an estimated effective performance of 250-500 TFLOPS.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present novel algorithmic solutions together with implementation details
utilizing non-Abelian symmetries in order to boost the current limits of tensor
network state algorithms on high performance computing infrastructure. In our
in-house developed hybrid CPU-multiGPU solution scheduling is decentralized,
threads are autonomous and inter-thread communications are solely limited to
interactions with globally visible lock-free constructs. Our custom tailored
virtual memory management ensures data is produced with high spatial locality,
which together with the use of specific sequences of strided batched matrix
operations translates to significantly higher overall throughput. In order to
lower IO overhead, an adaptive buffering technique is used to dynamically match
the level of data abstraction, at which cache repositories are built and
reused, to system resources. The non-Abelian symmetry related tensor algebra
based on the Wigner-Eckart theorem is fully detached from the conventional tensor
network layer, thus massively parallel matrix and tensor operations can be
performed without additional overheads. Altogether, we have achieved an order
of magnitude increase in performance with respect to results reported in
arXiv:2305.05581 in terms of computational complexity and at the same time a
factor of three to six in the actual performance measured in TFLOPS. Benchmark
results are presented on Hilbert space dimensions up to $2.88\times10^{36}$
obtained via large-scale SU(2) spin adapted density matrix renormalization
group simulations on selected strongly correlated molecular systems. These
demonstrate the utilization of NVIDIA's highly specialized tensor cores,
leading to performance around 110 TFLOPS on a single node supplied with eight
NVIDIA A100 devices. In comparison to U(1) implementations with matching
accuracy, our solution has an estimated effective performance of 250-500
TFLOPS.
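
The scheduling model named in the abstract (decentralized, autonomous threads, inter-thread communication limited to globally visible lock-free constructs) can be illustrated with a minimal C++ fragment. This is a sketch of the general pattern only, not the paper's implementation; every identifier in it is invented for illustration.

```cpp
// Hedged sketch of decentralized, lock-free work claiming: no central
// scheduler, threads are autonomous, and the only inter-thread interaction
// is a globally visible atomic counter. All names here are illustrative;
// none come from the paper.
#include <atomic>
#include <cstdio>
#include <functional>
#include <thread>
#include <vector>

struct Task { int sector_id; };       // e.g. one symmetry-sector contraction

std::atomic<size_t> next_task{0};     // globally visible, lock-free
std::atomic<size_t> tasks_done{0};

void worker(const std::vector<Task>& tasks) {
    for (;;) {
        // fetch_add atomically claims the next unprocessed task; threads
        // proceed at their own pace with no locks and no coordinator.
        size_t i = next_task.fetch_add(1, std::memory_order_relaxed);
        if (i >= tasks.size()) return;
        // ... process tasks[i].sector_id here ...
        tasks_done.fetch_add(1, std::memory_order_relaxed);
    }
}

int main() {
    std::vector<Task> tasks(64);
    for (int i = 0; i < 64; ++i) tasks[i].sector_id = i;
    std::vector<std::thread> pool;
    for (int t = 0; t < 8; ++t) pool.emplace_back(worker, std::cref(tasks));
    for (auto& th : pool) th.join();
    std::printf("processed %zu tasks\n", tasks_done.load());
}
```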
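The "specific sequences of strided batched matrix operations" map naturally onto cuBLAS strided batched GEMM, the standard NVIDIA API for issuing many equally shaped dense multiplications in a single launch. The sketch below shows the call shape under the assumption that same-shape symmetry sectors are packed contiguously (the high spatial locality the abstract mentions); the paper does not publish its kernels, so the batching scheme here is an assumption.

```cpp
// Hedged sketch: one strided-batched GEMM over `batch` equally shaped dense
// blocks laid out contiguously, e.g. same-shape sectors of a block-sparse
// symmetric tensor. Assumes CUDA and cuBLAS; error handling trimmed.
#include <cublas_v2.h>
#include <cuda_runtime.h>

void batched_contract(cublasHandle_t h,
                      const double* A, const double* B, double* C,
                      int m, int n, int k, int batch) {
    const double alpha = 1.0, beta = 0.0;
    long long strideA = (long long)m * k;   // contiguous block layout
    long long strideB = (long long)k * n;
    long long strideC = (long long)m * n;
    // C_i = A_i * B_i for i = 0..batch-1, issued as one call; on A100-class
    // devices cuBLAS may route FP64 GEMMs through the tensor cores.
    cublasDgemmStridedBatched(h, CUBLAS_OP_N, CUBLAS_OP_N,
                              m, n, k,
                              &alpha,
                              A, m, strideA,
                              B, k, strideB,
                              &beta,
                              C, m, strideC,
                              batch);
}
```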
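The adaptive buffering technique is not spelled out in the abstract beyond matching the granularity at which cache repositories are built to system resources. Purely as an assumption, a minimal version could choose that granularity from the device memory reported free at run time:

```cpp
// Hypothetical sketch of resource-adaptive buffering: pick how coarse a
// cache repository to build from the memory actually available. The levels
// and thresholds are invented for illustration.
#include <cuda_runtime.h>

enum class CacheLevel { Sector, Tensor, Sweep };  // finest to coarsest

CacheLevel pick_cache_level() {
    size_t free_bytes = 0, total_bytes = 0;
    cudaMemGetInfo(&free_bytes, &total_bytes);
    double free_frac = (double)free_bytes / (double)total_bytes;
    if (free_frac > 0.50) return CacheLevel::Sweep;   // cache whole sweeps
    if (free_frac > 0.20) return CacheLevel::Tensor;  // cache whole tensors
    return CacheLevel::Sector;                        // cache per-sector blocks
}
```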
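The claim that the symmetry layer can be fully detached rests on the Wigner-Eckart theorem, which in one common SU(2) convention factorizes every operator matrix element into a purely group-theoretic Clebsch-Gordan coefficient and a symmetry-independent reduced matrix element:

$$\langle j\,m \mid T^{k}_{q} \mid j'\,m' \rangle \;=\; \langle j'\,m';\,k\,q \mid j\,m \rangle\,\langle j \,\Vert\, T^{k} \,\Vert\, j' \rangle$$

All dependence on the magnetic quantum numbers $m$, $m'$, $q$ sits in the Clebsch-Gordan factor, so the heavy dense arithmetic can run on reduced matrix elements alone and be batched as in the GEMM sketch above.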
Related papers
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve a 1.45-9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z) - EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference [49.94169109038806]
This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE.
Our results demonstrate an average 21% improvement in prefill throughput over existing parallel inference methods.
arXiv Detail & Related papers (2024-10-16T05:17:49Z) - Massively Parallel Tensor Network State Algorithms on Hybrid CPU-GPU
Based Architectures [0.0]
We present novel algorithmic solutions together with implementation details to extend the current limits of TNS algorithms on HPC infrastructure.
Benchmark results are presented for selected strongly correlated molecular systems addressing problems on Hilbert space dimensions up to $2.88\times10^{36}$.
arXiv Detail & Related papers (2023-05-09T16:15:07Z) - Tensor Slicing and Optimization for Multicore NPUs [2.670309629218727]
This paper proposes a compiler optimization pass for Multicore NPUs, called Tensor Slicing Optimization (TSO).
Results show that TSO identifies the best tensor slicing that minimizes execution time for a set of CNN models.
arXiv Detail & Related papers (2023-04-06T12:03:03Z) - Performance Embeddings: A Similarity-based Approach to Automatic
Performance Optimization [71.69092462147292]
Performance embeddings enable knowledge transfer of performance tuning between applications.
We demonstrate this transfer tuning approach on case studies in deep neural networks, dense and sparse linear algebra compositions, and numerical weather prediction stencils.
arXiv Detail & Related papers (2023-03-14T15:51:35Z) - FPGA-based AI Smart NICs for Scalable Distributed AI Training Systems [62.20308752994373]
We propose a new smart network interface card (NIC) for distributed AI training systems using field-programmable gate arrays (FPGAs).
Our proposed FPGA-based AI smart NIC enhances overall training performance by 1.6x at 6 nodes, with an estimated 2.5x performance improvement at 32 nodes, compared to the baseline system using conventional NICs.
arXiv Detail & Related papers (2022-04-22T21:57:00Z) - Distributed Out-of-Memory NMF on CPU/GPU Architectures [1.0051474951635875]
We propose an efficient out-of-memory implementation of the Non-negative Matrix Factorization (NMF) algorithm for HPC systems.
Benchmark results show a significant speedup of 32x to 76x with the new GPU implementation over the CPU-based NMFk.
arXiv Detail & Related papers (2022-02-19T03:49:21Z) - Design and Scaffolded Training of an Efficient DNN Operator for Computer
Vision on the Edge [3.3767251810292955]
FuSeConv is a drop-in replacement for depthwise separable convolutions.
FuSeConv factorizes convolutions fully along their spatial and depth dimensions.
Neural Operator Scaffolding scaffolds the training of FuSeConv by distilling knowledge from depthwise separable convolutions.
arXiv Detail & Related papers (2021-08-25T19:22:25Z) - Partitioning sparse deep neural networks for scalable training and
inference [8.282177703075453]
State-of-the-art deep neural networks (DNNs) have significant computational and data management requirements.
Sparsification and pruning methods are shown to be effective in removing a large fraction of connections in DNNs.
The resulting sparse networks present unique challenges to further improve the computational efficiency of training and inference in deep learning.
arXiv Detail & Related papers (2021-04-23T20:05:52Z) - Communication-Efficient Distributed Stochastic AUC Maximization with
Deep Neural Networks [50.42141893913188]
We study distributed stochastic AUC maximization for deep neural networks.
Our method requires far fewer communication rounds while retaining theoretical convergence guarantees.
Experiments on several datasets demonstrate the effectiveness of our method and confirm the theory.
arXiv Detail & Related papers (2020-05-05T18:08:23Z) - Real-Time High-Performance Semantic Image Segmentation of Urban Street
Scenes [98.65457534223539]
We propose a real-time high-performance DCNN-based method for robust semantic segmentation of urban street scenes.
The proposed method achieves 73.6% and 68.0% mean Intersection over Union (mIoU) at inference speeds of 51.0 fps and 39.3 fps, respectively.
arXiv Detail & Related papers (2020-03-11T08:45:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.