Many-body computing on Field Programmable Gate Arrays
- URL: http://arxiv.org/abs/2402.06415v1
- Date: Fri, 9 Feb 2024 14:01:02 GMT
- Title: Many-body computing on Field Programmable Gate Arrays
- Authors: Songtai Lv, Yang Liang, Yuchen Meng, Xiaochen Yao, Jincheng Xu, Yang Liu, Qibin Zheng, Haiyuan Zou
- Abstract summary: We leverage the capabilities of Field Programmable Gate Arrays (FPGAs) for conducting quantum many-body calculations.
This has resulted in a remarkable tenfold speedup compared to CPU-based computation.
- Score: 5.612626580467746
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: A new implementation of many-body calculations is of paramount importance in
the field of computational physics. In this study, we leverage the capabilities
of Field Programmable Gate Arrays (FPGAs) for conducting quantum many-body
calculations. Through the design of appropriate schemes for Monte Carlo and
tensor network methods, we effectively utilize the parallel processing
capabilities provided by FPGAs. This has resulted in a remarkable tenfold
speedup compared to CPU-based computation for a Monte Carlo algorithm. We also
demonstrate, for the first time, the utilization of FPGA to accelerate a
typical tensor network algorithm. Our findings unambiguously highlight the
significant advantages of hardware implementation and pave the way for novel
approaches to many-body calculations.
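The abstract does not detail the Monte Carlo or tensor network schemes themselves, but the reason Monte Carlo updates map well onto FPGA parallelism can be illustrated with a standard checkerboard decomposition, in which all sites of one sublattice share no neighbors and can be updated simultaneously. The C++ sketch below is illustrative only; the lattice size, temperature, and Metropolis rule are generic textbook choices, not details taken from the paper.

```cpp
// Illustrative checkerboard Metropolis sweep for a 2D Ising model.
// Parameters (L, beta) are assumptions for the sketch, not from the
// paper. Sites of one checkerboard color share no neighbors, so they
// can be updated concurrently -- the property FPGA pipelines exploit.
#include <cmath>
#include <random>
#include <vector>

int main() {
    const int L = 64;          // lattice size (assumed)
    const double beta = 0.44;  // inverse temperature (assumed, near critical)
    std::vector<int> s(L * L, 1);
    std::mt19937 rng(12345);
    std::uniform_real_distribution<double> u(0.0, 1.0);

    auto idx = [L](int x, int y) { return ((x + L) % L) * L + (y + L) % L; };

    for (int sweep = 0; sweep < 1000; ++sweep) {
        for (int color = 0; color < 2; ++color) {
            // Every site of this color is independent of the others;
            // on an FPGA these updates can run in parallel.
            for (int x = 0; x < L; ++x)
                for (int y = 0; y < L; ++y) {
                    if ((x + y) % 2 != color) continue;
                    int nn = s[idx(x + 1, y)] + s[idx(x - 1, y)]
                           + s[idx(x, y + 1)] + s[idx(x, y - 1)];
                    double dE = 2.0 * s[idx(x, y)] * nn;  // cost of flipping
                    if (dE <= 0.0 || u(rng) < std::exp(-beta * dE))
                        s[idx(x, y)] = -s[idx(x, y)];
                }
        }
    }
    return 0;
}
```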
Related papers
- A High-Speed Hardware Algorithm for Modulus Operation and its Application in Prime Number Calculation [0.0]
The proposed algorithm uses only addition, subtraction, logical, and bit-shift operations.
It addresses scalability challenges in cryptographic applications.
The application of this algorithm in prime number calculation up to 500,000 shows its practical utility and performance advantages.
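The summary names only the primitive operations; a textbook shift-and-subtract reduction built from exactly those primitives might look like the sketch below. This is a generic construction, not necessarily the paper's hardware algorithm.

```cpp
// Generic shift-and-subtract modulus: computes a % m using only
// comparison, subtraction, and bit shifts (no division). A textbook
// construction, not necessarily the paper's exact circuit.
#include <cstdint>

uint64_t mod_no_div(uint64_t a, uint64_t m) {
    if (m == 0) return 0;  // undefined input; guard for the sketch
    while (a >= m) {
        uint64_t t = m;
        // Double the modulus as far as possible without overflowing
        // or exceeding a, then subtract the aligned multiple.
        while ((t << 1) > t && (t << 1) <= a) t <<= 1;
        a -= t;
    }
    return a;
}
```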
arXiv Detail & Related papers (2024-07-17T13:24:52Z)
- Fast, Scalable, Energy-Efficient Non-element-wise Matrix Multiplication on FPGA [10.630802853096462]
Modern Neural Network (NN) architectures heavily rely on vast numbers of multiply-accumulate arithmetic operations.
This paper proposes a high-throughput, scalable, and energy-efficient non-element-wise matrix multiplication unit on FPGAs.
Our AMU achieves up to 9x higher throughput and 112x higher energy efficiency than state-of-the-art FPGA-based Quantised Neural Network (QNN) accelerators.
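The summary does not say how element-wise multiplications are avoided; one common FPGA tactic consistent with the title is to replace multiplies with table lookups over low-bit quantized operands. The sketch below is a generic illustration of that idea, not the paper's AMU design.

```cpp
// Sketch of a multiplication-free dot product for low-bit operands:
// all 4-bit x 4-bit products are precomputed into a 256-entry table,
// so the inner loop uses only lookups and additions. A generic LUT
// technique, not necessarily the paper's AMU architecture.
#include <array>
#include <cstddef>
#include <cstdint>

// Operands are signed 4-bit values in [-8, 7], stored with offset 8,
// i.e. each byte holds a code in [0, 15].
std::array<int16_t, 256> build_product_lut() {
    std::array<int16_t, 256> lut{};
    for (int a = 0; a < 16; ++a)
        for (int b = 0; b < 16; ++b)
            lut[(a << 4) | b] = static_cast<int16_t>((a - 8) * (b - 8));
    return lut;
}

int32_t dot_lut(const uint8_t* x, const uint8_t* w, size_t n,
                const std::array<int16_t, 256>& lut) {
    int32_t acc = 0;
    for (size_t i = 0; i < n; ++i)
        acc += lut[(x[i] << 4) | w[i]];  // a lookup replaces a multiply
    return acc;
}
```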
arXiv Detail & Related papers (2024-07-02T15:28:10Z)
- Randomized Polar Codes for Anytime Distributed Machine Learning [66.46612460837147]
We present a novel distributed computing framework that is robust to slow compute nodes, and is capable of both approximate and exact computation of linear operations.
We propose a sequential decoding algorithm designed to handle real-valued data while maintaining low computational complexity for recovery.
We demonstrate the potential applications of this framework in various contexts, such as large-scale matrix multiplication and black-box optimization.
arXiv Detail & Related papers (2023-09-01T18:02:04Z)
- Reconfigurable Distributed FPGA Cluster Design for Deep Learning Accelerators [59.11160990637615]
We propose a distributed system based on low-power embedded FPGAs designed for edge computing applications.
The proposed system can simultaneously execute diverse Neural Network (NN) models, arrange the graph in a pipeline structure, and manually allocate greater resources to the most computationally intensive layers of the NN graph.
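As a toy illustration of pipeline-stage allocation of the kind described, the sketch below splits a chain of layers across a fixed number of devices by binary-searching the bottleneck stage cost. The cost model and balancing rule are invented for illustration; the paper's own allocation is manual and resource-aware.

```cpp
// Toy pipeline partitioner: split a chain of NN layers across a fixed
// number of FPGAs so that no stage's total cost exceeds a bound found
// by binary search. The cost model is invented for illustration.
#include <cstddef>
#include <vector>

// Can the layer chain be split into <= stages pieces, each with sum <= cap?
bool feasible(const std::vector<double>& cost, size_t stages, double cap) {
    size_t used = 1;
    double acc = 0.0;
    for (double c : cost) {
        if (c > cap) return false;
        if (acc + c > cap) { ++used; acc = c; }
        else acc += c;
    }
    return used <= stages;
}

// Minimum achievable bottleneck (most expensive stage) cost.
double min_stage_cost(const std::vector<double>& cost, size_t stages) {
    double lo = 0.0, hi = 0.0;
    for (double c : cost) { if (c > lo) lo = c; hi += c; }
    for (int it = 0; it < 60; ++it) {  // binary search on the bottleneck
        double mid = 0.5 * (lo + hi);
        (feasible(cost, stages, mid) ? hi : lo) = mid;
    }
    return hi;
}
```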
arXiv Detail & Related papers (2023-05-24T16:08:55Z)
- Decomposition of Matrix Product States into Shallow Quantum Circuits [62.5210028594015]
Tensor network (TN) algorithms can be mapped to parametrized quantum circuits (PQCs).
We propose a new protocol for approximating TN states using realistic quantum circuits.
Our results reveal that one particular protocol, involving sequential growth and optimization of the quantum circuit, outperforms all other methods.
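For reference, the matrix product state (MPS) form that such protocols approximate with circuits is the standard textbook one:

```latex
% Standard MPS form of an N-site state (textbook notation, not specific
% to the paper): each A^{[i] s_i} is a small matrix, s_i a physical index.
\[
  |\psi\rangle \;=\; \sum_{s_1,\dots,s_N}
  \mathrm{Tr}\!\left[ A^{[1]\,s_1} A^{[2]\,s_2} \cdots A^{[N]\,s_N} \right]
  |s_1 s_2 \cdots s_N\rangle
\]
```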
arXiv Detail & Related papers (2022-09-01T17:08:41Z)
- FPGA-based AI Smart NICs for Scalable Distributed AI Training Systems [62.20308752994373]
We propose a new smart network interface card (NIC) for distributed AI training systems using field-programmable gate arrays (FPGAs).
Our proposed FPGA-based AI smart NIC enhances overall training performance by 1.6x at 6 nodes, with an estimated 2.5x performance improvement at 32 nodes, compared to the baseline system using conventional NICs.
arXiv Detail & Related papers (2022-04-22T21:57:00Z)
- A Deep Learning Inference Scheme Based on Pipelined Matrix Multiplication Acceleration Design and Non-uniform Quantization [9.454905560571085]
We introduce a low-power Multi-layer Perceptron (MLP) accelerator based on a pipelined matrix multiplication scheme and a non-uniform quantization methodology.
Results show that our method achieves better performance with lower power consumption.
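The summary does not define the quantizer; one widely used non-uniform choice on hardware is power-of-two levels, since multiplying by a level reduces to a bit shift. The sketch below is a generic example of that scheme, not necessarily the paper's methodology.

```cpp
// Generic power-of-two (logarithmic) quantizer: a common non-uniform
// scheme on FPGAs because multiplying by a level reduces to a shift.
// The level range is illustrative, not the paper's exact design.
#include <algorithm>
#include <cmath>

// Map w to sign * 2^e with e clamped to [e_min, e_max];
// returns the dequantized value.
float quantize_pow2(float w, int e_min = -6, int e_max = 0) {
    if (w == 0.0f) return 0.0f;
    float sign = (w < 0.0f) ? -1.0f : 1.0f;
    int e = static_cast<int>(std::round(std::log2(std::fabs(w))));
    e = std::clamp(e, e_min, e_max);
    return sign * std::ldexp(1.0f, e);  // sign * 2^e
}
```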
arXiv Detail & Related papers (2021-10-10T17:31:27Z)
- Accelerated Charged Particle Tracking with Graph Neural Networks on FPGAs [0.0]
We develop and study FPGA implementations of algorithms for charged particle tracking based on graph neural networks.
We find that a considerable speedup over CPU-based execution is possible, potentially enabling such algorithms to be used effectively in future computing.
arXiv Detail & Related papers (2020-11-30T18:17:43Z)
- An FPGA Accelerated Method for Training Feed-forward Neural Networks Using Alternating Direction Method of Multipliers and LSMR [2.8747398859585376]
We have successfully designed, implemented, deployed, and tested a novel FPGA-accelerated algorithm for neural network training.
The training method is based on the Alternating Direction Method of Multipliers (ADMM), which has strong parallel characteristics.
We devised an FPGA-accelerated version of the algorithm using the Intel FPGA SDK for OpenCL and, after extensive optimization, successfully deployed the program on an Intel Arria 10 GX FPGA.
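For context, the standard scaled-form ADMM iteration for a split objective is shown below; this is the textbook formulation, not necessarily the exact one the paper applies to network training.

```latex
% Standard scaled-form ADMM updates for min f(x) + g(z) s.t. x = z
% (textbook form; the paper applies ADMM to feed-forward NN training).
\[
\begin{aligned}
  x^{k+1} &= \arg\min_x \Big( f(x) + \tfrac{\rho}{2}\,\lVert x - z^k + u^k \rVert_2^2 \Big),\\
  z^{k+1} &= \arg\min_z \Big( g(z) + \tfrac{\rho}{2}\,\lVert x^{k+1} - z + u^k \rVert_2^2 \Big),\\
  u^{k+1} &= u^k + x^{k+1} - z^{k+1}.
\end{aligned}
\]
```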
arXiv Detail & Related papers (2020-09-06T17:33:03Z)
- Coded Distributed Computing with Partial Recovery [56.08535873173518]
We introduce a novel coded matrix-vector multiplication scheme, called coded computation with partial recovery (CCPR).
CCPR reduces both the computation time and the decoding complexity by allowing a trade-off between the accuracy and the speed of computation.
We then extend this approach to distributed implementation of more general computation tasks by proposing a coded communication scheme with partial recovery.
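As a minimal illustration of coded computation that tolerates stragglers, the sketch below splits a matrix into two row blocks plus a parity block, so any two of three worker results reconstruct the full product exactly. The encoding is a toy example, not the paper's CCPR scheme.

```cpp
// Toy coded matrix-vector multiply: workers compute y1 = A1*x,
// y2 = A2*x, and yp = (A1 + A2)*x. Any two results recover A*x
// exactly; with fewer, only part of the product is available.
// Illustrates the general idea only, not the paper's CCPR encoding.
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// Recover A*x (row blocks stacked) from whichever workers finished;
// null pointers mark stragglers. Returns true on exact recovery.
bool decode(const Vec* y1, const Vec* y2, const Vec* yp, Vec& out) {
    if (!y1 && !y2 && !yp) return false;
    size_t h = y1 ? y1->size() : (y2 ? y2->size() : yp->size());
    out.assign(2 * h, 0.0);
    if (y1 && y2) {  // both systematic blocks arrived
        for (size_t i = 0; i < h; ++i) { out[i] = (*y1)[i]; out[h + i] = (*y2)[i]; }
        return true;
    }
    if (y1 && yp) {  // A2*x = parity - A1*x
        for (size_t i = 0; i < h; ++i) { out[i] = (*y1)[i]; out[h + i] = (*yp)[i] - (*y1)[i]; }
        return true;
    }
    if (y2 && yp) {  // A1*x = parity - A2*x
        for (size_t i = 0; i < h; ++i) { out[i] = (*yp)[i] - (*y2)[i]; out[h + i] = (*y2)[i]; }
        return true;
    }
    return false;  // a single result gives partial information at best
}
```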
arXiv Detail & Related papers (2020-07-04T21:34:49Z)
- Minimal Filtering Algorithms for Convolutional Neural Networks [82.24592140096622]
We develop fully parallel hardware-oriented algorithms for implementing the basic filtering operation for M = 3, 5, 7, 9, and 11.
A fully parallel hardware implementation of the proposed algorithms in each case gives approximately 30 percent savings in the number of embedded multipliers.
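The flavor of such minimal algorithms is captured by the classical Winograd construction F(2,3), which produces two outputs of a 3-tap filter with four multiplications instead of six. The sketch below shows only the M = 3 case; the paper develops algorithms for larger M.

```cpp
// Classical Winograd minimal filtering F(2,3): two outputs of a 3-tap
// FIR filter using 4 multiplications instead of 6. Textbook M = 3
// construction; the paper covers M up to 11.
#include <array>

// d: 4 consecutive inputs, g: 3 filter taps; returns outputs {y0, y1}.
std::array<double, 2> winograd_f23(const std::array<double, 4>& d,
                                   const std::array<double, 3>& g) {
    double m1 = (d[0] - d[2]) * g[0];
    double m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) * 0.5;
    double m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) * 0.5;
    double m4 = (d[1] - d[3]) * g[2];
    return { m1 + m2 + m3,    // y0 = d0*g0 + d1*g1 + d2*g2
             m2 - m3 - m4 };  // y1 = d1*g0 + d2*g1 + d3*g2
}
```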
arXiv Detail & Related papers (2020-04-12T13:18:25Z)