Many-body computing on Field Programmable Gate Arrays
- URL: http://arxiv.org/abs/2402.06415v1
- Date: Fri, 9 Feb 2024 14:01:02 GMT
- Title: Many-body computing on Field Programmable Gate Arrays
- Authors: Songtai Lv, Yang Liang, Yuchen Meng, Xiaochen Yao, Jincheng Xu, Yang Liu, Qibin Zheng, Haiyuan Zou
- Abstract summary: We leverage the capabilities of Field Programmable Gate Arrays (FPGAs) for conducting quantum many-body calculations.
This has resulted in a remarkable tenfold speedup compared to CPU-based computation.
- Score: 5.612626580467746
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: A new implementation of many-body calculations is of paramount importance in
the field of computational physics. In this study, we leverage the capabilities
of Field Programmable Gate Arrays (FPGAs) for conducting quantum many-body
calculations. Through the design of appropriate schemes for Monte Carlo and
tensor network methods, we effectively utilize the parallel processing
capabilities provided by FPGAs. This has resulted in a remarkable tenfold
speedup compared to CPU-based computation for a Monte Carlo algorithm. We also
demonstrate, for the first time, the utilization of FPGA to accelerate a
typical tensor network algorithm. Our findings unambiguously highlight the
significant advantages of hardware implementation and pave the way for novel
approaches to many-body calculations.
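The abstract does not detail the Monte Carlo or tensor network schemes themselves, but the reason Monte Carlo updates map well onto FPGA parallelism can be illustrated with a standard checkerboard decomposition, in which all sites of one sublattice share no neighbors and can be updated simultaneously. The C++ sketch below is illustrative only; the lattice size, temperature, and Metropolis rule are generic textbook choices, not details taken from the paper.

```cpp
// Illustrative checkerboard Metropolis sweep for a 2D Ising model.
// Parameters (L, beta) are assumptions for the sketch, not from the
// paper. Sites of one checkerboard color share no neighbors, so they
// can be updated concurrently -- the property FPGA pipelines exploit.
#include <cmath>
#include <random>
#include <vector>

int main() {
    const int L = 64;          // lattice size (assumed)
    const double beta = 0.44;  // inverse temperature (assumed, near critical)
    std::vector<int> s(L * L, 1);
    std::mt19937 rng(12345);
    std::uniform_real_distribution<double> u(0.0, 1.0);

    auto idx = [L](int x, int y) { return ((x + L) % L) * L + (y + L) % L; };

    for (int sweep = 0; sweep < 1000; ++sweep) {
        for (int color = 0; color < 2; ++color) {
            // Every site of this color is independent of the others;
            // on an FPGA these updates can run in parallel.
            for (int x = 0; x < L; ++x)
                for (int y = 0; y < L; ++y) {
                    if ((x + y) % 2 != color) continue;
                    int nn = s[idx(x + 1, y)] + s[idx(x - 1, y)]
                           + s[idx(x, y + 1)] + s[idx(x, y - 1)];
                    double dE = 2.0 * s[idx(x, y)] * nn;  // cost of flipping
                    if (dE <= 0.0 || u(rng) < std::exp(-beta * dE))
                        s[idx(x, y)] = -s[idx(x, y)];
                }
        }
    }
    return 0;
}
```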
Related papers
- A High-Speed Hardware Algorithm for Modulus Operation and its Application in Prime Number Calculation [0.0]
The proposed algorithm uses only addition, subtraction, logical, and bit-shift operations.
It addresses scalability challenges in cryptographic applications.
The application of this algorithm in prime number calculation up to 500,000 shows its practical utility and performance advantages.
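The summary names only the primitive operations; a textbook shift-and-subtract reduction built from exactly those primitives might look like the sketch below. This is a generic construction, not necessarily the paper's hardware algorithm.

```cpp
// Generic shift-and-subtract modulus: computes a % m using only
// comparison, subtraction, and bit shifts (no division). A textbook
// construction, not necessarily the paper's exact circuit.
#include <cstdint>

uint64_t mod_no_div(uint64_t a, uint64_t m) {
    if (m == 0) return 0;  // undefined input; guard for the sketch
    while (a >= m) {
        uint64_t t = m;
        // Double the modulus as far as possible without overflowing
        // or exceeding a, then subtract the aligned multiple.
        while ((t << 1) > t && (t << 1) <= a) t <<= 1;
        a -= t;
    }
    return a;
}
```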
arXiv Detail & Related papers (2024-07-17T13:24:52Z)
- Fast, Scalable, Energy-Efficient Non-element-wise Matrix Multiplication on FPGA [10.630802853096462]
Modern Neural Network (NN) architectures heavily rely on vast numbers of multiply-accumulate arithmetic operations.
This paper proposes a high-throughput, scalable, and energy-efficient non-element-wise matrix multiplication unit on FPGAs.
Our AMU achieves up to 9x higher throughput and 112x higher energy efficiency than state-of-the-art FPGA-based Quantised Neural Network (QNN) accelerators.
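The summary does not say how element-wise multiplications are avoided; one common FPGA tactic consistent with the title is to replace multiplies with table lookups over low-bit quantized operands. The sketch below is a generic illustration of that idea, not the paper's AMU design.

```cpp
// Sketch of a multiplication-free dot product for low-bit operands:
// all 4-bit x 4-bit products are precomputed into a 256-entry table,
// so the inner loop uses only lookups and additions. A generic LUT
// technique, not necessarily the paper's AMU architecture.
#include <array>
#include <cstddef>
#include <cstdint>

// Operands are signed 4-bit values in [-8, 7], stored with offset 8,
// i.e. each byte holds a code in [0, 15].
std::array<int16_t, 256> build_product_lut() {
    std::array<int16_t, 256> lut{};
    for (int a = 0; a < 16; ++a)
        for (int b = 0; b < 16; ++b)
            lut[(a << 4) | b] = static_cast<int16_t>((a - 8) * (b - 8));
    return lut;
}

int32_t dot_lut(const uint8_t* x, const uint8_t* w, size_t n,
                const std::array<int16_t, 256>& lut) {
    int32_t acc = 0;
    for (size_t i = 0; i < n; ++i)
        acc += lut[(x[i] << 4) | w[i]];  // a lookup replaces a multiply
    return acc;
}
```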
arXiv Detail & Related papers (2024-07-02T15:28:10Z)
- Randomized Polar Codes for Anytime Distributed Machine Learning [66.46612460837147]
We present a novel distributed computing framework that is robust to slow compute nodes, and is capable of both approximate and exact computation of linear operations.
We propose a sequential decoding algorithm designed to handle real-valued data while maintaining low computational complexity for recovery.
We demonstrate the potential applications of this framework in various contexts, such as large-scale matrix multiplication and black-box optimization.
arXiv Detail & Related papers (2023-09-01T18:02:04Z)
- Reconfigurable Distributed FPGA Cluster Design for Deep Learning Accelerators [59.11160990637615]
We propose a distributed system based on low-power embedded FPGAs designed for edge computing applications.
The proposed system can simultaneously execute diverse Neural Network (NN) models, arrange the graph in a pipeline structure, and manually allocate greater resources to the most computationally intensive layers of the NN graph.
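As a toy illustration of pipeline-stage allocation of the kind described, the sketch below splits a chain of layers across a fixed number of devices by binary-searching the bottleneck stage cost. The cost model and balancing rule are invented for illustration; the paper's own allocation is manual and resource-aware.

```cpp
// Toy pipeline partitioner: split a chain of NN layers across a fixed
// number of FPGAs so that no stage's total cost exceeds a bound found
// by binary search. The cost model is invented for illustration.
#include <cstddef>
#include <vector>

// Can the layer chain be split into <= stages pieces, each with sum <= cap?
bool feasible(const std::vector<double>& cost, size_t stages, double cap) {
    size_t used = 1;
    double acc = 0.0;
    for (double c : cost) {
        if (c > cap) return false;
        if (acc + c > cap) { ++used; acc = c; }
        else acc += c;
    }
    return used <= stages;
}

// Minimum achievable bottleneck (most expensive stage) cost.
double min_stage_cost(const std::vector<double>& cost, size_t stages) {
    double lo = 0.0, hi = 0.0;
    for (double c : cost) { if (c > lo) lo = c; hi += c; }
    for (int it = 0; it < 60; ++it) {  // binary search on the bottleneck
        double mid = 0.5 * (lo + hi);
        (feasible(cost, stages, mid) ? hi : lo) = mid;
    }
    return hi;
}
```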
arXiv Detail & Related papers (2023-05-24T16:08:55Z)
- Decomposition of Matrix Product States into Shallow Quantum Circuits [62.5210028594015]
Tensor network (TN) algorithms can be mapped to parametrized quantum circuits (PQCs).
We propose a new protocol for approximating TN states using realistic quantum circuits.
Our results reveal that one particular protocol, involving sequential growth and optimization of the quantum circuit, outperforms all other methods.
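For reference, the matrix product state (MPS) form that such protocols approximate with circuits is the standard textbook one:

```latex
% Standard MPS form of an N-site state (textbook notation, not specific
% to the paper): each A^{[i] s_i} is a small matrix, s_i a physical index.
\[
  |\psi\rangle \;=\; \sum_{s_1,\dots,s_N}
  \mathrm{Tr}\!\left[ A^{[1]\,s_1} A^{[2]\,s_2} \cdots A^{[N]\,s_N} \right]
  |s_1 s_2 \cdots s_N\rangle
\]
```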
arXiv Detail & Related papers (2022-09-01T17:08:41Z)
- FPGA-based AI Smart NICs for Scalable Distributed AI Training Systems [62.20308752994373]
We propose a new smart network interface card (NIC) for distributed AI training systems using field-programmable gate arrays (FPGAs).
Our proposed FPGA-based AI smart NIC enhances overall training performance by 1.6x at 6 nodes, with an estimated 2.5x performance improvement at 32 nodes, compared to the baseline system using conventional NICs.
arXiv Detail & Related papers (2022-04-22T21:57:00Z)
- A Deep Learning Inference Scheme Based on Pipelined Matrix Multiplication Acceleration Design and Non-uniform Quantization [9.454905560571085]
We introduce a low-power Multi-layer Perceptron (MLP) accelerator based on a pipelined matrix multiplication scheme and a non-uniform quantization methodology.
Results show that our method achieves better performance with lower power consumption.
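The summary does not define the quantizer; one widely used non-uniform choice on hardware is power-of-two levels, since multiplying by a level reduces to a bit shift. The sketch below is a generic example of that scheme, not necessarily the paper's methodology.

```cpp
// Generic power-of-two (logarithmic) quantizer: a common non-uniform
// scheme on FPGAs because multiplying by a level reduces to a shift.
// The level range is illustrative, not the paper's exact design.
#include <algorithm>
#include <cmath>

// Map w to sign * 2^e with e clamped to [e_min, e_max];
// returns the dequantized value.
float quantize_pow2(float w, int e_min = -6, int e_max = 0) {
    if (w == 0.0f) return 0.0f;
    float sign = (w < 0.0f) ? -1.0f : 1.0f;
    int e = static_cast<int>(std::round(std::log2(std::fabs(w))));
    e = std::clamp(e, e_min, e_max);
    return sign * std::ldexp(1.0f, e);  // sign * 2^e
}
```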
arXiv Detail & Related papers (2021-10-10T17:31:27Z)
- Accelerated Charged Particle Tracking with Graph Neural Networks on FPGAs [0.0]
We develop and study FPGA implementations of algorithms for charged particle tracking based on graph neural networks.
We find that a considerable speedup over CPU-based execution is possible, potentially enabling such algorithms to be used effectively in future computing.
arXiv Detail & Related papers (2020-11-30T18:17:43Z)
- An FPGA Accelerated Method for Training Feed-forward Neural Networks Using Alternating Direction Method of Multipliers and LSMR [2.8747398859585376]
We have successfully designed, implemented, deployed, and tested a novel FPGA-accelerated algorithm for neural network training.
The training method is based on the Alternating Direction Method of Multipliers (ADMM), which has strong parallel characteristics.
We devised an FPGA-accelerated version of the algorithm using the Intel FPGA SDK for OpenCL and, after extensive optimization, successfully deployed the program on an Intel Arria 10 GX FPGA.
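For context, the standard scaled-form ADMM iteration for a split objective is shown below; this is the textbook formulation, not necessarily the exact one the paper applies to network training.

```latex
% Standard scaled-form ADMM updates for min f(x) + g(z) s.t. x = z
% (textbook form; the paper applies ADMM to feed-forward NN training).
\[
\begin{aligned}
  x^{k+1} &= \arg\min_x \Big( f(x) + \tfrac{\rho}{2}\,\lVert x - z^k + u^k \rVert_2^2 \Big),\\
  z^{k+1} &= \arg\min_z \Big( g(z) + \tfrac{\rho}{2}\,\lVert x^{k+1} - z + u^k \rVert_2^2 \Big),\\
  u^{k+1} &= u^k + x^{k+1} - z^{k+1}.
\end{aligned}
\]
```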
arXiv Detail & Related papers (2020-09-06T17:33:03Z)
- Coded Distributed Computing with Partial Recovery [56.08535873173518]
We introduce a novel coded matrix-vector multiplication scheme, called coded computation with partial recovery (CCPR).
CCPR reduces both the computation time and the decoding complexity by allowing a trade-off between the accuracy and the speed of computation.
We then extend this approach to distributed implementation of more general computation tasks by proposing a coded communication scheme with partial recovery.
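As a minimal illustration of coded computation that tolerates stragglers, the sketch below splits a matrix into two row blocks plus a parity block, so any two of three worker results reconstruct the full product exactly. The encoding is a toy example, not the paper's CCPR scheme.

```cpp
// Toy coded matrix-vector multiply: workers compute y1 = A1*x,
// y2 = A2*x, and yp = (A1 + A2)*x. Any two results recover A*x
// exactly; with fewer, only part of the product is available.
// Illustrates the general idea only, not the paper's CCPR encoding.
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// Recover A*x (row blocks stacked) from whichever workers finished;
// null pointers mark stragglers. Returns true on exact recovery.
bool decode(const Vec* y1, const Vec* y2, const Vec* yp, Vec& out) {
    if (!y1 && !y2 && !yp) return false;
    size_t h = y1 ? y1->size() : (y2 ? y2->size() : yp->size());
    out.assign(2 * h, 0.0);
    if (y1 && y2) {  // both systematic blocks arrived
        for (size_t i = 0; i < h; ++i) { out[i] = (*y1)[i]; out[h + i] = (*y2)[i]; }
        return true;
    }
    if (y1 && yp) {  // A2*x = parity - A1*x
        for (size_t i = 0; i < h; ++i) { out[i] = (*y1)[i]; out[h + i] = (*yp)[i] - (*y1)[i]; }
        return true;
    }
    if (y2 && yp) {  // A1*x = parity - A2*x
        for (size_t i = 0; i < h; ++i) { out[i] = (*yp)[i] - (*y2)[i]; out[h + i] = (*y2)[i]; }
        return true;
    }
    return false;  // a single result gives partial information at best
}
```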
arXiv Detail & Related papers (2020-07-04T21:34:49Z)
- Minimal Filtering Algorithms for Convolutional Neural Networks [82.24592140096622]
We develop fully parallel hardware-oriented algorithms for implementing the basic filtering operation for M = 3, 5, 7, 9, and 11.
A fully parallel hardware implementation of the proposed algorithms in each case gives approximately 30 percent savings in the number of embedded multipliers.
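The flavor of such minimal algorithms is captured by the classical Winograd construction F(2,3), which produces two outputs of a 3-tap filter with four multiplications instead of six. The sketch below shows only the M = 3 case; the paper develops algorithms for larger M.

```cpp
// Classical Winograd minimal filtering F(2,3): two outputs of a 3-tap
// FIR filter using 4 multiplications instead of 6. Textbook M = 3
// construction; the paper covers M up to 11.
#include <array>

// d: 4 consecutive inputs, g: 3 filter taps; returns outputs {y0, y1}.
std::array<double, 2> winograd_f23(const std::array<double, 4>& d,
                                   const std::array<double, 3>& g) {
    double m1 = (d[0] - d[2]) * g[0];
    double m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) * 0.5;
    double m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) * 0.5;
    double m4 = (d[1] - d[3]) * g[2];
    return { m1 + m2 + m3,    // y0 = d0*g0 + d1*g1 + d2*g2
             m2 - m3 - m4 };  // y1 = d1*g0 + d2*g1 + d3*g2
}
```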
arXiv Detail & Related papers (2020-04-12T13:18:25Z)