PACOX: A FPGA-based Pauli Composer Accelerator for Pauli String Computation
- URL: http://arxiv.org/abs/2601.04827v1
- Date: Thu, 08 Jan 2026 11:04:57 GMT
- Title: PACOX: A FPGA-based Pauli Composer Accelerator for Pauli String Computation
- Authors: Tran Xuan Hieu Le, Tuan Hai Vu, Vu Trung Duong Le, Hoai Luan Pham, Yasuhiko Nakashima
- Abstract summary: Pauli strings are a computational primitive in hybrid quantum-classical algorithms. PACOX is the first dedicated FPGA-based accelerator for Pauli strings. Experiments show that PACOX achieves speedups of up to 100 times compared with state-of-the-art CPU-based methods.
- Score: 0.8481798330936976
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pauli strings are a fundamental computational primitive in hybrid quantum-classical algorithms. However, classical computation of Pauli strings suffers from exponential complexity and quickly becomes a performance bottleneck as the number of qubits increases. To address this challenge, this paper proposes the Pauli Composer Accelerator (PACOX), the first dedicated FPGA-based accelerator for Pauli string computation. PACOX employs a compact binary encoding with XOR-based index permutation and phase accumulation. Based on this formulation, we design a parallel and pipelined processing element (PE) cluster architecture that efficiently exploits data-level parallelism on FPGA. Experimental results on a Xilinx ZCU102 FPGA show that PACOX operates at 250 MHz with a dynamic power consumption of 0.33 W, using 8,052 LUTs, 10,934 FFs, and 324 BRAMs. For Pauli strings of up to 19 qubits, PACOX achieves speedups of up to 100 times compared with state-of-the-art CPU-based methods, while requiring significantly less memory and achieving a much lower power-delay product. These results demonstrate that PACOX delivers high computational speed with superior energy efficiency for Pauli-based workloads in hybrid quantum-classical systems.
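The abstract mentions a compact binary encoding with XOR-based index permutation and phase accumulation. The following is a minimal software sketch of that general idea, not the PACOX hardware design: it assumes the common convention that the leftmost character of the string acts on the most significant qubit, and the names `encode`, `compose`, and `to_dense` are hypothetical helpers introduced only for illustration. It relies on the identity Y = iXZ, so each row of the matrix has exactly one nonzero entry whose column is the row index XORed with a bit mask and whose value is a power of i times a parity sign.

```python
# Illustrative sketch only; helper names and bit ordering are assumptions,
# not taken from the PACOX paper.

PHASES = (1, 1j, -1, -1j)  # i**k for k = 0..3


def encode(pauli):
    """Encode a Pauli string as (x_mask, z_mask, n_y).

    x_mask has a 1 where the operator flips the bit (X or Y),
    z_mask has a 1 where the operator adds a sign (Z or Y),
    n_y counts Y operators (each contributes a factor of i).
    """
    x_mask = z_mask = n_y = 0
    for ch in pauli:
        if ch not in "IXYZ":
            raise ValueError(f"invalid Pauli symbol: {ch!r}")
        x_mask <<= 1
        z_mask <<= 1
        if ch in "XY":
            x_mask |= 1
        if ch in "ZY":
            z_mask |= 1
        if ch == "Y":
            n_y += 1
    return x_mask, z_mask, n_y


def compose(pauli):
    """Return (cols, vals): the single nonzero column and value of each row."""
    n = len(pauli)
    x_mask, z_mask, n_y = encode(pauli)
    phase = PHASES[n_y % 4]  # accumulated phase from all Y operators
    cols, vals = [], []
    for row in range(1 << n):
        col = row ^ x_mask  # XOR-based index permutation
        odd = bin(col & z_mask).count("1") & 1  # parity of sign bits
        cols.append(col)
        vals.append(-phase if odd else phase)
    return cols, vals


def to_dense(pauli):
    """Expand the sparse description into a dense list-of-lists matrix."""
    cols, vals = compose(pauli)
    dim = len(cols)
    dense = [[0] * dim for _ in range(dim)]
    for row, (col, val) in enumerate(zip(cols, vals)):
        dense[row][col] = val
    return dense
```

For example, `compose("XY")` places the nonzero entry of row 0 in column 3. Storing only one column index and one value per row is what keeps the memory footprint at 2^n entries rather than 4^n for the dense matrix.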
Related papers
- PauliEngine: High-Performant Symbolic Arithmetic for Quantum Operations [39.36424353588699]
We introduce PauliEngine, a high-performance C++ framework that provides efficient primitives for Pauli string, commutators, symbolic phase tracking, and structural transformations. PauliEngine supports both numerical and symbolic coefficients and is accessible through a Python interface.
arXiv Detail & Related papers (2026-01-05T16:00:44Z) - Eliminating Multi-GPU Performance Taxes: A Systems Approach to Efficient Distributed LLMs [61.953548065938385]
We introduce the "Three Taxes" (Bulk Synchronous, Inter-Kernel Data Locality, and Kernel Launch Overhead) as an analytical framework. We propose moving beyond the rigid BSP model to address key inefficiencies in distributed GPU execution. We observe a 10-20% speedup in end-to-end latency over BSP-based approaches.
arXiv Detail & Related papers (2025-11-04T01:15:44Z) - Pushing the Envelope of LLM Inference on AI-PC [45.081663877447816]
Ultra-low-bit models (1/1.58/2-bit) match the perplexity and end-task performance of their full-precision counterparts using the same model size. The computational efficiency of state-of-the-art inference runtimes (e.g. bitnet) used to deploy them remains underexplored. We take a bottom-up approach: we first design and implement 1-bit and 2-bit microkernels optimized for modern CPUs, achieving peak computational efficiency. We present end-to-end inference results with 2-bit models that outperform the current SOTA runtime bitnet
arXiv Detail & Related papers (2025-08-08T23:33:38Z) - On-Device Qwen2.5: Efficient LLM Inference with Model Compression and Hardware Acceleration [1.9965524232168244]
This paper presents an efficient framework for deploying the Qwen2.5-0.5B model on the Xilinx Kria KV260 edge platform. We propose a hybrid execution strategy that intelligently offloads compute-intensive operations to the FPGA while utilizing the CPU for lighter tasks. Our framework achieves a model compression rate of 55.08% compared to the original model and produces output at a rate of 5.1 tokens per second, outperforming the baseline performance of 2.8 tokens per second.
arXiv Detail & Related papers (2025-04-24T08:50:01Z) - APB: Accelerating Distributed Long-Context Inference by Passing Compressed Context Blocks across GPUs [81.5049387116454]
We introduce APB, an efficient long-context inference framework. APB uses multi-host approximate attention to enhance prefill speed. APB achieves speeds of up to 9.2x, 4.2x, and 1.6x compared with FlashAttn, RingAttn, and StarAttn, respectively.
arXiv Detail & Related papers (2025-02-17T17:59:56Z) - HEPPO-GAE: Hardware-Efficient Proximal Policy Optimization with Generalized Advantage Estimation [0.0]
HEPPO-GAE is an FPGA-based accelerator designed to optimize the Generalized Advantage Estimation stage in Proximal Policy Optimization. Key innovation is our strategic standardization technique, which combines dynamic reward standardization and block standardization for values, followed by 8-bit uniform quantization. Our single-chip solution minimizes communication latency and throughput bottlenecks, significantly boosting PPO training efficiency.
arXiv Detail & Related papers (2025-01-22T08:18:56Z) - Design of an FPGA-Based Neutral Atom Rearrangement Accelerator for Quantum Computing [1.003635085077511]
Neutral atoms have emerged as a promising technology for implementing quantum computers.
We propose a novel quadrant-based rearrangement algorithm that employs a divide-and-conquer strategy and also enables the simultaneous movement of multiple atoms.
This is the first hardware acceleration work for atom rearrangement, and it significantly reduces the processing time.
arXiv Detail & Related papers (2024-11-19T10:38:21Z) - On the Constant Depth Implementation of Pauli Exponentials [49.48516314472825]
We decompose $Z^{\otimes n}$ exponentials of arbitrary length into circuits of constant depth using $\mathcal{O}(n)$ ancillae and two-body XX and ZZ interactions. We prove the correctness of our approach, after introducing novel rewrite rules for circuits which benefit from qubit recycling.
arXiv Detail & Related papers (2024-08-15T17:09:08Z) - Design optimization for high-performance computing using FPGA [0.0]
We optimize Tensil AI's open-source inference accelerator for maximum performance using ResNet20 trained on CIFAR.
Running the CIFAR test data set shows very little accuracy drop when rounding down from the original 32-bit floating point.
The proposed accelerator achieves a throughput of 21.12 Giga-Operations Per Second (GOP/s) with a 5.21 W on-chip power consumption at 100 MHz.
arXiv Detail & Related papers (2023-04-24T22:20:42Z) - RAMP: A Flat Nanosecond Optical Network and MPI Operations for
Distributed Deep Learning Systems [68.8204255655161]
We introduce a near-exascale, full-bisection bandwidth, all-to-all, single-hop, all-optical network architecture with nanosecond reconfiguration called RAMP.
RAMP supports large-scale distributed and parallel computing systems (12.8 Tbps per node for up to 65,536 nodes).
arXiv Detail & Related papers (2022-11-28T11:24:51Z) - Decomposition of Matrix Product States into Shallow Quantum Circuits [62.5210028594015]
Tensor network (TN) algorithms can be mapped to parametrized quantum circuits (PQCs).
We propose a new protocol for approximating TN states using realistic quantum circuits.
Our results reveal one particular protocol, involving sequential growth and optimization of the quantum circuit, to outperform all other methods.
arXiv Detail & Related papers (2022-09-01T17:08:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.