APINT: A Full-Stack Framework for Acceleration of Privacy-Preserving Inference of Transformers based on Garbled Circuits
- URL: http://arxiv.org/abs/2502.16877v1
- Date: Mon, 24 Feb 2025 06:26:42 GMT
- Title: APINT: A Full-Stack Framework for Acceleration of Privacy-Preserving Inference of Transformers based on Garbled Circuits
- Authors: Hyunjun Cho, Jaeho Jeon, Jaehoon Heo, Joo-Young Kim
- Abstract summary: APINT is a full-stack framework designed to reduce PiT's overall latency. APINT features a novel protocol that reallocates possible GC workloads to alternative methods. It also introduces GC-friendly circuit generation that minimizes the number of AND gates, the most expensive operator in GC.
- Score: 1.0499611180329804
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: As the importance of Privacy-Preserving Inference of Transformers (PiT) increases, a hybrid protocol that integrates Garbled Circuits (GC) and Homomorphic Encryption (HE) is emerging for its implementation. While this protocol is preferred for its ability to maintain accuracy, it has a severe drawback of excessive latency. To address this, existing protocols primarily focused on reducing HE latency, thus making GC the new latency bottleneck. Furthermore, previous studies only targeted individual computing layers, such as the protocol or the hardware accelerator, lacking a comprehensive solution at the system level. This paper presents APINT, a full-stack framework designed to reduce PiT's overall latency by addressing the latency problem of GC through both software and hardware solutions. APINT features a novel protocol that reallocates possible GC workloads to alternative methods (i.e., HE or standard matrix operation), substantially decreasing the GC workload. It also introduces GC-friendly circuit generation that minimizes the number of AND gates, the most expensive operator in GC. Furthermore, APINT proposes an innovative netlist scheduling that combines coarse-grained operation mapping and fine-grained scheduling for maximal data reuse and minimal dependency. Finally, APINT's hardware accelerator, combined with its compiler speculation, effectively resolves the memory stall issue. Putting it all together, APINT achieves a remarkable end-to-end reduction in latency, outperforming the existing protocol on a CPU platform by 12.2x online and 2.2x offline. Meanwhile, the APINT accelerator not only reduces latency by 3.3x but also saves energy consumption by 4.6x when running PiT, compared to the state-of-the-art GC accelerator.
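To make concrete why AND gates dominate GC cost, the following back-of-the-envelope cost model may help. The constants reflect the well-known free-XOR and half-gates optimizations (XOR gates cost no ciphertexts, each AND gate costs two); the netlist representation and function name are illustrative assumptions, not APINT's implementation.

```python
# Rough garbling-traffic estimate under free-XOR + half-gates.
# Assumed netlist format: an iterable of gate-type strings.
from collections import Counter

KAPPA_BYTES = 16   # 128-bit wire labels
CTS_PER_AND = 2    # half-gates: 2 ciphertexts per AND gate
CTS_PER_XOR = 0    # free-XOR: XOR gates transmit nothing

def garbling_traffic_bytes(netlist):
    """Estimate garbled-table traffic for a netlist of gate types."""
    gates = Counter(netlist)
    ciphertexts = gates["AND"] * CTS_PER_AND + gates["XOR"] * CTS_PER_XOR
    return ciphertexts * KAPPA_BYTES

# 1,000 AND + 5,000 XOR gates -> 32,000 bytes; only the AND gates count,
# which is why circuit generation that minimizes AND gates pays off.
print(garbling_traffic_bytes(["AND"] * 1000 + ["XOR"] * 5000))
```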
Related papers
- When Less is More: Achieving Faster Convergence in Distributed Edge Machine Learning [0.0]
Distributed Machine Learning (DML) on resource-constrained edge devices holds immense potential for real-world applications.
This paper proposes Hermes, a novel probabilistic framework for efficient DML on edge devices.
Our evaluations on a real-world heterogeneous resource-constrained environment demonstrate that Hermes achieves faster convergence compared to state-of-the-art methods.
arXiv Detail & Related papers (2024-10-27T16:17:03Z)
- Rethinking and Accelerating Graph Condensation: A Training-Free Approach with Class Partition [49.41718583061147]
Graph condensation is a data-centric solution to replace the large graph with a small yet informative condensed graph.
Existing GC methods suffer from intricate optimization processes, necessitating excessive computing resources and training time.
We propose a training-free GC framework termed Class-partitioned Graph Condensation (CGC).
CGC condenses the Ogbn-products graph within 30 seconds, achieving a speedup ranging from $10^2$X to $10^4$X and increasing accuracy by up to 4.2%.
arXiv Detail & Related papers (2024-05-22T14:57:09Z)
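As a rough illustration of the class-partition idea in CGC above, here is a minimal training-free condensation sketch: node features are partitioned by class and each partition is averaged into one condensed node. The function, its signature, and the averaging rule are simplifying assumptions, not the paper's actual construction.

```python
# Hypothetical sketch: training-free, class-partitioned condensation.
import numpy as np

def condense_by_class(features, labels, parts_per_class=1):
    """features: (N, d) node features; labels: (N,) integer classes.
    Returns condensed features/labels with no training loop."""
    cond_x, cond_y = [], []
    for c in np.unique(labels):
        members = features[labels == c]
        # split the class into sub-partitions and average each one
        for part in np.array_split(members, parts_per_class):
            if len(part):
                cond_x.append(part.mean(axis=0))
                cond_y.append(c)
    return np.stack(cond_x), np.array(cond_y)
```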
- Banyan: Fast Rotating Leader BFT [20.52947785138998]
Banyan is the first rotating leader state machine replication protocol that allows transactions to be confirmed in just a single round-trip time.
We introduce a novel dual mode mechanism that enables optimal block finalization latency in the fast path.
Our evaluation reveals that Banyan reduces latency by up to 30% compared to state-of-the-art protocols.
arXiv Detail & Related papers (2023-12-10T12:32:58Z)
- Check-Agnosia based Post-Processor for Message-Passing Decoding of Quantum LDPC Codes [3.4602940992970908]
We introduce a new hardware-friendly post-processing algorithm that provides error-correction performance competitive with state-of-the-art techniques.
We show that latency values close to one microsecond can be obtained on the FPGA board, and provide evidence that much lower latency values can be obtained for ASIC implementations.
arXiv Detail & Related papers (2023-10-23T14:51:22Z)
- Architecture and protocols for all-photonic quantum repeaters [0.49157446832511503]
The all-photonic quantum repeater scheme promises resilience to photon losses and operational errors.
We propose a new emitter-photonic qubit building block and a repeater-graph-state (RGS) protocol that addresses several key considerations.
Our proposed building block significantly reduces the total number of emissive quantum memories required for end nodes.
arXiv Detail & Related papers (2023-06-06T15:08:50Z)
- Practical Conformer: Optimizing size, speed and flops of Conformer for on-Device and cloud ASR [67.63332492134332]
We design an optimized conformer that is small enough to meet on-device restrictions and has fast inference on TPUs.
Our proposed encoder can double as a strong standalone encoder on device and as the first part of a high-performance ASR pipeline.
arXiv Detail & Related papers (2023-03-31T23:30:48Z)
- LL-GNN: Low Latency Graph Neural Networks on FPGAs for High Energy Physics [45.666822327616046]
This work presents a novel reconfigurable architecture for Low Latency Graph Neural Network (LL-GNN) designs for particle detectors.
The LL-GNN design advances the next generation of trigger systems by enabling sophisticated algorithms to process experimental data efficiently.
arXiv Detail & Related papers (2022-09-28T12:55:35Z)
- PolyMPCNet: Towards ReLU-free Neural Architecture Search in Two-party Computation Based Private Inference [23.795457990555878]
Secure multi-party computation (MPC) has been discussed as a way to enable privacy-preserving deep learning (DL) computation.
MPC often comes with very high computation overhead, which can prohibit its adoption in large-scale systems.
In this work, we develop PolyMPCNet, a systematic framework for joint overhead reduction of the MPC comparison protocol and hardware acceleration.
arXiv Detail & Related papers (2022-09-20T02:47:37Z)
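In PolyMPCNet's setting above, "ReLU-free" generally means replacing comparison-based activations with polynomials, because secure comparisons dominate MPC cost while additions and multiplications are comparatively cheap. Below is a hedged sketch of a trainable degree-2 polynomial activation; the functional form and initial coefficients are illustrative assumptions, not the paper's exact design.

```python
# Hypothetical MPC-friendly activation: a*x^2 + b*x + c uses only
# additions and multiplications, avoiding costly secure comparisons.
import torch
import torch.nn as nn

class PolyAct(nn.Module):
    """Trainable degree-2 polynomial activation."""
    def __init__(self):
        super().__init__()
        self.a = nn.Parameter(torch.tensor(0.25))
        self.b = nn.Parameter(torch.tensor(0.5))
        self.c = nn.Parameter(torch.tensor(0.0))

    def forward(self, x):
        return self.a * x * x + self.b * x + self.c

# Drop-in replacement for nn.ReLU() in a model definition:
layer = nn.Sequential(nn.Linear(128, 128), PolyAct())
```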
- Implementing Reinforcement Learning Datacenter Congestion Control in NVIDIA NICs [64.26714148634228]
Congestion control (CC) algorithms become extremely difficult to design.
It is currently not possible to deploy AI models on network devices due to their limited computational capabilities.
We build a computationally-light solution based on a recent reinforcement learning CC algorithm.
arXiv Detail & Related papers (2022-07-05T20:42:24Z)
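One common way to make a learned CC policy light enough for a NIC, in the spirit of the "computationally-light solution" above, is to distill the RL policy into a tiny surrogate such as a shallow decision tree. The state features, sampling scheme, and tree size below are illustrative assumptions, not the paper's deployed design.

```python
# Hypothetical policy distillation: fit a small tree to (state, action)
# pairs generated by a trained RL congestion-control policy.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def distill_policy(policy_fn, n_samples=10_000, state_dim=4, depth=6):
    """policy_fn maps a state vector (e.g. RTT, rate, queue signals)
    to a rate-adjustment action; returns a cheap-to-evaluate tree."""
    states = np.random.rand(n_samples, state_dim)
    actions = np.array([policy_fn(s) for s in states])
    tree = DecisionTreeRegressor(max_depth=depth)
    tree.fit(states, actions)
    return tree
```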
- HANT: Hardware-Aware Network Transformation [82.54824188745887]
We propose hardware-aware network transformation (HANT).
HANT replaces inefficient operations with more efficient alternatives using a neural-architecture-search-like approach.
Our results on accelerating the EfficientNet family show that HANT can accelerate them by up to 3.6x with a 0.4% drop in top-1 accuracy on the ImageNet dataset.
arXiv Detail & Related papers (2021-07-12T18:46:34Z)
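The operation-replacement search in HANT above can be pictured as the toy loop below: for each layer, benchmark cheaper candidate operations and keep the fastest one whose output still approximates the original. The tolerance, timing method, and candidate handling are illustrative assumptions; HANT's actual procedure involves distillation and a NAS-style search.

```python
# Hypothetical per-layer replacement search.
import time
import torch

@torch.no_grad()
def try_replace(layer, candidates, x, tol=1e-2):
    """Return the fastest candidate whose output stays close to layer(x)."""
    ref = layer(x)
    best, best_time = layer, float("inf")
    for cand in candidates:
        t0 = time.perf_counter()
        out = cand(x)
        elapsed = time.perf_counter() - t0
        if out.shape != ref.shape:
            continue  # not a drop-in replacement
        if (out - ref).abs().mean().item() < tol and elapsed < best_time:
            best, best_time = cand, elapsed
    return best
```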
- EdgeBERT: Sentence-Level Energy Optimizations for Latency-Aware Multi-Task NLP Inference [82.1584439276834]
Transformer-based language models such as BERT provide significant accuracy improvement for a multitude of natural language processing (NLP) tasks.
We present EdgeBERT, an in-depth algorithm-hardware co-design for latency-aware energy optimization for multi-task NLP.
arXiv Detail & Related papers (2020-11-28T19:21:47Z)
- Gradient Coding with Dynamic Clustering for Straggler Mitigation [57.9123881133818]
GC-DC regulates the number of straggling workers in each cluster based on the straggler behavior in the previous iteration.
We numerically show that GC-DC provides significant improvements in the average completion time (of each iteration) with no increase in the communication load compared to the original GC scheme.
arXiv Detail & Related papers (2020-11-03T18:52:15Z)
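For context on the gradient-coding idea that GC-DC's dynamic clustering builds on, here is the simplest replication-based sketch: workers within a cluster redundantly compute the same gradient partition, so the master can proceed with the fastest replica from each cluster. Cluster formation here is static and uniform; GC-DC's per-iteration, behavior-driven reclustering is not modeled.

```python
# Hypothetical replication-based gradient coding with fixed clusters.
import numpy as np

def make_clusters(n_workers, cluster_size):
    """Group worker ids into clusters that share the same partition."""
    return [list(range(i, min(i + cluster_size, n_workers)))
            for i in range(0, n_workers, cluster_size)]

def recover_gradient(partial_grads, clusters):
    """partial_grads: per-worker gradient arrays, None for stragglers.
    Any one finished worker per cluster suffices to recover its piece."""
    pieces = []
    for cluster in clusters:
        done = [partial_grads[w] for w in cluster
                if partial_grads[w] is not None]
        if not done:
            raise RuntimeError("every worker in a cluster straggled")
        pieces.append(done[0])
    return np.concatenate(pieces)
```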