A Scalable Architecture for Efficient Multi-bit Fully Homomorphic Encryption
- URL: http://arxiv.org/abs/2509.12676v1
- Date: Tue, 16 Sep 2025 05:00:57 GMT
- Title: A Scalable Architecture for Efficient Multi-bit Fully Homomorphic Encryption
- Authors: Jiaao Ma, Ceyu Xu, Lisa Wu Wills
- Abstract summary: We introduce Taurus, a hardware accelerator designed to enhance the efficiency of multi-bit TFHE computations. Taurus supports ciphertexts up to 10 bits by leveraging novel FFT units and optimizing memory bandwidth through key reuse strategies. Our experimental results demonstrate that Taurus achieves up to 2600x speedup over a CPU, 1200x speedup over a GPU, and up to 7x speedup over the previous state-of-the-art accelerator.
- Score: 1.4174227043241145
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: In the era of cloud computing, privacy-preserving computation offloading is crucial for safeguarding sensitive data. Fully Homomorphic Encryption (FHE) enables secure processing of encrypted data, but the inherent computational complexity of FHE operations introduces significant computational overhead on the server side. FHE schemes often face a tradeoff between efficiency and versatility. While the CKKS scheme is highly efficient for polynomial operations, it lacks the flexibility of the binary TFHE (Torus-FHE) scheme, which offers greater versatility but at the cost of efficiency. The recent multi-bit TFHE extension offers greater flexibility and performance by supporting native non-polynomial operations and efficient integer processing. However, current implementations of multi-bit TFHE are constrained to narrow numeric representations, which prevents their adoption in applications requiring wider numeric representations. To address this challenge, we introduce Taurus, a hardware accelerator designed to enhance the efficiency of multi-bit TFHE computations. Taurus supports ciphertexts up to 10 bits by leveraging novel FFT units and optimizing memory bandwidth through key reuse strategies. We also propose a compiler with operation deduplication to improve memory utilization. Our experimental results demonstrate that Taurus achieves up to 2600x speedup over a CPU, 1200x speedup over a GPU, and up to 7x speedup over the previous state-of-the-art TFHE accelerator. Moreover, Taurus is the first accelerator to demonstrate privacy-preserving inference with large language models such as GPT-2. These advancements enable more practical and scalable applications of privacy-preserving computation in cloud environments.
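As a heavily simplified illustration of why wider multi-bit ciphertexts are expensive, the Python toy below (no cryptography; all names are illustrative and not taken from the Taurus artifact) evaluates a function on a p-bit message through a 2^p-entry lookup table, mirroring how TFHE's programmable bootstrapping encodes functions in a test polynomial whose size grows with the message width.

```python
# Toy, non-cryptographic sketch of the multi-bit TFHE idea: programmable
# bootstrapping evaluates an arbitrary function f on a p-bit message by
# selecting from a precomputed lookup table ("test polynomial") with 2^p
# entries. The lookup here is done in the clear purely to show how table
# size scales with message width; names are illustrative, not from the paper.

def make_lut(f, bits):
    """Tabulate f over the whole p-bit message space: 2^p entries."""
    return [f(m) % (1 << bits) for m in range(1 << bits)]

def pseudo_bootstrap(message, lut):
    """Stand-in for programmable bootstrapping: select lut[message].
    Real TFHE performs this selection blindly on an encrypted message,
    at a cost that grows with the table size (and hence with 2^p)."""
    return lut[message]

if __name__ == "__main__":
    bits = 10                                     # Taurus supports up to 10-bit ciphertexts
    square_lut = make_lut(lambda m: m * m, bits)  # 1024-entry table
    print(pseudo_bootstrap(37, square_lut))       # 1369 mod 1024 = 345
```

At 10 bits the table already holds 1024 entries, which is the regime the paper's FFT units and key-reuse strategies are designed to serve.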
Related papers
- Towards a Functionally Complete and Parameterizable TFHE Processor [3.907410857035328]
TFHE is a fast torus-based fully homomorphic encryption scheme.
It provides the fastest bootstrapping performance of any FHE scheme.
However, it suffers from considerably higher computational overhead for the evaluation of homomorphic circuits.
We propose an FPGA-based hardware accelerator for the evaluation of homomorphic circuits.
arXiv Detail & Related papers (2025-10-27T16:16:40Z) - FedBit: Accelerating Privacy-Preserving Federated Learning via Bit-Interleaved Packing and Cross-Layer Co-Design [2.255961793913651]
Federated learning (FL) with fully homomorphic encryption (FHE) effectively safeguards data privacy during model aggregation.
FedBit is a hardware/software co-designed framework for the Brakerski-Fan-Vercauteren (BFV) scheme.
FedBit employs bit-interleaved data packing to embed multiple model parameters into a single ciphertext coefficient.
arXiv Detail & Related papers (2025-09-27T03:58:16Z) - Decoder-Hybrid-Decoder Architecture for Efficient Reasoning with Long Generation [108.0657508755532]
We introduce the Gated Memory Unit (GMU), a simple yet effective mechanism for efficient memory sharing across layers.
We apply it to create SambaY, a decoder-hybrid-decoder architecture that incorporates GMUs to share memory readout states from a Samba-based self-decoder.
arXiv Detail & Related papers (2025-07-09T07:27:00Z) - PiT: Progressive Diffusion Transformer [50.46345527963736]
Diffusion Transformers (DiTs) achieve remarkable performance in image generation via the transformer architecture.
We find that DiTs do not rely as heavily on global information as previously believed.
We propose a series of Pseudo Progressive Diffusion Transformers (PiT).
arXiv Detail & Related papers (2025-05-19T15:02:33Z) - EFFACT: A Highly Efficient Full-Stack FHE Acceleration Platform [15.3973190088728]
EFFACT is a highly efficient full-stack FHE acceleration platform with a compiler that provides comprehensive optimizations and vector-friendly hardware.
For generality, EFFACT is also equipped with an ISA and a compiler backend that can support several FHE schemes like CKKS, BGV, and BFV.
arXiv Detail & Related papers (2025-04-22T12:01:20Z) - Task-Oriented Feature Compression for Multimodal Understanding via Device-Edge Co-Inference [54.53508601749513]
We propose a task-oriented feature compression (TOFC) method for multimodal understanding in a device-edge co-inference framework.
To enhance compression efficiency, multiple entropy models are adaptively selected based on the characteristics of the visual features.
Results show that TOFC achieves up to 52% reduction in data transmission overhead and 63% reduction in system latency.
arXiv Detail & Related papers (2025-03-17T08:37:22Z) - CIPHERMATCH: Accelerating Homomorphic Encryption-Based String Matching via Memory-Efficient Data Packing and In-Flash Processing [8.114331115730021]
Homomorphic encryption (HE) allows secure computation on encrypted data without revealing the original data.
Many cloud computing applications (e.g., DNA read mapping, biometric matching, web search) use exact string matching as a key operation.
Prior string matching algorithms that use homomorphic encryption are limited by high computational latency.
arXiv Detail & Related papers (2025-03-12T00:25:58Z) - Sliding Window Attention Training for Efficient Large Language Models [55.56483740523027]
We introduce SWAT, which enables efficient long-context handling via Sliding Window Attention Training.
This paper first attributes the inefficiency of Transformers to the attention sink phenomenon.
We replace softmax with the sigmoid function and utilize a balanced ALiBi and Rotary Position Embedding for efficient information compression and retention.
arXiv Detail & Related papers (2025-02-26T05:31:44Z) - FHEmem: A Processing In-Memory Accelerator for Fully Homomorphic Encryption [9.884698447131374]
Fully Homomorphic Encryption (FHE) is a technique that allows arbitrary computations to be performed on encrypted data without the need for decryption.
FHE is significantly slower than computation on plain data due to the increase in data size after encryption.
We propose a PIM-based FHE accelerator, FHEmem, which exploits a novel processing in-memory architecture.
arXiv Detail & Related papers (2023-11-27T20:11:38Z) - SOCI+: An Enhanced Toolkit for Secure Outsourced Computation on Integers [50.608828039206365]
We propose SOCI+, which significantly improves the performance of SOCI.
SOCI+ employs a novel (2, 2)-threshold Paillier cryptosystem with fast encryption and decryption as its cryptographic primitive (a minimal plain-Paillier sketch of the underlying additive homomorphism appears after this list).
Compared with SOCI, our experimental evaluation shows that SOCI+ is up to 5.4 times more efficient in computation and incurs 40% less communication overhead.
arXiv Detail & Related papers (2023-09-27T05:19:32Z) - REED: Chiplet-Based Accelerator for Fully Homomorphic Encryption [4.713756093611972]
We present REED, a first-of-its-kind multi-chiplet-based FHE accelerator that overcomes the limitations of prior monolithic designs.
Results demonstrate that the REED 2.5D microprocessor consumes 96.7 mm^2 of chip area and 49.4 W average power in a 7nm technology.
arXiv Detail & Related papers (2023-08-05T14:04:39Z) - Blockwise Parallel Transformer for Large Context Models [70.97386897478238]
Blockwise Parallel Transformer (BPT) leverages blockwise computation of self-attention and feedforward network fusion to minimize memory costs.
By processing longer input sequences while maintaining memory efficiency, BPT enables training sequences 32 times longer than vanilla Transformers and up to 4 times longer than previous memory-efficient methods.
arXiv Detail & Related papers (2023-05-30T19:25:51Z)
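For the SOCI+ entry above, the following minimal sketch shows plain (non-threshold) Paillier encryption and its additive homomorphism, the property such outsourced-computation toolkits build on. The paper's (2, 2)-threshold variant and its fast encryption/decryption optimizations are not reproduced here, and the tiny fixed primes are for illustration only.

```python
# Minimal plain-Paillier sketch (illustration only): multiplying ciphertexts
# corresponds to adding plaintexts. SOCI+'s (2,2)-threshold key splitting and
# its speed optimizations are not modeled; primes are toy-sized.
import math
import random

def keygen(p=1009, q=1013):                  # toy primes; real keys use ~2048-bit primes
    n = p * q
    lam = math.lcm(p - 1, q - 1)             # Carmichael's function of n
    mu = pow(lam, -1, n)                     # valid inverse because g = n + 1 below
    return (n,), (lam, mu, n)

def encrypt(pk, m):
    (n,) = pk
    r = random.randrange(2, n)
    while math.gcd(r, n) != 1:               # pick randomness coprime to n
        r = random.randrange(2, n)
    g = n + 1                                # standard simplified generator
    return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)

def decrypt(sk, c):
    lam, mu, n = sk
    x = pow(c, lam, n * n)
    return ((x - 1) // n * mu) % n           # L(x) = (x - 1) / n

pk, sk = keygen()
c1, c2 = encrypt(pk, 123), encrypt(pk, 456)
c_sum = (c1 * c2) % (pk[0] ** 2)             # multiply ciphertexts ...
assert decrypt(sk, c_sum) == 579             # ... to add the plaintexts
```

Because ciphertext multiplication maps to plaintext addition, a server can aggregate encrypted integers without learning them; toolkits such as SOCI+ extend this primitive with richer secure integer operations.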