ALLMod: Exploring $\underline{\mathbf{A}}$rea-Efficiency of $\underline{\mathbf{L}}$UT-based $\underline{\mathbf{L}}$arge Number $\underline{\mathbf{Mod}}$ular Reduction via Hybrid Workloads
- URL: http://arxiv.org/abs/2503.15916v2
- Date: Tue, 27 May 2025 07:03:31 GMT
- Title: ALLMod: Exploring $\underline{\mathbf{A}}$rea-Efficiency of $\underline{\mathbf{L}}$UT-based $\underline{\mathbf{L}}$arge Number $\underline{\mathbf{Mod}}$ular Reduction via Hybrid Workloads
- Authors: Fangxin Liu, Haomin Li, Zongwu Wang, Bo Zhang, Mingzhe Zhang, Shoumeng Yan, Li Jiang, Haibing Guan,
- Abstract summary: High-bit-width operations are crucial for enhancing security.<n>They are computationally intensive due to the large number of modular operations required.<n>AllMod is a novel approach that improves the area efficiency of LUT-based large-number modular reduction.
- Score: 18.634794494170617
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modular arithmetic, particularly modular reduction, is widely used in cryptographic applications such as homomorphic encryption (HE) and zero-knowledge proofs (ZKP). High-bit-width operations are crucial for enhancing security; however, they are computationally intensive due to the large number of modular operations required. The lookup-table-based (LUT-based) approach, a ``space-for-time'' technique, reduces computational load by segmenting the input number into smaller bit groups, pre-computing modular reduction results for each segment, and storing these results in LUTs. While effective, this method incurs significant hardware overhead due to extensive LUT usage. In this paper, we introduce ALLMod, a novel approach that improves the area efficiency of LUT-based large-number modular reduction by employing hybrid workloads. Inspired by the iterative method, ALLMod splits the bit groups into two distinct workloads, achieving lower area costs without compromising throughput. We first develop a template to facilitate workload splitting and ensure balanced distribution. Then, we conduct design space exploration to evaluate the optimal timing for fusing workload results, enabling us to identify the most efficient design under specific constraints. Extensive evaluations show that ALLMod achieves up to $1.65\times$ and $3\times$ improvements in area efficiency over conventional LUT-based methods for bit-widths of $128$ and $8,192$, respectively.
Related papers
- FHECore: Rethinking GPU Microarchitecture for Fully Homomorphic Encryption [2.7777199166440827]
Fully Homomorphic Encryption (FHE) enables computation directly on encrypted data but incurs massive computational and memory overheads.<n>Custom accelerators can mitigate these costs, but their long time-to-market and the rapid evolution of FHE algorithms threaten their long-term relevance.<n>We propose FHECore, a specialized functional unit integrated directly into the GPU's Streaming Multiprocessor.
arXiv Detail & Related papers (2026-02-10T02:55:10Z) - StelLA: Subspace Learning in Low-rank Adaptation using Stiefel Manifold [51.93627542334909]
Low-rank adaptation (LoRA) has been widely adopted as a parameter-efficient technique for fine-tuning large-scale pre-trained models.<n>We propose a geometry-aware extension of LoRA that uses a three-factor decomposition $U!SVtop$.
arXiv Detail & Related papers (2025-10-02T11:59:13Z) - MaskPro: Linear-Space Probabilistic Learning for Strict (N:M)-Sparsity on Large Language Models [53.36415620647177]
Semi-structured sparsity offers a promising solution by strategically retaining $N$ elements out of every $M$ weights.<n>Existing (N:M)-compatible approaches typically fall into two categories: rule-based layerwise greedy search, which suffers from considerable errors, and gradient-driven learning, which incurs prohibitive training costs.<n>We propose a novel linear-space probabilistic framework named MaskPro, which aims to learn a prior categorical distribution for every $M$ consecutive weights and subsequently leverages this distribution to generate the (N:M)-sparsity throughout an $N$-way sampling
arXiv Detail & Related papers (2025-06-15T15:02:59Z) - Algorithms for the Shortest Vector Problem in $2$-dimensional Lattices, Revisited [4.843809993270313]
Efficiently solving the Shortest Vector Problem (SVP) in two-dimensional lattices holds practical significance in cryptography and computational geometry.
We develop an efficient adaptive lattice reduction algorithm textbfCrossEuc that strategically applies the Euclidean algorithm across dimensions.
By iteratively invoking textbfHVec, our optimized algorithm textbfHVecSBP achieves a reduced basis in $O(log n M(n) )$ time for arbitrary input bases with bit-length $n$.
arXiv Detail & Related papers (2025-04-17T13:50:51Z) - Near-Optimal Online Learning for Multi-Agent Submodular Coordination: Tight Approximation and Communication Efficiency [52.60557300927007]
We present a $textbfMA-OSMA$ algorithm to transfer the discrete submodular problem into a continuous optimization.<n>We also introduce a projection-free $textbfMA-OSEA$ algorithm, which effectively utilizes the KL divergence by mixing a uniform distribution.<n>Our algorithms significantly improve the $(frac11+c)$-approximation provided by the state-of-the-art OSG algorithm.
arXiv Detail & Related papers (2025-02-07T15:57:56Z) - LUT-DLA: Lookup Table as Efficient Extreme Low-Bit Deep Learning Accelerator [11.167930856636161]
We introduce LUT-DLA, a Look-Up Table (LUT) Deep Learning Accelerator Framework that utilizes vector quantization to convert neural network models into LUTs.<n>We show that LUT-DLA achieves improvements in power efficiency and area efficiency with gains of $1.4$$7.0times$ and $1.5$$146.1times$, respectively.
arXiv Detail & Related papers (2025-01-18T05:27:25Z) - Duo-LLM: A Framework for Studying Adaptive Computation in Large Language Models [16.16372459671255]
Large Language Models (LLMs) typically generate outputs token by token using a fixed compute budget.
We propose a novel framework that integrates smaller auxiliary modules within each Feed-Forward Network layer of the LLM.
We show that trained routers operate differently from oracles and often yield suboptimal solutions.
arXiv Detail & Related papers (2024-10-01T16:10:21Z) - LUT Tensor Core: A Software-Hardware Co-Design for LUT-Based Low-Bit LLM Inference [10.608817382813786]
Mixed-precision matrix (mpGEMM) is an important yet underexplored operation involving the multiplication of lower-precision weights with higher-precision activations.<n>Off-the-shelf hardware does not support this operation, leading to indirect, thus inefficient, dequantization-based implementations.<n>We study the lookup table (LUT)-based approach for mpGEMM and find that a conventional LUT implementation fails to achieve the promised gains.
arXiv Detail & Related papers (2024-08-12T08:52:14Z) - Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs [61.40047491337793]
We present Hierarchical cOntext MERging (HOMER), a new training-free scheme designed to overcome the limitations of large language models.
HomeR uses a divide-and-conquer algorithm, dividing long inputs into manageable chunks.
A token reduction technique precedes each merging, ensuring memory usage efficiency.
arXiv Detail & Related papers (2024-04-16T06:34:08Z) - Low-Complexity Integer Divider Architecture for Homomorphic Encryption [5.857929080874288]
Homomorphic encryption (HE) allows computations to be directly carried out on ciphertexts and enables privacy-preserving cloud computing.
An algorithm is proposed to compute the quotient and vigorous mathematical proofs are provided.
arXiv Detail & Related papers (2024-01-19T23:53:59Z) - Area Efficient Modular Reduction in Hardware for Arbitrary Static Moduli [3.217374402111224]
We propose a novel approach for computing modular reduction efficiently in hardware for arbitrary static moduli.
Our method can be executed in constant time, which is essential for cryptography applications.
arXiv Detail & Related papers (2023-08-29T07:26:20Z) - Reconstructed Convolution Module Based Look-Up Tables for Efficient
Image Super-Resolution [9.715421499605934]
Look-up table(LUT)-based methods have shown the great efficacy in single image super-resolution (SR) task.
Previous methods ignore the essential reason of restricted receptive field (RF) size in LUT.
We propose a novel Reconstructed Convolution(RC) module, which decouples channel-wise and spatial calculation.
arXiv Detail & Related papers (2023-07-17T15:04:00Z) - Communication-Efficient Adam-Type Algorithms for Distributed Data Mining [93.50424502011626]
We propose a class of novel distributed Adam-type algorithms (emphi.e., SketchedAMSGrad) utilizing sketching.
Our new algorithm achieves a fast convergence rate of $O(frac1sqrtnT + frac1(k/d)2 T)$ with the communication cost of $O(k log(d))$ at each iteration.
arXiv Detail & Related papers (2022-10-14T01:42:05Z) - Quantum Goemans-Williamson Algorithm with the Hadamard Test and
Approximate Amplitude Constraints [62.72309460291971]
We introduce a variational quantum algorithm for Goemans-Williamson algorithm that uses only $n+1$ qubits.
Efficient optimization is achieved by encoding the objective matrix as a properly parameterized unitary conditioned on an auxilary qubit.
We demonstrate the effectiveness of our protocol by devising an efficient quantum implementation of the Goemans-Williamson algorithm for various NP-hard problems.
arXiv Detail & Related papers (2022-06-30T03:15:23Z) - Minimax Optimal Quantization of Linear Models: Information-Theoretic
Limits and Efficient Algorithms [59.724977092582535]
We consider the problem of quantizing a linear model learned from measurements.
We derive an information-theoretic lower bound for the minimax risk under this setting.
We show that our method and upper-bounds can be extended for two-layer ReLU neural networks.
arXiv Detail & Related papers (2022-02-23T02:39:04Z) - 1$\times$N Block Pattern for Network Sparsity [90.43191747596491]
We propose one novel concept of $1times N$ block sparsity pattern (block pruning) to break this limitation.
Our pattern obtains about 3.0% improvements over filter pruning in the top-1 accuracy of MobileNet-V2.
It also obtains 56.04ms inference savings on Cortex-A7 CPU over weight pruning.
arXiv Detail & Related papers (2021-05-31T05:50:33Z) - Provably Efficient Reinforcement Learning for Discounted MDPs with
Feature Mapping [99.59319332864129]
In this paper, we study reinforcement learning for discounted Decision (MDP)
We propose a novel algorithm that makes use of the feature mapping and obtains a $tilde O(dsqrtT/ (1-gamma)2)$ regret.
Our upper and lower bound results together suggest that the proposed reinforcement learning algorithm is near-optimal up to a $ (1-gamma)-0.5$ factor.
arXiv Detail & Related papers (2020-06-23T17:08:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.