SPADE: A SIMD Posit-enabled compute engine for Accelerating DNN Efficiency
- URL: http://arxiv.org/abs/2601.17279v1
- Date: Sat, 24 Jan 2026 03:38:11 GMT
- Title: SPADE: A SIMD Posit-enabled compute engine for Accelerating DNN Efficiency
- Authors: Sonu Kumar, Lavanya Vinnakota, Mukul Lokhande, Santosh Kumar Vishvakarma, Adam Teman
- Abstract summary: This work presents SPADE, a unified multi-precision SIMD Posit-based multiply-accumulate (MAC) architecture. Unlike prior single-precision or floating/fixed-point SIMD MACs, SPADE introduces a regime-aware, lane-fused SIMD Posit datapath. FPGA implementation on a Xilinx Virtex-7 shows 45.13% LUT and 80% slice reduction for Posit (8,0), and up to 28.44% and 17.47% improvement for Posit (16,1) and Posit (32,2) over prior work.
- Score: 0.12314765641075437
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The growing demand for edge-AI systems requires arithmetic units that balance numerical precision, energy efficiency, and compact hardware while supporting diverse formats. Posit arithmetic offers advantages over floating- and fixed-point representations through its tapered precision, wide dynamic range, and improved numerical robustness. This work presents SPADE, a unified multi-precision SIMD Posit-based multiply-accumulate (MAC) architecture supporting Posit (8,0), Posit (16,1), and Posit (32,2) within a single framework. Unlike prior single-precision or floating/fixed-point SIMD MACs, SPADE introduces a regime-aware, lane-fused SIMD Posit datapath that hierarchically reuses Posit-specific submodules (LOD, complementor, shifter, and multiplier) across 8/16/32-bit precisions without datapath replication. FPGA implementation on a Xilinx Virtex-7 shows 45.13% LUT and 80% slice reduction for Posit (8,0), and up to 28.44% and 17.47% improvement for Posit (16,1) and Posit (32,2) over prior work, with only 6.9% LUT and 14.9% register overhead for multi-precision support. ASIC results across TSMC nodes achieve 1.38 GHz at 6.1 mW (28 nm). Evaluation on MNIST, CIFAR-10/100, and alphabet datasets confirms competitive inference accuracy.
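The regime-aware datapath hinges on how a posit is decoded: a leading-one/zero detector (LOD) finds the regime run, and the remaining bits split into exponent and fraction. Below is a minimal, illustrative Python decoder for a Posit(n, es) pattern. It follows the standard posit format, not SPADE's actual RTL; the function name and structure are our own.

```python
def decode_posit(p: int, n: int = 8, es: int = 0) -> float:
    """Decode the n-bit posit pattern p (with es exponent bits) to a float."""
    mask = (1 << n) - 1
    p &= mask
    if p == 0:
        return 0.0
    if p == 1 << (n - 1):
        return float("nan")               # NaR (Not a Real)
    sign = p >> (n - 1)
    if sign:
        p = (-p) & mask                   # negative posits are two's complement
    # Regime: run of identical bits after the sign. In hardware this is
    # the leading-one/zero detector (LOD) stage that SPADE reuses per lane.
    first = (p >> (n - 2)) & 1
    run, i = 1, n - 3
    while i >= 0 and ((p >> i) & 1) == first:
        run, i = run + 1, i - 1
    k = run - 1 if first else -run
    rem = max(n - 2 - run, 0)             # bits left after the regime terminator
    tail = p & ((1 << rem) - 1)
    e_bits = min(es, rem)                 # exponent bits may be truncated away
    exp = (tail >> (rem - e_bits)) << (es - e_bits)
    f_bits = rem - e_bits
    frac = tail & ((1 << f_bits) - 1)
    mant = 1.0 + (frac / (1 << f_bits) if f_bits else 0.0)
    value = mant * 2.0 ** (k * (1 << es) + exp)
    return -value if sign else value

print(decode_posit(0b01010000, n=8, es=0))            # 1.5
print(decode_posit(0b0101100000000000, n=16, es=1))   # 3.0
```

Because the same LOD/complementor/shifter sequence appears at every width, a multi-precision design can share these stages across 8/16/32-bit lanes, which is the reuse the abstract describes.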
Related papers
- A Deployment-Friendly Foundational Framework for Efficient Computational Pathology [48.3868019137117]
We present LitePath, a framework designed to mitigate model over-parameterization and patch-level redundancy. LitePath integrates LiteFM, a compact model distilled from three large PFMs, using 190 million patches. LitePath processes 208 slides per hour, 104.5x faster than Virchow2, and consumes 0.36 kWh per 3,000 slides.
arXiv Detail & Related papers (2026-02-15T06:31:50Z)
- DS-CIM: Digital Stochastic Computing-In-Memory Featuring Accurate OR-Accumulation via Sample Region Remapping for Edge AI Models [8.92683306412944]
This paper introduces a digital stochastic computing-in-memory (DS-CIM) architecture that achieves both high accuracy and efficiency. We implement multiply-accumulation (MAC) in a compact, unsigned OR-based circuit by modifying the data representation. Our core strategy, a shared pseudo-random number generator (PRNG) with 2D sample-region remapping, enables single-cycle mutually exclusive activation to eliminate OR-gate collisions (a software model of this idea follows the entry).
arXiv Detail & Related papers (2026-01-10T23:56:33Z)
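To see why mutually exclusive activation makes OR-accumulation exact, consider a toy software model: if each lane may emit 1s only inside its own slice of the bitstream, the OR across lanes can never see two 1s in one slot, so its popcount equals the sum of lane popcounts. The sketch below is our own illustration of the idea, not the paper's circuit; the names and the slot-to-lane mapping are assumptions.

```python
import random

def stochastic_or_mac(xs, ws, n_slots=4096, seed=0):
    """Approximate sum(x_i * w_i) with an OR-accumulated stochastic MAC.

    Each lane may emit 1s only inside its own slice of the bitstream
    (mutually exclusive activation), so the OR across lanes never sees
    two 1s in the same slot and its popcount is an exact sum of lanes.
    """
    rng = random.Random(seed)        # stands in for the shared PRNG
    k = len(xs)
    region = n_slots // k            # disjoint sample region per lane
    ones = 0
    for t in range(region * k):
        lane = t // region           # slot -> lane remapping
        # AND of two Bernoulli bitstreams multiplies the encoded values.
        bit = (rng.random() < xs[lane]) and (rng.random() < ws[lane])
        # Every other lane outputs 0 in this slot, so OR-ing is collision-free.
        ones += bit
    return ones / region             # expected ones = region * sum(x_i * w_i)

xs, ws = [0.5, 0.25, 0.8, 0.1], [0.6, 0.9, 0.5, 0.3]
print(stochastic_or_mac(xs, ws))           # ~0.955
print(sum(x * w for x, w in zip(xs, ws)))  # 0.955
```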
- Speeding Up MACE: Low-Precision Tricks for Equivariant Force Fields [51.95157731126864]
Machine-learning force fields can deliver accurate molecular dynamics (MD), but at high computational cost. This thesis aims to make MACE cheaper and faster by identifying computational bottlenecks and evaluating low-precision execution policies.
arXiv Detail & Related papers (2025-10-23T14:02:34Z)
- Flex-PE: Flexible and SIMD Multi-Precision Processing Element for AI Workloads [0.0]
This work proposes a flexible and SIMD multi-precision processing element (FlexPE). The proposed design achieves an improved throughput of up to 16X FxP4, 8X FxP8, 4X FxP16, and 1X FxP32 in pipeline mode with 100% time-multiplexed hardware (a SWAR-style sketch of the lane-packing idea follows the entry).
arXiv Detail & Related papers (2024-12-16T12:25:57Z)
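The 16X/8X/4X/1X throughput scaling comes from treating one wide datapath as several narrow lanes. Below is a minimal SWAR (SIMD-within-a-register) sketch of that lane-packing idea, assuming four 8-bit lanes in a 32-bit word; it illustrates the principle, not FlexPE's actual pipeline.

```python
def swar_add8(a: int, b: int) -> int:
    """Add four packed 8-bit lanes, blocking carries at lane boundaries."""
    H, L = 0x80808080, 0x7F7F7F7F
    low = (a & L) + (b & L)                    # per-lane add of the low 7 bits
    return (low ^ ((a ^ b) & H)) & 0xFFFFFFFF  # fold lane MSBs back in mod 2

a = 0x40302010                           # lanes 0x40, 0x30, 0x20, 0x10
b = 0x01020304                           # lanes 0x01, 0x02, 0x03, 0x04
print(hex(swar_add8(a, b)))              # 0x41322314
```

Masking off each lane's MSB before the add is what keeps a carry in one lane from corrupting its neighbor, the same boundary problem a multi-precision hardware adder must gate.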
- "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization [67.3213104337679]
Quantization is a powerful tool for accelerating large language model (LLM) inference, but the accuracy-performance trade-offs across different formats remain unclear. We conduct the most comprehensive empirical study to date, evaluating FP8, INT8, and INT4 quantization across academic benchmarks and real-world tasks.
arXiv Detail & Related papers (2024-11-04T18:21:59Z)
- ParFormer: A Vision Transformer with Parallel Mixer and Sparse Channel Attention Patch Embedding [9.144813021145039]
This paper introduces ParFormer, a vision transformer that incorporates a Parallel Mixer and a Sparse Channel Attention Patch Embedding (SCAPE).
ParFormer improves feature extraction by combining convolutional and attention mechanisms.
For edge device deployment, ParFormer-T excels with a throughput of 278.1 images/sec, which is 1.38x higher than EdgeNeXt-S.
The larger variant, ParFormer-L, reaches 83.5% Top-1 accuracy, offering a balanced trade-off between accuracy and efficiency.
arXiv Detail & Related papers (2024-03-22T07:32:21Z)
- DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup-table-based approach for executing ultra-low-precision convolutional neural networks on SIMD hardware (a minimal LUT dot-product sketch follows the entry).
Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
arXiv Detail & Related papers (2023-04-18T15:13:10Z)
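The core trick is that at ultra-low precision the set of possible products is tiny, so multiplies can be replaced by table lookups; DeepGEMM performs those lookups with SIMD shuffle instructions, while the plain-Python sketch below, with assumed 2-bit codebooks, shows only the LUT idea.

```python
from itertools import product

# Assumed 2-bit codebooks for illustration (not DeepGEMM's actual levels).
W_LEVELS = [-2, -1, 1, 2]
A_LEVELS = [0, 1, 2, 3]

# Precompute every weight x activation product once: 4 x 4 = 16 entries.
LUT = {(w, a): W_LEVELS[w] * A_LEVELS[a]
       for w, a in product(range(4), range(4))}

def lut_dot(w_codes, a_codes):
    """Dot product over 2-bit codes via table lookups instead of multiplies."""
    return sum(LUT[w, a] for w, a in zip(w_codes, a_codes))

print(lut_dot([0, 3, 1, 2], [1, 2, 0, 3]))   # -2*1 + 2*2 - 1*0 + 1*3 = 5
```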
- Rethinking Floating Point Overheads for Mixed Precision DNN Accelerators [2.6487352458568507]
We propose a mixed-precision convolution unit architecture that supports different integer and floating-point (FP) precisions.
We show how to integrate FP computations on an integer-based architecture and evaluate the overheads incurred by FP arithmetic support (a sketch of an FP multiply built from integer ops follows the entry).
arXiv Detail & Related papers (2021-01-27T23:57:43Z)
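The reason FP support can ride on an integer datapath is that a normalized FP multiply decomposes into an integer mantissa multiply, an exponent add, and a renormalization shift. Here is a minimal sketch of that generic decomposition under assumed (exponent, mantissa) inputs; it is not the paper's actual unit.

```python
def fp_mul_on_int_hw(ea: int, ma: int, eb: int, mb: int, mbits: int = 10):
    """One FP multiply expressed with the ops an integer MAC already has.

    Inputs are (exponent, mantissa) pairs with an implicit leading one,
    i.e. value = (1 + m / 2**mbits) * 2**e.
    """
    prod = ((1 << mbits) + ma) * ((1 << mbits) + mb)  # integer multiply
    exp = ea + eb                                     # integer add
    if prod >> (2 * mbits + 1):                       # product in [2, 4): shift
        prod >>= 1
        exp += 1
    frac = (prod >> mbits) - (1 << mbits)             # truncate back to mbits
    return exp, frac

# (1.5 * 2^1) * (1.25 * 2^0) = 3.75 = 1.875 * 2^1
e, f = fp_mul_on_int_hw(1, 512, 0, 256)   # 512/1024 = .5, 256/1024 = .25
print(e, 1 + f / 1024)                    # 1 1.875
```

The extra exponent adder, normalization shifter, and rounding logic around the shared integer multiplier are exactly the kind of overhead such papers set out to measure.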
- Non-Parametric Adaptive Network Pruning [125.4414216272874]
We introduce non-parametric modeling to simplify the algorithm design.
Inspired by the face recognition community, we use a message passing algorithm to obtain an adaptive number of exemplars.
EPruner breaks the dependency on the training data in determining the "important" filters.
arXiv Detail & Related papers (2021-01-20T06:18:38Z)
- HAWQV3: Dyadic Neural Network Quantization [73.11579145354801]
Current low-precision quantization algorithms often have the hidden cost of conversion back and forth from floating point to quantized integer values.
We present HAWQV3, a novel mixed-precision integer-only quantization framework (a dyadic-rescaling sketch follows the entry).
arXiv Detail & Related papers (2020-11-20T23:51:43Z)
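"Dyadic" means every rescale factor is a rational b / 2^c, so requantizing an integer accumulator needs only an integer multiply and a shift, never a float. A minimal sketch of that rescaling, with assumed names and a 16-bit shift:

```python
def dyadic_approx(scale: float, c: int = 16):
    """Approximate a real rescale factor as the dyadic number b / 2**c."""
    return round(scale * (1 << c)), c

def requantize(acc: int, b: int, c: int) -> int:
    """Integer-only rescale of an accumulator: round(acc * b / 2**c)."""
    return (acc * b + (1 << (c - 1))) >> c   # add half an ulp, then shift

b, c = dyadic_approx(0.0073)    # e.g. a combined scale like s_w * s_x / s_out
print(requantize(12345, b, c))  # 90
print(12345 * 0.0073)           # 90.1185 (the FP reference)
```

This removes the hidden float-to-integer conversion cost that the entry's first sentence describes.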
- SIMDive: Approximate SIMD Soft Multiplier-Divider for FPGAs with Tunable Accuracy [3.4154033825543055]
This paper presents, for the first time, a SIMD architecture based on a novel multiplier and divider with tunable accuracy.
The proposed hybrid architecture implements Mitchell's algorithms and supports precision variability from 8 to 32 bits (a minimal Mitchell-multiplier sketch follows the entry).
arXiv Detail & Related papers (2020-11-02T17:40:44Z)
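Mitchell's method approximates log2(2^k * (1 + x)) by k + x, turning a multiply into an add of two approximate logs plus an antilog. Below is a minimal integer-only sketch of the multiplier half (the divider variant subtracts logs instead); the function name and fixed-point layout are our own.

```python
def mitchell_mul(a: int, b: int) -> int:
    """Approximate a * b with Mitchell's log-multiplication.

    log2(2**k * (1 + x)) is approximated by k + x, so a multiply becomes
    an add of two approximate logs plus an antilog (worst error ~11%).
    """
    if a == 0 or b == 0:
        return 0
    ka, kb = a.bit_length() - 1, b.bit_length() - 1   # leading-one positions
    xa, xb = a - (1 << ka), b - (1 << kb)             # fractional parts * 2**k
    k = ka + kb
    s = (xa << kb) + (xb << ka)                       # (x_a + x_b) * 2**k
    if s < (1 << k):                                  # fraction sum < 1
        return (1 << k) + s
    return s << 1                                     # carry into the exponent

print(mitchell_mul(25, 14), 25 * 14)                  # 336 vs 350
```

Tunable-accuracy designs typically refine this estimate with extra correction terms, trading LUTs for lower error.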
This list is automatically generated from the titles and abstracts of the papers on this site.