FERMI-ML: A Flexible and Resource-Efficient Memory-In-Situ SRAM Macro for TinyML acceleration
- URL: http://arxiv.org/abs/2511.12544v1
- Date: Sun, 16 Nov 2025 10:39:42 GMT
- Title: FERMI-ML: A Flexible and Resource-Efficient Memory-In-Situ SRAM Macro for TinyML acceleration
- Authors: Mukul Lokhande, Akash Sankhe, S. V. Jaya Chand, Santosh Kumar Vishvakarma
- Abstract summary: FERMI-ML is a Memory-In-Situ macro capable of supporting mixed-precision TinyML workloads. Results at 65 nm show operation at 350 MHz with 0.9 V, delivering a throughput of 1.93 TOPS and an energy efficiency of 364 TOPS/W.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The growing demand for low-power and area-efficient TinyML inference on AIoT devices necessitates memory architectures that minimise data movement while sustaining high computational efficiency. This paper presents FERMI-ML, a Flexible and Resource-Efficient Memory-In-Situ (MIS) SRAM macro designed for TinyML acceleration. The proposed 9T XNOR-based RX9T bit-cell integrates a 5T storage cell with a 4T XNOR compute unit, enabling variable-precision MAC and CAM operations within the same array. A 22-transistor (C22T) compressor-tree-based accumulator facilitates logarithmic 1-64-bit MAC computation with reduced delay and power compared to conventional adder trees. The 4 KB macro achieves dual functionality for in-situ computation and CAM-based lookup operations, supporting Posit-4 or FP-4 precision. Post-layout results at 65 nm show operation at 350 MHz with 0.9 V, delivering a throughput of 1.93 TOPS and an energy efficiency of 364 TOPS/W, while maintaining a Quality-of-Result (QoR) above 97.5% with InceptionV4 and ResNet-18. FERMI-ML thus demonstrates a compact, reconfigurable, and energy-aware digital Memory-In-Situ macro capable of supporting mixed-precision TinyML workloads.
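The arithmetic behind an XNOR-based bit-cell feeding a compressor-tree accumulator can be sketched in software: for binarized (+1/-1) operands a MAC reduces to an XNOR followed by a population count, and wider precisions can be composed from single-bit partial products weighted by shift-and-add. The sketch below is only a behavioural illustration of those identities, assuming a bit-plane decomposition for the multi-bit case; it is not the RX9T/C22T circuit, and the function names are made up for illustration.

```python
import numpy as np

def xnor_popcount_dot(a_bits: np.ndarray, b_bits: np.ndarray) -> int:
    """Binary dot product via XNOR + popcount.

    a_bits, b_bits hold 0/1 codes for -1/+1 operands; for such vectors
    dot(a, b) = 2 * popcount(XNOR(a, b)) - N.
    """
    matches = int(np.count_nonzero(~(a_bits ^ b_bits) & 1))  # popcount of XNOR
    return 2 * matches - a_bits.size

def bit_plane_mac(acts: np.ndarray, wts: np.ndarray, act_bits: int, wt_bits: int) -> int:
    """Multi-bit MAC composed from single-bit partial products (shift-and-add).

    Unsigned operands are split into bit planes; each plane pair contributes a
    popcount of ANDed bits, weighted by its power of two.
    """
    acc = 0
    for i in range(act_bits):
        a_plane = (acts >> i) & 1
        for j in range(wt_bits):
            w_plane = (wts >> j) & 1
            acc += int(np.count_nonzero(a_plane & w_plane)) << (i + j)
    return acc

rng = np.random.default_rng(0)
a = rng.integers(0, 2, size=64)
b = rng.integers(0, 2, size=64)
assert xnor_popcount_dot(a, b) == int(np.dot(2 * a - 1, 2 * b - 1))

acts = rng.integers(0, 16, size=64)  # 4-bit unsigned activations
wts = rng.integers(0, 16, size=64)   # 4-bit unsigned weights
assert bit_plane_mac(acts, wts, 4, 4) == int(np.dot(acts, wts))
```

In hardware the popcounts would come from the compressor tree rather than software loops; the point of the sketch is only the mapping from XNOR/AND partial products to a multi-bit MAC.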
Related papers
- LFM2 Technical Report [87.58431408281973]
We present LFM2, a family of Liquid Foundation Models designed for efficient on-device deployment and strong task capabilities. The LFM2 family covers 350M-8.3B parameters, including dense models (350M, 700M, 1.2B, 2.6B) and a mixture-of-experts variant (8.3B total, 1.5B active). We build multimodal and retrieval variants: LFM2-VL for vision-language tasks, LFM2-Audio for speech, and LFM2-ColBERT for retrieval.
arXiv Detail & Related papers (2025-11-28T17:56:35Z) - MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models [58.3342517278868]
This paper describes the design of Mixed-precision AutoRegressive LINear kernels.
It shows that batch sizes up to 16-32 can be supported with close to the maximum ($4\times$) quantization speedup.
MARLIN accomplishes this via a combination of techniques, such as asynchronous memory access, complex task scheduling and pipelining.
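The "close to the maximum ($4\times$)" figure reflects that small-batch linear layers are memory-bound: packing weights into 4-bit codes moves roughly a quarter of the bytes that fp16 would, which caps the attainable speedup at about 4x. The sketch below only illustrates that storage-format arithmetic with a single symmetric scale; it is not MARLIN's fused GPU kernel, and the helper names and quantizer are assumptions for illustration.

```python
import numpy as np

def pack_int4(q: np.ndarray) -> np.ndarray:
    """Pack unsigned 4-bit codes (values 0..15) two per byte: 4x smaller than fp16."""
    q = q.astype(np.uint8)
    return (q[0::2] | (q[1::2] << 4)).astype(np.uint8)

def unpack_int4(packed: np.ndarray) -> np.ndarray:
    """Recover the 4-bit codes from the packed byte stream."""
    lo = packed & 0x0F
    hi = packed >> 4
    return np.stack([lo, hi], axis=1).reshape(-1)

# Quantize fp16 weights to 4-bit codes with one symmetric scale (illustration only).
w = np.random.randn(4096).astype(np.float16)
scale = float(np.abs(w).max()) / 7.0
codes = np.clip(np.round(w.astype(np.float32) / scale), -8, 7).astype(np.int8) + 8  # offset-binary

packed = pack_int4(codes.astype(np.uint8))
restored = (unpack_int4(packed).astype(np.float32) - 8) * scale

print(f"fp16 bytes: {w.nbytes}, packed int4 bytes: {packed.nbytes}")  # ~4x reduction
print(f"max dequantization error: {np.max(np.abs(restored - w.astype(np.float32))):.4f}")
```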
arXiv Detail & Related papers (2024-08-21T16:10:41Z) - Fast Matrix Multiplications for Lookup Table-Quantized LLMs [58.11584672945781]
FLUTE is a flexible lookup table engine for LUT-quantized LLMs. At batch sizes below 32 and a quantization group size of 128, the FLUTE kernel can be 2-4x faster than existing GEMM kernels.
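The core idea of a lookup-table engine is that non-uniform quantization stores each weight as a small code indexing a per-group table of reconstruction values, so dequantization is a table gather rather than arithmetic. The sketch below shows that reconstruction step in plain numpy; the group size of 128 mirrors the figure quoted above, while the codebook construction and helper names are illustrative assumptions rather than FLUTE's actual kernel.

```python
import numpy as np

GROUP = 128  # quantization group size, as in the quoted configuration

def lut_dequantize(codes: np.ndarray, tables: np.ndarray) -> np.ndarray:
    """Reconstruct weights from 4-bit codes via per-group lookup tables.

    codes:  (n_groups, GROUP) uint8 values in 0..15
    tables: (n_groups, 16) float reconstruction values, one table per group
    """
    return np.take_along_axis(tables, codes.astype(np.int64), axis=1)

# Toy example: one row of weights split into groups of 128.
rng = np.random.default_rng(1)
w = rng.standard_normal(1024).astype(np.float32).reshape(-1, GROUP)

# Illustrative non-uniform codebook per group: 16 quantile-based levels.
tables = np.quantile(w, np.linspace(0.0, 1.0, 16), axis=1).T.astype(np.float32)
codes = np.abs(w[:, :, None] - tables[:, None, :]).argmin(axis=2).astype(np.uint8)

w_hat = lut_dequantize(codes, tables)
print("mean abs reconstruction error:", float(np.mean(np.abs(w_hat - w))))
# A LUT-quantized matmul would then use w_hat (or fuse the gather into the GEMM on GPU).
```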
arXiv Detail & Related papers (2024-07-15T17:55:42Z) - A 137.5 TOPS/W SRAM Compute-in-Memory Macro with 9-b Memory Cell-Embedded ADCs and Signal Margin Enhancement Techniques for AI Edge Applications [20.74979295607707]
The CIM macro can perform 4x4-bit MAC operations and yield a 9-bit signed output.
Inherent discharge branches of the cells are utilized to apply time-modulated MAC and 9-bit ADC readout operations.
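Because the time-modulated MAC and the ADC readout are analog mechanisms, the simplest way to see what the macro computes is a digital reference model: a column of signed 4-bit-by-4-bit products is accumulated and reported as a 9-bit signed value. The sketch below is such a behavioural reference only; the column length and the saturating readout are assumptions for illustration, not details from the paper.

```python
import numpy as np

def cim_column_mac(acts: np.ndarray, wts: np.ndarray) -> int:
    """Behavioural model of one CIM column: signed 4b x 4b MAC with a 9-bit signed readout.

    acts, wts: signed 4-bit values in [-8, 7]. Accumulation length and clipping
    behaviour are illustrative assumptions.
    """
    assert acts.min() >= -8 and acts.max() <= 7
    assert wts.min() >= -8 and wts.max() <= 7
    acc = int(np.dot(acts.astype(np.int32), wts.astype(np.int32)))
    # 9-bit signed readout: representable range is [-256, 255]; saturate outside it.
    return max(-256, min(255, acc))

rng = np.random.default_rng(2)
acts = rng.integers(-8, 8, size=16)  # 16-row column (assumed length)
wts = rng.integers(-8, 8, size=16)
print("9-bit signed MAC readout:", cim_column_mac(acts, wts))
```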
arXiv Detail & Related papers (2023-07-12T06:20:19Z) - SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression [76.73007709690306]
We introduce the Sparse-Quantized Representation (SpQR), a new compressed format and quantization technique.
SpQR achieves relative accuracy losses of less than 1% in perplexity for highly-accurate LLaMA and Falcon LLMs.
This makes it possible to run a 33B-parameter LLM on a single 24 GB consumer GPU without any performance degradation, while providing a 15% speedup.
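The format name describes the decomposition: most weights live in a low-bit quantized dense matrix, while the few outliers that quantize poorly are kept at full precision in a sparse side structure and patched back in at reconstruction. The sketch below illustrates that split on a toy matrix; the 1% outlier fraction, 3-bit codes, and per-row scales are illustrative assumptions rather than SpQR's exact layout.

```python
import numpy as np

def spqr_like_split(w: np.ndarray, outlier_frac: float = 0.01, bits: int = 3):
    """Split weights into a low-bit quantized dense part plus sparse fp32 outliers."""
    thresh = np.quantile(np.abs(w), 1.0 - outlier_frac)
    outlier_idx = np.flatnonzero(np.abs(w).ravel() >= thresh)
    outlier_val = w.ravel()[outlier_idx].copy()

    base = w.copy().ravel()
    base[outlier_idx] = 0.0       # outliers are removed before quantization
    base = base.reshape(w.shape)

    # Per-row symmetric quantization of the non-outlier part.
    qmax = 2 ** (bits - 1) - 1
    scales = np.maximum(np.abs(base).max(axis=1, keepdims=True), 1e-8) / qmax
    codes = np.clip(np.round(base / scales), -qmax - 1, qmax).astype(np.int8)
    return codes, scales, (outlier_idx, outlier_val)

def spqr_like_reconstruct(codes, scales, outliers, shape):
    idx, val = outliers
    w_hat = (codes.astype(np.float32) * scales).ravel()
    w_hat[idx] = val              # outliers overwrite the quantized values at full precision
    return w_hat.reshape(shape)

w = np.random.default_rng(3).standard_normal((128, 128)).astype(np.float32)
codes, scales, outliers = spqr_like_split(w)
w_hat = spqr_like_reconstruct(codes, scales, outliers, w.shape)
print("mean abs error:", float(np.mean(np.abs(w_hat - w))))
print("weights kept in fp32:", len(outliers[1]), "of", w.size)
```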
arXiv Detail & Related papers (2023-06-05T17:53:28Z) - RedMulE: A Mixed-Precision Matrix-Matrix Operation Engine for Flexible and Energy-Efficient On-Chip Linear Algebra and TinyML Training Acceleration [15.869673535117032]
Current training algorithms rely on floating-point matrix operations to meet the precision and dynamic range requirements.
RedMulE is a low-power specialized accelerator conceived for multi-precision floating-point General Matrix-Matrix Operations (GEMM-Ops) acceleration.
RedMulE achieves up to 58.5 GFLOPS and 117 GFLOPS for FP16 and FP8, respectively, with 99.4% utilization of the array of Computing Elements.
arXiv Detail & Related papers (2023-01-10T11:07:16Z) - A Charge Domain P-8T SRAM Compute-In-Memory with Low-Cost DAC/ADC Operation for 4-bit Input Processing [4.054285623919103]
This paper presents a low cost PMOS-based 8T (P-8T) Compute-In-Memory (CIM) architecture.
It efficiently performs the multiply-accumulate (MAC) operations between 4-bit input activations and 8-bit weights.
The 256x80 P-8T CIM macro implemented in a 28 nm CMOS process shows accuracies of 91.46% and 66.67%.
arXiv Detail & Related papers (2022-11-29T08:15:27Z) - A 65nm 8b-Activation 8b-Weight SRAM-Based Charge-Domain Computing-in-Memory Macro Using A Fully-Parallel Analog Adder Network and A Single-ADC Interface [16.228299091691873]
Computing-in-memory (CiM) is a promising mitigation approach by enabling multiply-accumulate operations within the memory.
This work achieves 51.2 GOPS throughput and 10.3 TOPS/W energy efficiency, while showing 88.6% accuracy on the CIFAR-10 dataset.
arXiv Detail & Related papers (2022-11-23T07:52:10Z) - LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale [80.86029795281922]
We develop a procedure for Int8 matrix multiplication for feed-forward and attention projection layers in transformers.
A 175B parameter 16/32-bit checkpoint can be loaded, converted to Int8, and used immediately without performance degradation.
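The "loaded, converted to Int8, and used immediately" workflow rests on post-training vector-wise absmax quantization: each row gets its own scale onto the int8 grid, the matmul accumulates in int32, and the scales are undone afterwards, with the original method additionally keeping outlier feature dimensions in fp16. The sketch below shows only the absmax part in numpy and omits the outlier decomposition, so it is a simplification of, not a substitute for, the paper's procedure.

```python
import numpy as np

def absmax_quantize_rows(m: np.ndarray):
    """Row-wise (vector-wise) absmax quantization to int8, returning codes and scales."""
    scales = np.maximum(np.abs(m).max(axis=1, keepdims=True), 1e-8) / 127.0
    q = np.clip(np.round(m / scales), -127, 127).astype(np.int8)
    return q, scales.astype(np.float32)

def int8_linear(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """y = x @ w.T with both operands quantized to int8 and accumulated in int32."""
    qx, sx = absmax_quantize_rows(x)   # one scale per input row
    qw, sw = absmax_quantize_rows(w)   # one scale per output feature (row of w)
    acc = qx.astype(np.int32) @ qw.astype(np.int32).T
    return acc.astype(np.float32) * sx * sw.T  # undo both sets of scales

rng = np.random.default_rng(4)
w = rng.standard_normal((256, 512)).astype(np.float32) * 0.02
x = rng.standard_normal((8, 512)).astype(np.float32)
y_ref = x @ w.T
y_q = int8_linear(x, w)
print("max abs deviation from fp32:", float(np.max(np.abs(y_q - y_ref))))
```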
arXiv Detail & Related papers (2022-08-15T17:08:50Z) - A TinyML Platform for On-Device Continual Learning with Quantized Latent Replays [66.62377866022221]
Latent Replay-based Continual Learning (CL) techniques enable online, serverless adaptation in principle.
We introduce a HW/SW platform for end-to-end CL based on a 10-core FP32-enabled parallel ultra-low-power processor.
Our results show that by combining these techniques, continual learning can be achieved in practice using less than 64MB of memory.
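Latent replay avoids storing raw inputs: activations from a frozen front-end are captured once, quantized to shrink the replay memory, and mixed into each new minibatch so that only the trainable back-end is updated. The sketch below shows the replay-buffer bookkeeping and the saving from 8-bit storage; the buffer size, latent width, and quantizer are illustrative assumptions, not the platform described in the paper.

```python
import numpy as np

class QuantizedLatentReplayBuffer:
    """Stores front-end activations as uint8 so replay memory stays small."""

    def __init__(self, capacity: int, latent_dim: int):
        self.codes = np.zeros((capacity, latent_dim), dtype=np.uint8)
        self.labels = np.zeros(capacity, dtype=np.int64)
        self.scales = np.zeros((capacity, 1), dtype=np.float32)
        self.count = 0

    def add(self, latents: np.ndarray, labels: np.ndarray) -> None:
        """Quantize per sample with an absmax scale and store (ring-buffer overwrite)."""
        for z, y in zip(latents, labels):
            i = self.count % self.codes.shape[0]
            s = max(float(np.abs(z).max()), 1e-8) / 255.0
            self.codes[i] = np.clip(np.round(z / s), 0, 255).astype(np.uint8)
            self.scales[i] = s
            self.labels[i] = y
            self.count += 1

    def sample(self, n: int):
        """Dequantize a random batch of stored latents for replay."""
        hi = min(self.count, self.codes.shape[0])
        idx = np.random.randint(0, hi, size=n)
        return self.codes[idx].astype(np.float32) * self.scales[idx], self.labels[idx]

# Usage: mix replayed latents with the latents of new data before the trainable back-end.
buf = QuantizedLatentReplayBuffer(capacity=4096, latent_dim=512)
buf.add(np.random.rand(32, 512).astype(np.float32), np.random.randint(0, 10, 32))
replay_z, replay_y = buf.sample(16)
print("replay memory (bytes):", buf.codes.nbytes)  # 8-bit codes: 4x smaller than fp32 latents
```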
arXiv Detail & Related papers (2021-10-20T11:01:23Z) - CAP-RAM: A Charge-Domain In-Memory Computing 6T-SRAM for Accurate and Precision-Programmable CNN Inference [27.376343943107788]
CAP-RAM is a compact, accurate, and bitwidth-programmable in-memory computing (IMC) static random-access memory (SRAM) macro.
It is presented for energy-efficient convolutional neural network (CNN) inference.
A 65-nm prototype validates the excellent linearity and computing accuracy of CAP-RAM.
arXiv Detail & Related papers (2021-07-06T04:59:16Z)