FERMI-ML: A Flexible and Resource-Efficient Memory-In-Situ SRAM Macro for TinyML acceleration
- URL: http://arxiv.org/abs/2511.12544v1
- Date: Sun, 16 Nov 2025 10:39:42 GMT
- Title: FERMI-ML: A Flexible and Resource-Efficient Memory-In-Situ SRAM Macro for TinyML acceleration
- Authors: Mukul Lokhande, Akash Sankhe, S. V. Jaya Chand, Santosh Kumar Vishvakarma
- Abstract summary: FERMI-ML is a Memory-In-Situ macro capable of supporting mixed-precision TinyML workloads. Results at 65 nm show operation at 350 MHz with 0.9 V, delivering a throughput of 1.93 TOPS and an energy efficiency of 364 TOPS/W.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The growing demand for low-power and area-efficient TinyML inference on AIoT devices necessitates memory architectures that minimise data movement while sustaining high computational efficiency. This paper presents FERMI-ML, a Flexible and Resource-Efficient Memory-In-Situ (MIS) SRAM macro designed for TinyML acceleration. The proposed 9T XNOR-based RX9T bit-cell integrates a 5T storage cell with a 4T XNOR compute unit, enabling variable-precision MAC and CAM operations within the same array. A 22-transistor (C22T) compressor-tree-based accumulator facilitates logarithmic 1-64-bit MAC computation with reduced delay and power compared to conventional adder trees. The 4 KB macro achieves dual functionality for in-situ computation and CAM-based lookup operations, supporting Posit-4 or FP-4 precision. Post-layout results at 65 nm show operation at 350 MHz with 0.9 V, delivering a throughput of 1.93 TOPS and an energy efficiency of 364 TOPS/W, while maintaining a Quality-of-Result (QoR) above 97.5% with InceptionV4 and ResNet-18. FERMI-ML thus demonstrates a compact, reconfigurable, and energy-aware digital Memory-In-Situ macro capable of supporting mixed-precision TinyML workloads.
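The arithmetic behind an XNOR-based bit-cell feeding a compressor-tree accumulator can be sketched in software: for binarized (+1/-1) operands a MAC reduces to an XNOR followed by a population count, and wider precisions can be composed from single-bit partial products weighted by shift-and-add. The sketch below is only a behavioural illustration of those identities, assuming a bit-plane decomposition for the multi-bit case; it is not the RX9T/C22T circuit, and the function names are made up for illustration.

```python
import numpy as np

def xnor_popcount_dot(a_bits: np.ndarray, b_bits: np.ndarray) -> int:
    """Binary dot product via XNOR + popcount.

    a_bits, b_bits hold 0/1 codes for -1/+1 operands; for such vectors
    dot(a, b) = 2 * popcount(XNOR(a, b)) - N.
    """
    matches = int(np.count_nonzero(~(a_bits ^ b_bits) & 1))  # popcount of XNOR
    return 2 * matches - a_bits.size

def bit_plane_mac(acts: np.ndarray, wts: np.ndarray, act_bits: int, wt_bits: int) -> int:
    """Multi-bit MAC composed from single-bit partial products (shift-and-add).

    Unsigned operands are split into bit planes; each plane pair contributes a
    popcount of ANDed bits, weighted by its power of two.
    """
    acc = 0
    for i in range(act_bits):
        a_plane = (acts >> i) & 1
        for j in range(wt_bits):
            w_plane = (wts >> j) & 1
            acc += int(np.count_nonzero(a_plane & w_plane)) << (i + j)
    return acc

rng = np.random.default_rng(0)
a = rng.integers(0, 2, size=64)
b = rng.integers(0, 2, size=64)
assert xnor_popcount_dot(a, b) == int(np.dot(2 * a - 1, 2 * b - 1))

acts = rng.integers(0, 16, size=64)  # 4-bit unsigned activations
wts = rng.integers(0, 16, size=64)   # 4-bit unsigned weights
assert bit_plane_mac(acts, wts, 4, 4) == int(np.dot(acts, wts))
```

In hardware the popcounts would come from the compressor tree rather than software loops; the point of the sketch is only the mapping from XNOR/AND partial products to a multi-bit MAC.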
Related papers
- LFM2 Technical Report [87.58431408281973]
We present LFM2, a family of Liquid Foundation Models designed for efficient on-device deployment and strong task capabilities. The LFM2 family covers 350M-8.3B parameters, including dense models (350M, 700M, 1.2B, 2.6B) and a mixture-of-experts variant (8.3B total, 1.5B active). We build multimodal and retrieval variants: LFM2-VL for vision-language tasks, LFM2-Audio for speech, and LFM2-ColBERT for retrieval.
arXiv Detail & Related papers (2025-11-28T17:56:35Z) - MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models [58.3342517278868]
This paper describes the design of Mixed-precision AutoRegressive LINear kernels.
It shows that batch sizes up to 16-32 can be supported with close to the maximum ($4\times$) quantization speedup.
MARLIN accomplishes this via a combination of techniques, such as asynchronous memory access, complex task scheduling and pipelining.
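The "close to the maximum ($4\times$)" figure reflects that small-batch linear layers are memory-bound: packing weights into 4-bit codes moves roughly a quarter of the bytes that fp16 would, which caps the attainable speedup at about 4x. The sketch below only illustrates that storage-format arithmetic with a single symmetric scale; it is not MARLIN's fused GPU kernel, and the helper names and quantizer are assumptions for illustration.

```python
import numpy as np

def pack_int4(q: np.ndarray) -> np.ndarray:
    """Pack unsigned 4-bit codes (values 0..15) two per byte: 4x smaller than fp16."""
    q = q.astype(np.uint8)
    return (q[0::2] | (q[1::2] << 4)).astype(np.uint8)

def unpack_int4(packed: np.ndarray) -> np.ndarray:
    """Recover the 4-bit codes from the packed byte stream."""
    lo = packed & 0x0F
    hi = packed >> 4
    return np.stack([lo, hi], axis=1).reshape(-1)

# Quantize fp16 weights to 4-bit codes with one symmetric scale (illustration only).
w = np.random.randn(4096).astype(np.float16)
scale = float(np.abs(w).max()) / 7.0
codes = np.clip(np.round(w.astype(np.float32) / scale), -8, 7).astype(np.int8) + 8  # offset-binary

packed = pack_int4(codes.astype(np.uint8))
restored = (unpack_int4(packed).astype(np.float32) - 8) * scale

print(f"fp16 bytes: {w.nbytes}, packed int4 bytes: {packed.nbytes}")  # ~4x reduction
print(f"max dequantization error: {np.max(np.abs(restored - w.astype(np.float32))):.4f}")
```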
arXiv Detail & Related papers (2024-08-21T16:10:41Z) - Fast Matrix Multiplications for Lookup Table-Quantized LLMs [58.11584672945781]
FLUTE is a flexible lookup table engine for LUT-quantized LLMs. At batch sizes below 32 and a quantization group size of 128, the FLUTE kernel can be 2-4x faster than existing GEMM kernels.
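The core idea of a lookup-table engine is that non-uniform quantization stores each weight as a small code indexing a per-group table of reconstruction values, so dequantization is a table gather rather than arithmetic. The sketch below shows that reconstruction step in plain numpy; the group size of 128 mirrors the figure quoted above, while the codebook construction and helper names are illustrative assumptions rather than FLUTE's actual kernel.

```python
import numpy as np

GROUP = 128  # quantization group size, as in the quoted configuration

def lut_dequantize(codes: np.ndarray, tables: np.ndarray) -> np.ndarray:
    """Reconstruct weights from 4-bit codes via per-group lookup tables.

    codes:  (n_groups, GROUP) uint8 values in 0..15
    tables: (n_groups, 16) float reconstruction values, one table per group
    """
    return np.take_along_axis(tables, codes.astype(np.int64), axis=1)

# Toy example: one row of weights split into groups of 128.
rng = np.random.default_rng(1)
w = rng.standard_normal(1024).astype(np.float32).reshape(-1, GROUP)

# Illustrative non-uniform codebook per group: 16 quantile-based levels.
tables = np.quantile(w, np.linspace(0.0, 1.0, 16), axis=1).T.astype(np.float32)
codes = np.abs(w[:, :, None] - tables[:, None, :]).argmin(axis=2).astype(np.uint8)

w_hat = lut_dequantize(codes, tables)
print("mean abs reconstruction error:", float(np.mean(np.abs(w_hat - w))))
# A LUT-quantized matmul would then use w_hat (or fuse the gather into the GEMM on GPU).
```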
arXiv Detail & Related papers (2024-07-15T17:55:42Z) - A 137.5 TOPS/W SRAM Compute-in-Memory Macro with 9-b Memory Cell-Embedded ADCs and Signal Margin Enhancement Techniques for AI Edge Applications [20.74979295607707]
The CIM macro can perform 4x4-bit MAC operations and yield a 9-bit signed output.
Inherent discharge branches of the cells are utilized to apply time-modulated MAC and 9-bit ADC readout operations.
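Because the time-modulated MAC and the ADC readout are analog mechanisms, the simplest way to see what the macro computes is a digital reference model: a column of signed 4-bit-by-4-bit products is accumulated and reported as a 9-bit signed value. The sketch below is such a behavioural reference only; the column length and the saturating readout are assumptions for illustration, not details from the paper.

```python
import numpy as np

def cim_column_mac(acts: np.ndarray, wts: np.ndarray) -> int:
    """Behavioural model of one CIM column: signed 4b x 4b MAC with a 9-bit signed readout.

    acts, wts: signed 4-bit values in [-8, 7]. Accumulation length and clipping
    behaviour are illustrative assumptions.
    """
    assert acts.min() >= -8 and acts.max() <= 7
    assert wts.min() >= -8 and wts.max() <= 7
    acc = int(np.dot(acts.astype(np.int32), wts.astype(np.int32)))
    # 9-bit signed readout: representable range is [-256, 255]; saturate outside it.
    return max(-256, min(255, acc))

rng = np.random.default_rng(2)
acts = rng.integers(-8, 8, size=16)  # 16-row column (assumed length)
wts = rng.integers(-8, 8, size=16)
print("9-bit signed MAC readout:", cim_column_mac(acts, wts))
```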
arXiv Detail & Related papers (2023-07-12T06:20:19Z) - SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression [76.73007709690306]
We introduce the Sparse-Quantized Representation (SpQR), a new compressed format and quantization technique.
SpQR achieves relative accuracy losses of less than 1% in perplexity for highly-accurate LLaMA and Falcon LLMs.
This makes it possible to run a 33B-parameter LLM on a single 24 GB consumer GPU without any performance degradation, while providing a 15% speedup.
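The format name describes the decomposition: most weights live in a low-bit quantized dense matrix, while the few outliers that quantize poorly are kept at full precision in a sparse side structure and patched back in at reconstruction. The sketch below illustrates that split on a toy matrix; the 1% outlier fraction, 3-bit codes, and per-row scales are illustrative assumptions rather than SpQR's exact layout.

```python
import numpy as np

def spqr_like_split(w: np.ndarray, outlier_frac: float = 0.01, bits: int = 3):
    """Split weights into a low-bit quantized dense part plus sparse fp32 outliers."""
    thresh = np.quantile(np.abs(w), 1.0 - outlier_frac)
    outlier_idx = np.flatnonzero(np.abs(w).ravel() >= thresh)
    outlier_val = w.ravel()[outlier_idx].copy()

    base = w.copy().ravel()
    base[outlier_idx] = 0.0       # outliers are removed before quantization
    base = base.reshape(w.shape)

    # Per-row symmetric quantization of the non-outlier part.
    qmax = 2 ** (bits - 1) - 1
    scales = np.maximum(np.abs(base).max(axis=1, keepdims=True), 1e-8) / qmax
    codes = np.clip(np.round(base / scales), -qmax - 1, qmax).astype(np.int8)
    return codes, scales, (outlier_idx, outlier_val)

def spqr_like_reconstruct(codes, scales, outliers, shape):
    idx, val = outliers
    w_hat = (codes.astype(np.float32) * scales).ravel()
    w_hat[idx] = val              # outliers overwrite the quantized values at full precision
    return w_hat.reshape(shape)

w = np.random.default_rng(3).standard_normal((128, 128)).astype(np.float32)
codes, scales, outliers = spqr_like_split(w)
w_hat = spqr_like_reconstruct(codes, scales, outliers, w.shape)
print("mean abs error:", float(np.mean(np.abs(w_hat - w))))
print("weights kept in fp32:", len(outliers[1]), "of", w.size)
```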
arXiv Detail & Related papers (2023-06-05T17:53:28Z) - RedMulE: A Mixed-Precision Matrix-Matrix Operation Engine for Flexible and Energy-Efficient On-Chip Linear Algebra and TinyML Training Acceleration [15.869673535117032]
Current training algorithms rely on floating-point matrix operations to meet the precision and dynamic range requirements.
RedMulE is a low-power specialized accelerator conceived for multi-precision floating-point General Matrix-Matrix Operations (GEMM-Ops) acceleration.
RedMulE achieves up to 58.5 GFLOPS and 117 GFLOPS for FP16 and FP8, respectively, with 99.4% utilization of the array of Computing Elements.
arXiv Detail & Related papers (2023-01-10T11:07:16Z) - A Charge Domain P-8T SRAM Compute-In-Memory with Low-Cost DAC/ADC Operation for 4-bit Input Processing [4.054285623919103]
This paper presents a low cost PMOS-based 8T (P-8T) Compute-In-Memory (CIM) architecture.
It efficiently performs the multiply-accumulate (MAC) operations between 4-bit input activations and 8-bit weights.
The 256x80 P-8T CIM macro implemented in a 28 nm CMOS process shows accuracies of 91.46% and 66.67%.
arXiv Detail & Related papers (2022-11-29T08:15:27Z) - A 65nm 8b-Activation 8b-Weight SRAM-Based Charge-Domain Computing-in-Memory Macro Using A Fully-Parallel Analog Adder Network and A Single-ADC Interface [16.228299091691873]
Computing-in-memory (CiM) is a promising mitigation approach by enabling multiply-accumulate operations within the memory.
This work achieves 51.2 GOPS throughput and 10.3 TOPS/W energy efficiency, while showing 88.6% accuracy on the CIFAR-10 dataset.
arXiv Detail & Related papers (2022-11-23T07:52:10Z) - LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale [80.86029795281922]
We develop a procedure for Int8 matrix multiplication for feed-forward and attention projection layers in transformers.
A 175B parameter 16/32-bit checkpoint can be loaded, converted to Int8, and used immediately without performance degradation.
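The "loaded, converted to Int8, and used immediately" workflow rests on post-training vector-wise absmax quantization: each row gets its own scale onto the int8 grid, the matmul accumulates in int32, and the scales are undone afterwards, with the original method additionally keeping outlier feature dimensions in fp16. The sketch below shows only the absmax part in numpy and omits the outlier decomposition, so it is a simplification of, not a substitute for, the paper's procedure.

```python
import numpy as np

def absmax_quantize_rows(m: np.ndarray):
    """Row-wise (vector-wise) absmax quantization to int8, returning codes and scales."""
    scales = np.maximum(np.abs(m).max(axis=1, keepdims=True), 1e-8) / 127.0
    q = np.clip(np.round(m / scales), -127, 127).astype(np.int8)
    return q, scales.astype(np.float32)

def int8_linear(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """y = x @ w.T with both operands quantized to int8 and accumulated in int32."""
    qx, sx = absmax_quantize_rows(x)   # one scale per input row
    qw, sw = absmax_quantize_rows(w)   # one scale per output feature (row of w)
    acc = qx.astype(np.int32) @ qw.astype(np.int32).T
    return acc.astype(np.float32) * sx * sw.T  # undo both sets of scales

rng = np.random.default_rng(4)
w = rng.standard_normal((256, 512)).astype(np.float32) * 0.02
x = rng.standard_normal((8, 512)).astype(np.float32)
y_ref = x @ w.T
y_q = int8_linear(x, w)
print("max abs deviation from fp32:", float(np.max(np.abs(y_q - y_ref))))
```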
arXiv Detail & Related papers (2022-08-15T17:08:50Z) - A TinyML Platform for On-Device Continual Learning with Quantized Latent Replays [66.62377866022221]
Latent Replay-based Continual Learning (CL) techniques enable online, serverless adaptation in principle.
We introduce a HW/SW platform for end-to-end CL based on a 10-core FP32-enabled parallel ultra-low-power processor.
Our results show that by combining these techniques, continual learning can be achieved in practice using less than 64MB of memory.
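Latent replay avoids storing raw inputs: activations from a frozen front-end are captured once, quantized to shrink the replay memory, and mixed into each new minibatch so that only the trainable back-end is updated. The sketch below shows the replay-buffer bookkeeping and the saving from 8-bit storage; the buffer size, latent width, and quantizer are illustrative assumptions, not the platform described in the paper.

```python
import numpy as np

class QuantizedLatentReplayBuffer:
    """Stores front-end activations as uint8 so replay memory stays small."""

    def __init__(self, capacity: int, latent_dim: int):
        self.codes = np.zeros((capacity, latent_dim), dtype=np.uint8)
        self.labels = np.zeros(capacity, dtype=np.int64)
        self.scales = np.zeros((capacity, 1), dtype=np.float32)
        self.count = 0

    def add(self, latents: np.ndarray, labels: np.ndarray) -> None:
        """Quantize per sample with an absmax scale and store (ring-buffer overwrite)."""
        for z, y in zip(latents, labels):
            i = self.count % self.codes.shape[0]
            s = max(float(np.abs(z).max()), 1e-8) / 255.0
            self.codes[i] = np.clip(np.round(z / s), 0, 255).astype(np.uint8)
            self.scales[i] = s
            self.labels[i] = y
            self.count += 1

    def sample(self, n: int):
        """Dequantize a random batch of stored latents for replay."""
        hi = min(self.count, self.codes.shape[0])
        idx = np.random.randint(0, hi, size=n)
        return self.codes[idx].astype(np.float32) * self.scales[idx], self.labels[idx]

# Usage: mix replayed latents with the latents of new data before the trainable back-end.
buf = QuantizedLatentReplayBuffer(capacity=4096, latent_dim=512)
buf.add(np.random.rand(32, 512).astype(np.float32), np.random.randint(0, 10, 32))
replay_z, replay_y = buf.sample(16)
print("replay memory (bytes):", buf.codes.nbytes)  # 8-bit codes: 4x smaller than fp32 latents
```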
arXiv Detail & Related papers (2021-10-20T11:01:23Z) - CAP-RAM: A Charge-Domain In-Memory Computing 6T-SRAM for Accurate and Precision-Programmable CNN Inference [27.376343943107788]
CAP-RAM is a compact, accurate, and bitwidth-programmable in-memory computing (IMC) static random-access memory (SRAM) macro.
It is presented for energy-efficient convolutional neural network (CNN) inference.
A 65-nm prototype validates the excellent linearity and computing accuracy of CAP-RAM.
arXiv Detail & Related papers (2021-07-06T04:59:16Z)