Analog Foundation Models
- URL: http://arxiv.org/abs/2505.09663v2
- Date: Fri, 16 May 2025 15:24:45 GMT
- Title: Analog Foundation Models
- Authors: Julian Büchel, Iason Chalas, Giovanni Acampa, An Chen, Omobayode Fagbohungbe, Sidney Tsai, Kaoutar El Maghraoui, Manuel Le Gallo, Abbas Rahimi, Abu Sebastian,
- Abstract summary: Analog in-memory computing (AIMC) is a promising compute paradigm to improve speed and power efficiency of neural network computations.<n>AIMC introduces fundamental challenges such as noisy computations and strict inference on input and quantization.<n>We introduce a general scalable method to robustly adapt and execute on low-precision analog hardware.
- Score: 6.589590906512612
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Analog in-memory computing (AIMC) is a promising compute paradigm to improve speed and power efficiency of neural network inference beyond the limits of conventional von Neumann-based architectures. However, AIMC introduces fundamental challenges such as noisy computations and strict constraints on input and output quantization. Because of these constraints and imprecisions, off-the-shelf LLMs are not able to achieve 4-bit-level performance when deployed on AIMC-based hardware. While researchers previously investigated recovering this accuracy gap on small, mostly vision-based models, a generic method applicable to LLMs pre-trained on trillions of tokens does not yet exist. In this work, we introduce a general and scalable method to robustly adapt LLMs for execution on noisy, low-precision analog hardware. Our approach enables state-of-the-art models $\unicode{x2013}$ including Phi-3-mini-4k-instruct and Llama-3.2-1B-Instruct $\unicode{x2013}$ to retain performance comparable to 4-bit weight, 8-bit activation baselines, despite the presence of analog noise and quantization constraints. Additionally, we show that as a byproduct of our training methodology, analog foundation models can be quantized for inference on low-precision digital hardware. Finally, we show that our models also benefit from test-time compute scaling, showing better scaling behavior than models trained with 4-bit weight and 8-bit static input quantization. Our work bridges the gap between high-capacity LLMs and efficient analog hardware, offering a path toward energy-efficient foundation models. Code is available at https://github.com/IBM/analog-foundation-models.
Related papers
- Scaling Probabilistic Circuits via Monarch Matrices [109.65822339230853]
Probabilistic Circuits (PCs) are tractable representations of probability distributions.<n>We propose a novel sparse and structured parameterization for the sum blocks in PCs.
arXiv Detail & Related papers (2025-06-14T07:39:15Z) - Running Conventional Automatic Speech Recognition on Memristor Hardware: A Simulated Approach [18.47703842449581]
We show how an ML system with millions of parameters would behave on memristor hardware.<n>We limit the relative degradation in word error rate to 25% when using a 3-bit weight precision to execute linear operations.
arXiv Detail & Related papers (2025-05-30T15:42:41Z) - QuartDepth: Post-Training Quantization for Real-Time Depth Estimation on the Edge [55.75103034526652]
We propose QuartDepth which adopts post-training quantization to quantize MDE models with hardware accelerations for ASICs.<n>Our approach involves quantizing both weights and activations to 4-bit precision, reducing the model size and computation cost.<n>We design a flexible and programmable hardware accelerator by supporting kernel fusion and customized instruction programmability.
arXiv Detail & Related papers (2025-03-20T21:03:10Z) - SplitQuantV2: Enhancing Low-Bit Quantization of LLMs Without GPUs [10.036727981085223]
SplitQuantV2 is an innovative algorithm designed to enhance low-bit linear quantization of large language models.<n>It can achieve results comparable to those of advanced algorithms.
arXiv Detail & Related papers (2025-03-07T14:59:07Z) - RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models [53.571195477043496]
We propose an algorithm named Rotated Straight-Through-Estimator (RoSTE)<n>RoSTE combines quantization-aware supervised fine-tuning (QA-SFT) with an adaptive rotation strategy to reduce activation outliers.<n>Our findings reveal that the prediction error is directly proportional to the quantization error of the converged weights, which can be effectively managed through an optimized rotation configuration.
arXiv Detail & Related papers (2025-02-13T06:44:33Z) - QuEST: Stable Training of LLMs with 1-Bit Weights and Activations [27.644652093888745]
QuEST is a new method for training sparse or quantized language models.<n>We show that QuEST induces stable scaling laws across the entire range of hardware-supported precisions.<n>We provide GPU kernel support showing that models produced by QuEST can be executed efficiently.
arXiv Detail & Related papers (2025-02-07T15:23:34Z) - LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit [55.73370804397226]
Quantization, a key compression technique, can effectively mitigate these demands by compressing and accelerating large language models.
We present LLMC, a plug-and-play compression toolkit, to fairly and systematically explore the impact of quantization.
Powered by this versatile toolkit, our benchmark covers three key aspects: calibration data, algorithms (three strategies), and data formats.
arXiv Detail & Related papers (2024-05-09T11:49:05Z) - QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language
Models [57.04178959678024]
We show that the majority of inference computations for large generative models can be performed with both weights and activations being cast to 4 bits.
We achieve this via a hybrid quantization strategy called QUIK, which compresses most of the weights and activations to 4-bit.
We provide GPU kernels matching the QUIK format with highly-efficient layer-wise runtimes, which lead to practical end-to-end throughput improvements of up to 3.4x.
arXiv Detail & Related papers (2023-10-13T17:15:05Z) - SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
Main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, for single batch inference.
We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions of up to 3-bit.
Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format.
arXiv Detail & Related papers (2023-06-13T08:57:54Z) - AnalogNAS: A Neural Network Design Framework for Accurate Inference with
Analog In-Memory Computing [7.596833322764203]
Inference at the edge requires low latency, compact and power-efficient models.
analog/mixed signal in-memory computing hardware accelerators can easily transcend the memory wall of von Neuman architectures.
We propose AnalogNAS, a framework for automated Deep Neural Network (DNN) design targeting deployment on analog In-Memory Computing (IMC) inference accelerators.
arXiv Detail & Related papers (2023-05-17T07:39:14Z) - DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures
using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup table based approach for the execution of ultra low-precision convolutional neural networks on SIMD hardware.
Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
arXiv Detail & Related papers (2023-04-18T15:13:10Z) - Neural Network Quantization with AI Model Efficiency Toolkit (AIMET) [15.439669159557253]
We present an overview of neural network quantization using AI Model Efficiency Toolkit (AIMET)
AIMET is a library of state-of-the-art quantization and compression algorithms designed to ease the effort required for model optimization.
We provide a practical guide to quantization via AIMET by covering PTQ and QAT, code examples and practical tips.
arXiv Detail & Related papers (2022-01-20T20:35:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.