Related papers: FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration

FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration

URL: http://arxiv.org/abs/2411.14733v1
Date: Fri, 22 Nov 2024 05:01:35 GMT
Title: FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration
Authors: Donghyeon Yi, Seoyoung Lee, Jongho Kim, Junyoung Kim, Sohmyung Ha, Ik Joon Chang, Minkyu Je,
Abstract summary: We propose an AMS-PiM architecture that eliminates DQ-Q processes, introduces FPU- and division-free nonlinear processing, and employs a low-ENOB-ADC-based sparse Matrix Vector multiplication technique. RAP improves error resiliency, area/energy efficiency, and computational speed while preserving numerical stability.
Score: 7.37841083168521
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Encoder-based transformers, powered by self-attention layers, have revolutionized machine learning with their context-aware representations. However, their quadratic growth in computational and memory demands presents significant bottlenecks. Analog-Mixed-Signal Process-in-Memory (AMS-PiM) architectures address these challenges by enabling efficient on-chip processing. Traditionally, AMS-PiM relies on Quantization-Aware Training (QAT), which is hardware-efficient but requires extensive retraining to adapt models to AMS-PiMs, making it increasingly impractical for transformer models. Post-Training Quantization (PTQ) mitigates this training overhead but introduces significant hardware inefficiencies. PTQ relies on dequantization-quantization (DQ-Q) processes, floating-point units (FPUs), and high-ENOB (Effective Number of Bits) analog-to-digital converters (ADCs). Particularly, High-ENOB ADCs scale exponentially in area and energy ($2^{ENOB}$), reduce sensing margins, and increase susceptibility to process, voltage, and temperature (PVT) variations, further compounding PTQ's challenges in AMS-PiM systems. To overcome these limitations, we propose RAP, an AMS-PiM architecture that eliminates DQ-Q processes, introduces FPU- and division-free nonlinear processing, and employs a low-ENOB-ADC-based sparse Matrix Vector multiplication technique. Using the proposed techniques, RAP improves error resiliency, area/energy efficiency, and computational speed while preserving numerical stability. Experimental results demonstrate that RAP outperforms state-of-the-art GPUs and conventional PiM architectures in energy efficiency, latency, and accuracy, making it a scalable solution for the efficient deployment of transformers.

Related papers

Pimba: A Processing-in-Memory Acceleration for Post-Transformer Large Language Model Serving [1.9508863993381267]
Transformers are the driving force behind today's Large Language Models (LLMs), serving as the foundation for their performance and versatility.<n>In response, the algorithm community is exploring alternative architectures, such as state space models (SSMs), linear attention, and recurrent neural networks (RNNs)
arXiv Detail & Related papers (2025-07-14T11:40:17Z)
FIMA-Q: Post-Training Quantization for Vision Transformers by Fisher Information Matrix Approximation [55.12070409045766]
Post-training quantization (PTQ) has stood out as a cost-effective and promising model compression paradigm in recent years.<n>Current PTQ methods for Vision Transformers (ViTs) still suffer from significant accuracy degradation, especially under low-bit quantization.
arXiv Detail & Related papers (2025-06-13T07:57:38Z)
QuartDepth: Post-Training Quantization for Real-Time Depth Estimation on the Edge [55.75103034526652]
We propose QuartDepth which adopts post-training quantization to quantize MDE models with hardware accelerations for ASICs. Our approach involves quantizing both weights and activations to 4-bit precision, reducing the model size and computation cost. We design a flexible and programmable hardware accelerator by supporting kernel fusion and customized instruction programmability.
arXiv Detail & Related papers (2025-03-20T21:03:10Z)
Pilot Contamination Aware Transformer for Downlink Power Control in Cell-Free Massive MIMO Networks [45.487183737784086]
This paper introduces the pilot contamination-aware power control (PAPC) transformer neural network. PAPC integrates pilot allocation data into the network, effectively handling pilot contamination scenarios. Trained in an unsupervised learning framework, PAPC is evaluated against the accelerated proximal gradient (APG) algorithm.
arXiv Detail & Related papers (2024-11-28T09:48:52Z)
SLaNC: Static LayerNorm Calibration [1.2016264781280588]
Quantization to lower precision formats naturally poses a number of challenges caused by the limited range of the available value representations. In this article, we propose a computationally-efficient scaling technique that can be easily applied to Transformer models during inference. Our method suggests a straightforward way of scaling the LayerNorm inputs based on the static weights of the immediately preceding linear layers.
arXiv Detail & Related papers (2024-10-14T14:32:55Z)
Accelerating Error Correction Code Transformers [56.75773430667148]
We introduce a novel acceleration method for transformer-based decoders. We achieve a 90% compression ratio and reduce arithmetic operation energy consumption by at least 224 times on modern hardware.
arXiv Detail & Related papers (2024-10-08T11:07:55Z)
PointMT: Efficient Point Cloud Analysis with Hybrid MLP-Transformer Architecture [46.266960248570086]
This study tackles the quadratic complexity of the self-attention mechanism by introducing a complexity local attention mechanism for effective feature aggregation. We also introduce a parameter-free channel temperature adaptation mechanism that adaptively adjusts the attention weight distribution in each channel. We show that PointMT achieves performance comparable to state-of-the-art methods while maintaining an optimal balance between performance and accuracy.
arXiv Detail & Related papers (2024-08-10T10:16:03Z)
AdaLog: Post-Training Quantization for Vision Transformers with Adaptive Logarithm Quantizer [54.713778961605115]
Vision Transformer (ViT) has become one of the most prevailing fundamental backbone networks in the computer vision community. We propose a novel non-uniform quantizer, dubbed the Adaptive Logarithm AdaLog (AdaLog) quantizer.
arXiv Detail & Related papers (2024-07-17T18:38:48Z)
Towards Next-Level Post-Training Quantization of Hyper-Scale Transformers [10.566264033360282]
Post-training quantization (PTQ) has emerged as a promising solution for deploying hyper-scale models on edge devices such as mobile and TVs. In this paper, we propose a novel PTQ algorithm that balances accuracy and efficiency.
arXiv Detail & Related papers (2024-02-14T05:58:43Z)
Transforming Image Super-Resolution: A ConvFormer-based Efficient Approach [58.57026686186709]
We introduce the Convolutional Transformer layer (ConvFormer) and propose a ConvFormer-based Super-Resolution network (CFSR) CFSR inherits the advantages of both convolution-based and transformer-based approaches. Experiments demonstrate that CFSR strikes an optimal balance between computational cost and performance.
arXiv Detail & Related papers (2024-01-11T03:08:00Z)
Heterogeneous Integration of In-Memory Analog Computing Architectures with Tensor Processing Units [0.0]
This paper introduces a novel, heterogeneous, mixed-signal, and mixed-precision architecture that integrates an IMAC unit with an edge TPU to enhance mobile CNN performance. We propose a unified learning algorithm that incorporates mixed-precision training techniques to mitigate potential accuracy drops when deploying models on the TPU-IMAC architecture.
arXiv Detail & Related papers (2023-04-18T19:44:56Z)
Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization for maximal data reuse across different tasks. We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z)
Collaborative Intelligent Reflecting Surface Networks with Multi-Agent Reinforcement Learning [63.83425382922157]
Intelligent reflecting surface (IRS) is envisioned to be widely applied in future wireless networks. In this paper, we investigate a multi-user communication system assisted by cooperative IRS devices with the capability of energy harvesting.
arXiv Detail & Related papers (2022-03-26T20:37:14Z)

This list is automatically generated from the titles and abstracts of the papers in this site.