TeLLMe: An Energy-Efficient Ternary LLM Accelerator for Prefilling and Decoding on Edge FPGAs
- URL: http://arxiv.org/abs/2504.16266v2
- Date: Thu, 24 Apr 2025 18:14:11 GMT
- Title: TeLLMe: An Energy-Efficient Ternary LLM Accelerator for Prefilling and Decoding on Edge FPGAs
- Authors: Ye Qiao, Zhiheng Chen, Yifan Zhang, Yian Wang, Sitao Huang
- Abstract summary: TeLLMe is the first ternary LLM accelerator for low-power FPGAs. It supports both prefill and autoregressive decoding using 1.58-bit weights and 8-bit activations. Under a 7W power budget, TeLLMe delivers up to 9 tokens/s throughput over 1,024-token contexts.
- Score: 5.889337608109388
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deploying large language models (LLMs) on edge platforms is challenged by their high computational and memory demands. Although recent low-bit quantization methods (e.g., BitNet, DeepSeek) compress weights to as little as 1.58 bits with minimal accuracy loss, edge deployment is still constrained by limited on-chip resources, power budgets, and the often-neglected latency of the prefill phase. We present TeLLMe, the first ternary LLM accelerator for low-power FPGAs (e.g., AMD KV260) that fully supports both prefill and autoregressive decoding using 1.58-bit weights and 8-bit activations. Our contributions include: (1) a table-lookup matrix engine for ternary matmul that merges grouped activations with online precomputation to minimize resource use; (2) a fused, bandwidth-efficient attention module featuring a reversed reordering scheme to accelerate prefill; and (3) a tightly integrated normalization and quantization--dequantization unit optimized for ultra-low-bit inference. Under a 7W power budget, TeLLMe delivers up to 9 tokens/s throughput over 1,024-token contexts and prefill latencies of 0.55--1.15 s for 64--128 token prompts, marking a significant energy-efficiency advance and establishing a new edge FPGA benchmark for generative AI.
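The table-lookup matrix engine of contribution (1) exploits the fact that with weights restricted to {-1, 0, +1}, a small group of G activations admits only 3^G possible weighted sums, which can be precomputed online and then indexed by packed weight codes. The NumPy sketch below illustrates that general idea only; the group size G, the base-3 weight encoding, and the software loop structure are illustrative assumptions, not TeLLMe's actual hardware pipeline.

```python
import numpy as np
from itertools import product

G = 4  # activation group size (assumed for illustration; not TeLLMe's actual grouping)

def build_lut(x_group):
    """Online precomputation: the weighted sum of one activation group
    for every possible ternary weight pattern (3**G entries)."""
    patterns = np.array(list(product([-1, 0, 1], repeat=G)))  # (3**G, G)
    return patterns @ x_group                                  # (3**G,)

def ternary_matvec_lut(W, x):
    """y = W @ x with ternary W in {-1, 0, +1}, using per-group lookup tables."""
    out_f, in_f = W.shape
    assert in_f % G == 0
    pow3 = 3 ** np.arange(G - 1, -1, -1)  # base-3 digit weights, matching itertools order
    y = np.zeros(out_f, dtype=np.int64)
    for g in range(in_f // G):
        lut = build_lut(x[g * G:(g + 1) * G])         # built online, once per activation group
        codes = (W[:, g * G:(g + 1) * G] + 1) @ pow3  # map {-1,0,1} -> {0,1,2}, pack as base-3 index
        y += lut[codes]                               # one table lookup replaces G MACs per output row
    return y

# quick consistency check against a direct matmul
rng = np.random.default_rng(0)
W = rng.integers(-1, 2, size=(8, 16))
x = rng.integers(-128, 128, size=16)
assert np.array_equal(ternary_matvec_lut(W, x), W @ x)
```

On hardware, the per-group LUT would be filled by small adder trees and the base-3 decode folded into the weight unpacking logic; the NumPy loop only mirrors the dataflow.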
Related papers
- Runtime Tunable Tsetlin Machines for Edge Inference on eFPGAs [0.2294388534633318]
eFPGAs allow for the design of hardware accelerators for edge Machine Learning (ML) applications at a lower power budget. However, the limited eFPGA logic and memory significantly constrain compute capabilities and model size. The proposed eFPGA accelerator prioritizes minimal resource usage and flexibility for on-field recalibration over throughput.
arXiv Detail & Related papers (2025-02-10T12:49:22Z) - A Runtime-Adaptive Transformer Neural Network Accelerator on FPGAs [0.0]
ADAPTOR is a runtime-adaptive accelerator for dense matrix computations in transformer encoders and decoders on FPGAs. It incorporates efficient matrix tiling to distribute resources across FPGA platforms. It achieves a speedup of 1.7$\times$ to 2.25$\times$ compared to some state-of-the-art FPGA-based accelerators.
arXiv Detail & Related papers (2024-11-27T08:53:19Z) - Designing Efficient LLM Accelerators for Edge Devices [1.4128048241287314]
Large Language Models (LLMs) can be deployed on resource-constrained edge devices to reduce reliance on network connections and provide more privacy.
However, LLM inference remains highly compute- and memory-intensive, so designing new and efficient edge accelerators for it is crucial.
We propose SECDA-LLM, which utilizes the SECDA methodology to streamline the process of designing, integrating, and deploying efficient FPGA-based LLM accelerators.
arXiv Detail & Related papers (2024-08-01T11:06:05Z) - Fast Matrix Multiplications for Lookup Table-Quantized LLMs [58.11584672945781]
FLUTE is a flexible lookup table engine for LUT-quantized LLMs.
At a batch size of 32 and a quantization group size of 128, the FLUTE kernel can be 2-4x faster than existing GEMM kernels.
arXiv Detail & Related papers (2024-07-15T17:55:42Z) - HiRE: High Recall Approximate Top-$k$ Estimation for Efficient LLM
Inference [68.59839755875252]
HiRE comprises two novel components: (i) a compression scheme to cheaply predict top-$k$ rows/columns with high recall, followed by full computation restricted to the predicted subset, and (ii) DA-TOP-$k$: an efficient multi-device approximate top-$k$ operator.
We demonstrate that on a one-billion-parameter model, HiRE applied to both the softmax and the feedforward layers achieves almost matching pretraining and downstream accuracy, and speeds up inference latency by $1.47\times$ on a single TPUv5e device.
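As a rough illustration of the two-stage idea, the Python sketch below scores rows with a cheap low-rank projection, keeps an oversampled candidate set to preserve recall, and only then computes exact scores on that subset. The random projection and the `oversample`/`rank` knobs are stand-in assumptions; HiRE's actual compression scheme and the multi-device DA-TOP-$k$ operator are not reproduced here.

```python
import numpy as np

def approx_topk_rows(W, x, k, oversample=2, rank=16, seed=None):
    """Two-stage top-k selection: cheap approximate scoring proposes a
    high-recall candidate set, then exact scores are computed only for
    those candidates (illustrative stand-in, not HiRE's exact scheme)."""
    rng = np.random.default_rng(seed)
    n, d = W.shape
    P = rng.standard_normal((d, rank)) / np.sqrt(rank)  # shared sketch matrix
    W_lo = W @ P                                        # compressed weights (could be precomputed offline)
    scores_lo = W_lo @ (P.T @ x)                        # cheap approximate scores for all n rows
    m = min(n, oversample * k)
    cand = np.argpartition(scores_lo, -m)[-m:]          # oversampled, high-recall candidate set
    exact = W[cand] @ x                                 # full computation restricted to the subset
    top = cand[np.argsort(exact)[-k:][::-1]]            # final exact top-k, descending
    return top, W[top] @ x
```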
arXiv Detail & Related papers (2024-02-14T18:04:36Z) - BiLLM: Pushing the Limit of Post-Training Quantization for LLMs [53.31402059062365]
BiLLM is a groundbreaking 1-bit post-training quantization scheme tailored for pretrained large language models.
It achieves, for the first time, high-accuracy inference (e.g., 8.41 perplexity on LLaMA2-70B) with only 1.08-bit weights across various LLM families.
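For orientation, the sketch below shows only the textbook per-group sign binarization (W ≈ alpha · sign(W)) that 1-bit weight quantization starts from; BiLLM's salient-weight splitting and residual approximation, which push the average width to about 1.08 bits while retaining accuracy, are deliberately not implemented.

```python
import numpy as np

def binarize_weights(W, group_size=128):
    """Baseline per-group 1-bit quantization: W ~= alpha * sign(W),
    with alpha the mean absolute value per row and group.
    (Textbook baseline only; not BiLLM's full scheme.)"""
    out_f, in_f = W.shape
    Wq = np.empty_like(W, dtype=np.float64)
    for s in range(0, in_f, group_size):
        blk = W[:, s:s + group_size]
        alpha = np.mean(np.abs(blk), axis=1, keepdims=True)  # per-row, per-group scale
        Wq[:, s:s + group_size] = alpha * np.sign(blk)
    return Wq  # storage cost: 1 bit per weight plus one scale per row-group
```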
arXiv Detail & Related papers (2024-02-06T09:26:34Z) - KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization [67.74400574357472]
LLMs are seeing growing use in applications that require large context windows, and in that regime KV cache activations surface as the dominant contributor to memory consumption during inference.
Quantization is a promising approach for compressing KV cache activations; however, existing solutions fail to represent activations accurately in sub-4-bit precision.
Our work, KVQuant, facilitates low precision KV cache quantization by incorporating several novel methods.
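A minimal baseline helps locate where KVQuant improves on uniform schemes: the sketch below applies plain asymmetric uniform quantization to a cached key/value slice, with per-channel scaling along the sequence axis, which is commonly reported to suit keys better than per-token scaling. The bit-width, layout, and the absence of KVQuant's non-uniform datatypes and outlier handling are simplifying assumptions.

```python
import numpy as np

def quantize_kv(K, n_bits=4, axis=0):
    """Asymmetric uniform quantization of a KV-cache slice of shape
    (seq_len, head_dim). axis=0 gives per-channel scales; this is the
    uniform baseline, not KVQuant's full recipe."""
    qmax = 2 ** n_bits - 1
    kmin = K.min(axis=axis, keepdims=True)
    kmax = K.max(axis=axis, keepdims=True)
    scale = (kmax - kmin) / qmax
    scale = np.where(scale == 0, 1.0, scale)             # guard constant channels
    q = np.clip(np.round((K - kmin) / scale), 0, qmax)   # stored as n_bits-wide integers
    return q.astype(np.uint8), scale, kmin

def dequantize_kv(q, scale, kmin):
    return q * scale + kmin
```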
arXiv Detail & Related papers (2024-01-31T18:58:14Z) - A Cost-Efficient FPGA Implementation of Tiny Transformer Model using Neural ODE [0.8403582577557918]
Transformers have been adopted for image recognition tasks and shown to outperform CNNs and RNNs, but they suffer from high training cost and computational complexity.
We propose a lightweight hybrid model which uses Neural ODE as a backbone instead of ResNet.
The proposed model is deployed on a modest-sized FPGA device for edge computing.
arXiv Detail & Related papers (2024-01-05T09:32:39Z) - DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures
using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup-table-based approach for executing ultra-low-precision convolutional neural networks on SIMD hardware.
Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
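The lookup-table trick here is different from bit-serial kernels: with ultra-low-precision weights, every possible weight-level/activation product can be tabulated once per input and then gathered instead of multiplied. The NumPy sketch below emulates that gather for 2-bit weights; the code values, shapes, and the lack of real SIMD byte-shuffles are simplifications rather than DeepGEMM's actual kernel.

```python
import numpy as np

def lut_gemm_2bit(Wq, levels, x):
    """Lookup-table matvec for 2-bit weight codes.

    Wq:     (out_f, in_f) integer codes in {0, 1, 2, 3}
    levels: (4,) dequantized values represented by each code
    x:      (in_f,) activations

    All code/activation products are precomputed per input column; the
    inner step only gathers from that table (on SIMD hardware this
    gather would be a byte-shuffle instruction)."""
    prod_lut = np.outer(levels, x)                            # (4, in_f): every possible product
    gathered = prod_lut[Wq, np.arange(Wq.shape[1])]           # gathered[i, j] = prod_lut[Wq[i, j], j]
    return gathered.sum(axis=1)

# tiny check against explicit dequantize-then-matmul
rng = np.random.default_rng(1)
levels = np.array([-1.5, -0.5, 0.5, 1.5])
Wq = rng.integers(0, 4, size=(8, 32))
x = rng.standard_normal(32)
assert np.allclose(lut_gemm_2bit(Wq, levels, x), levels[Wq] @ x)
```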
arXiv Detail & Related papers (2023-04-18T15:13:10Z) - RedMule: A Mixed-Precision Matrix-Matrix Operation Engine for Flexible
and Energy-Efficient On-Chip Linear Algebra and TinyML Training Acceleration [15.869673535117032]
Current training algorithms rely on floating-point matrix operations to meet the precision and dynamic range requirements.
RedMulE is a low-power specialized accelerator for multi-precision floating-point General Matrix-Matrix Operations (GEMM-Ops).
RedMulE achieves up to 58.5 GFLOPS and 117 GFLOPS for FP16 and FP8, respectively, with 99.4% utilization of the array of Computing Elements.
arXiv Detail & Related papers (2023-01-10T11:07:16Z) - LL-GNN: Low Latency Graph Neural Networks on FPGAs for High Energy
Physics [45.666822327616046]
This work presents a novel reconfigurable architecture for Low Latency Graph Neural Network (LL-GNN) designs for particle detectors.
The LL-GNN design advances the next generation of trigger systems by enabling sophisticated algorithms to process experimental data efficiently.
arXiv Detail & Related papers (2022-09-28T12:55:35Z) - EdgeBERT: Sentence-Level Energy Optimizations for Latency-Aware
Multi-Task NLP Inference [82.1584439276834]
Transformer-based language models such as BERT provide significant accuracy improvement for a multitude of natural language processing (NLP) tasks.
We present EdgeBERT, an in-depth algorithm-hardware co-design for latency-aware energy optimization in multi-task NLP.
arXiv Detail & Related papers (2020-11-28T19:21:47Z)