Hybrid Systolic Array Accelerator with Optimized Dataflow for Edge Large Language Model Inference
- URL: http://arxiv.org/abs/2507.09010v1
- Date: Fri, 11 Jul 2025 20:27:30 GMT
- Title: Hybrid Systolic Array Accelerator with Optimized Dataflow for Edge Large Language Model Inference
- Authors: Chun-Ting Chen, HanGyeol Mun, Jian Meng, Mohamed S. Abdelfattah, Jae-sun Seo
- Abstract summary: An edge accelerator should achieve high area efficiency and minimize external memory access. This paper proposes an edge LLM inference accelerator featuring a hybrid systolic array architecture. Our solution achieves 247/117 (token/s/mm2) while running a 1.3B LLM on long-input/long-output scenarios.
- Score: 8.475319961845903
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Edge inference for large language models (LLMs) offers secure, low-latency, and cost-effective inference solutions. We emphasize that an edge accelerator should achieve high area efficiency and minimize external memory access (EMA) during the memory-bound decode stage, while maintaining high energy efficiency during the compute-intensive prefill stage. This paper proposes an edge LLM inference accelerator featuring a hybrid systolic array (HSA) architecture that optimizes inference efficiency in both stages. To further reduce EMA, we adopt MXINT4 weight quantization and propose an optimized dataflow tailored for HSA, ensuring negligible dequantization overhead and achieving 100% hardware utilization with minimal accuracy loss under edge DRAM bandwidth constraints. For non-linear operations, we incorporate optimized root mean square normalization (RMSNorm) and rotary position embedding (RoPE) units, reducing their latency, area, and memory access overhead while enabling end-to-end inference on our accelerator. Our solution achieves 247/117 (token/s/mm2) while running a 1.3B LLM on long-input/long-output scenarios, providing >2.45x/13.5x improvement over existing approaches, while maintaining superior energy efficiency in token generation.
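As a rough software reference for the operations named in the abstract, here is a minimal NumPy sketch: an MXINT4-style dequantization assuming a block-shared power-of-two scale over groups of 32 int4 values (one common reading of the MX format; the paper's exact encoding and HSA dataflow are not reproduced), plus straightforward RMSNorm and rotate-half RoPE functions. All names and the block size are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

BLOCK = 32  # assumed MX block size: 32 weights share one power-of-two scale

def dequantize_mxint4(q, shared_exp):
    """Dequantize MXINT4-style weights.
    q          : int4 values stored as int8, shape (n_blocks, BLOCK)
    shared_exp : per-block exponent, shape (n_blocks,); scale = 2**shared_exp
    """
    return q.astype(np.float32) * (2.0 ** shared_exp.astype(np.float32))[:, None]

def rmsnorm(x, gain, eps=1e-6):
    """Root mean square normalization: x / rms(x) * gain."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * gain

def rope(x, positions, base=10000.0):
    """Rotary position embedding, rotate-half formulation, even head dimension."""
    half = x.shape[-1] // 2
    inv_freq = base ** (-np.arange(half) / half)   # per-pair rotation frequency
    ang = positions[:, None] * inv_freq[None, :]   # (seq_len, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
```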
Related papers
- Inter2Former: Dynamic Hybrid Attention for Efficient High-Precision Interactive Segmentation [58.0729162588429]
Interactive segmentation improves annotation efficiency by segmenting target regions from user prompts. Current approaches face a critical trade-off: dense-token methods achieve superior accuracy but suffer from prohibitively slow processing on CPU devices. We propose Inter2Former to address this challenge by optimizing computation allocation in dense-token processing.
arXiv Detail & Related papers (2025-07-13T12:33:37Z) - R-Sparse: Rank-Aware Activation Sparsity for Efficient LLM Inference [77.47238561728459]
R-Sparse is a training-free activation sparsity approach capable of achieving high sparsity levels in advanced LLMs. Experiments on Llama-2/3 and Mistral models across ten diverse tasks demonstrate that R-Sparse achieves comparable performance at 50% model-level sparsity.
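R-Sparse's rank-aware criterion is not described here in enough detail to reproduce; purely as an illustration of activation sparsity in general, the sketch below keeps only the largest-magnitude activations and skips the corresponding weight rows, so the skipped rows are never read.

```python
import numpy as np

def sparse_linear(x, W, sparsity=0.5):
    """Plain magnitude top-k activation sparsity (not R-Sparse's rank-aware rule):
    zero out the smallest activations and skip the matching rows of W entirely."""
    k = max(1, int(x.shape[-1] * (1.0 - sparsity)))   # number of activations kept
    keep = np.argpartition(np.abs(x), -k)[-k:]        # indices of the survivors
    return x[keep] @ W[keep, :]                       # skipped rows are never loaded
```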
arXiv Detail & Related papers (2025-04-28T03:30:32Z) - Accelerating LLM Inference Throughput via Asynchronous KV Cache Prefetching [12.993197799897532]
Large Language Models (LLMs) exhibit pronounced memory-bound characteristics during inference due to High Bandwidth Memory (HBM) bandwidth constraints. We propose an L2 Cache-oriented asynchronous KV Cache prefetching method to break through the memory bandwidth bottleneck in LLM inference through computation-load overlap.
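The paper's prefetcher operates at the GPU L2-cache level; as a purely software analogue of the computation/load overlap it describes, here is a sketch that fetches the next layer's KV block on a background thread while the current layer's attention runs. The callables and cache layout are hypothetical placeholders.

```python
from concurrent.futures import ThreadPoolExecutor

def decode_with_prefetch(x, layers, kv_cache, fetch_kv, attend):
    """Overlap loading layer i+1's KV block with layer i's attention compute.
    fetch_kv(kv_cache, i) -> KV block; attend(layer, x, kv) -> hidden state.
    Both are placeholder callables standing in for hardware-level prefetching."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch_kv, kv_cache, 0)       # warm up: layer 0's KV
        for i, layer in enumerate(layers):
            kv = pending.result()                          # KV for layer i is ready
            if i + 1 < len(layers):                        # kick off the next load
                pending = pool.submit(fetch_kv, kv_cache, i + 1)
            x = attend(layer, x, kv)                       # runs while the fetch proceeds
        return x
```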
arXiv Detail & Related papers (2025-04-08T09:17:35Z) - PRESERVE: Prefetching Model Weights and KV-Cache in Distributed LLM Serving [2.7309692684728613]
Large language models (LLMs) are typically served from clusters of GPUs/NPUs that consist of a large number of devices. Prior work addressed this issue by overlapping communication with compute, but has severe limitations due to the data dependencies between these operations. We propose PRESERVE, a novel framework that prefetches model weights and KV-cache from off-chip memory to the on-chip cache of AI accelerators.
arXiv Detail & Related papers (2025-01-14T15:14:10Z) - Progressive Mixed-Precision Decoding for Efficient LLM Inference [49.05448842542558]
We introduce Progressive Mixed-Precision Decoding (PMPD) to address the memory-boundedness of decoding. PMPD achieves 1.4-12.2x speedup in matrix-vector multiplications over fp16 models. Our approach delivers a throughput gain of 3.8-8.0x over fp16 models and up to 1.54x over uniform quantization approaches.
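A toy sketch of the progressive idea under assumed names: keep two quantized copies of the weights and switch from the higher- to the lower-precision copy once enough tokens have been generated. The switch point and the two-phase schedule are illustrative; PMPD's actual precision scheduler is not reproduced.

```python
def pick_weights(step, w_high, w_low, switch_at=64):
    """Early decode steps use the higher-precision weights, later steps the
    lower-precision ones. `switch_at` is an illustrative threshold, not a value
    from the paper."""
    return w_high if step < switch_at else w_low

# usage sketch inside a decode loop (forward / sample are placeholders):
# for step in range(max_new_tokens):
#     logits = forward(tokens, pick_weights(step, w_int8, w_int4))
#     tokens.append(sample(logits))
```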
arXiv Detail & Related papers (2024-10-17T11:46:33Z) - Efficient Arbitrary Precision Acceleration for Large Language Models on GPU Tensor Cores [3.6385567224218556]
Large language models (LLMs) have been widely applied but face challenges in efficient inference.
We introduce a novel bipolar-INT data format that facilitates parallel computing and supports symmetric quantization.
We implement an arbitrary precision matrix multiplication scheme that decomposes and recovers matrices at the bit level, enabling flexible precision.
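A NumPy sketch of bit-level decomposition for an unsigned-integer left operand: split A into binary bit planes, multiply each plane, and recombine with powers of two. The paper's bipolar-INT format and Tensor Core kernels are not modeled; this only shows why bit-level decomposition recovers the exact product.

```python
import numpy as np

def bitplane_matmul(A, B, bits=4):
    """A @ B via bit-plane decomposition of unsigned A:
    A @ B = sum_b 2**b * (A_b @ B), where A_b is the binary matrix of bit b."""
    acc = np.zeros((A.shape[0], B.shape[1]), dtype=np.int64)
    for b in range(bits):
        plane = ((A >> b) & 1).astype(np.int64)   # binary plane for bit b
        acc += (plane @ B.astype(np.int64)) << b
    return acc

# sanity check against a direct integer matmul
A = np.random.randint(0, 16, size=(3, 5))
B = np.random.randint(-8, 8, size=(5, 2))
assert np.array_equal(bitplane_matmul(A, B), A.astype(np.int64) @ B.astype(np.int64))
```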
arXiv Detail & Related papers (2024-09-26T14:17:58Z) - Enabling Efficient On-Device Fine-Tuning of LLMs Using Only Inference Engines [17.539008562641303]
Large Language Models (LLMs) are currently pre-trained and fine-tuned on large cloud servers.
The next frontier is LLM personalization, where a foundation model can be fine-tuned with user/task-specific data.
Fine-tuning on resource-constrained edge devices presents significant challenges due to substantial memory and computational demands.
arXiv Detail & Related papers (2024-09-23T20:14:09Z) - Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers [58.5711048151424]
We introduce SPARSEK Attention, a novel sparse attention mechanism designed to overcome computational and memory obstacles.
Our approach integrates a scoring network and a differentiable top-k mask operator, SPARSEK, to select a constant number of KV pairs for each query.
Experimental results reveal that SPARSEK Attention outperforms previous sparse attention methods.
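SPARSEK's learned scoring network and differentiable top-k operator are not reproduced here; as a rough single-query illustration of selecting a constant number of KV pairs, the sketch below uses a hard top-k over dot-product scores.

```python
import numpy as np

def topk_sparse_attention(q, K, V, k=16):
    """Attend to only the k highest-scoring KV pairs for one query vector.
    Hard top-k over dot products stands in for SPARSEK's learned, differentiable
    selection; k stays constant regardless of sequence length."""
    scores = K @ q / np.sqrt(q.shape[-1])          # (seq_len,)
    keep = np.argpartition(scores, -k)[-k:]        # indices of the k best keys
    w = np.exp(scores[keep] - scores[keep].max())  # softmax over the kept scores
    w /= w.sum()
    return w @ V[keep]                             # (d_v,)
```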
arXiv Detail & Related papers (2024-06-24T15:55:59Z) - EDGE-LLM: Enabling Efficient Large Language Model Adaptation on Edge Devices via Layerwise Unified Compression and Adaptive Layer Tuning and Voting [12.006890185810322]
We introduce a computation- and memory-efficient LLM tuning framework, called Edge-LLM, to facilitate affordable and effective LLM adaptation on edge devices.
Specifically, Edge-LLM features three core components: (1) a layer-wise unified compression (LUC) technique to reduce the computation overhead by generating layer-wise pruning sparsity and quantization bit-width policies, (2) an adaptive layer tuning and voting scheme to reduce the memory overhead by reducing the backpropagation depth, and (3) a complementary hardware scheduling strategy to handle the irregular computation patterns introduced by LUC and adaptive layer tuning.
arXiv Detail & Related papers (2024-06-22T06:51:47Z) - Practical Conformer: Optimizing size, speed and flops of Conformer for on-Device and cloud ASR [67.63332492134332]
We design an optimized conformer that is small enough to meet on-device restrictions and has fast inference on TPUs.
Our proposed encoder can double as a strong standalone encoder for on-device use, and as the first part of a high-performance ASR pipeline.
arXiv Detail & Related papers (2023-03-31T23:30:48Z) - Multipoint-BAX: A New Approach for Efficiently Tuning Particle Accelerator Emittance via Virtual Objectives [47.52324722637079]
We propose a new information-theoretic algorithm, Multipoint-BAX, for black-box optimization on multipoint queries.
We use Multipoint-BAX to minimize emittance at the Linac Coherent Light Source (LCLS) and the Facility for Advanced Accelerator Experimental Tests II (FACET-II).
arXiv Detail & Related papers (2022-09-10T04:01:23Z) - ATTACC the Quadratic Bottleneck of Attention Layers [3.2741800634280245]
This paper introduces a new attention-tailored dataflow, termed FLAT, for deep neural network (DNN) accelerators.
It increases the effective memory bandwidth by efficiently utilizing the high-bandwidth, low-capacity on-chip buffer.
In our evaluation, ATTACC achieves 1.94x and 1.76x speedup and 49% and 42% of energy reduction compared to state-of-the-art edge and cloud accelerators.
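The FLAT/ATTACC dataflow is a hardware scheduling choice, but its core effect can be mimicked in software: stream keys and values in on-chip-buffer-sized tiles and maintain a running softmax so the full sequence-length score row is never materialized. The tile size and function below are illustrative assumptions.

```python
import numpy as np

def tiled_attention(q, K, V, tile=64):
    """Attention for one query processed tile by tile with an online softmax,
    so intermediates stay small enough to live in an on-chip buffer."""
    m, denom = -np.inf, 0.0                  # running max and softmax denominator
    acc = np.zeros(V.shape[-1])              # running weighted sum of values
    for start in range(0, K.shape[0], tile):
        s = K[start:start + tile] @ q / np.sqrt(q.shape[-1])
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)            # rescale previous partial results
        p = np.exp(s - m_new)
        denom = denom * scale + p.sum()
        acc = acc * scale + p @ V[start:start + tile]
        m = m_new
    return acc / denom
```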
arXiv Detail & Related papers (2021-07-13T22:23:40Z) - EdgeBERT: Sentence-Level Energy Optimizations for Latency-Aware Multi-Task NLP Inference [82.1584439276834]
Transformer-based language models such as BERT provide significant accuracy improvement for a multitude of natural language processing (NLP) tasks.
We present EdgeBERT, an in-depth algorithm-hardware co-design for latency-aware energy optimization for multi-task NLP.
arXiv Detail & Related papers (2020-11-28T19:21:47Z)