Accelerating LLM Inference with Flexible N:M Sparsity via A Fully Digital Compute-in-Memory Accelerator
- URL: http://arxiv.org/abs/2504.14365v1
- Date: Sat, 19 Apr 2025 17:47:01 GMT
- Title: Accelerating LLM Inference with Flexible N:M Sparsity via A Fully Digital Compute-in-Memory Accelerator
- Authors: Akshat Ramachandran, Souvik Kundu, Arnab Raha, Shamik Kundu, Deepak K. Mathaikutty, Tushar Krishna
- Abstract summary: Large language model (LLM) pruning with fixed N:M structured sparsity limits the expressivity of the sparse model. We present a flexible layer-wise outlier-density-aware N:M sparsity (FLOW) selection method. We then introduce a flexible, low-overhead digital compute-in-memory architecture (FlexCiM).
- Score: 5.985414012866983
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language model (LLM) pruning with fixed N:M structured sparsity significantly limits the expressivity of the sparse model, yielding sub-optimal performance. In contrast, supporting multiple N:M patterns to provide sparse representational freedom introduces costly overhead in hardware. To address these challenges for LLMs, we first present a flexible layer-wise outlier-density-aware N:M sparsity (FLOW) selection method. FLOW enables the identification of optimal layer-wise N and M values (from a given range) by simultaneously accounting for the presence and distribution of outliers, allowing a higher degree of representational freedom. To deploy sparse models with such N:M flexibility, we then introduce a flexible, low-overhead digital compute-in-memory architecture (FlexCiM). FlexCiM supports diverse sparsity patterns by partitioning a digital CiM (DCiM) macro into smaller sub-macros, which are adaptively aggregated and disaggregated through distribution and merging mechanisms for different N and M values. Extensive experiments on both transformer-based and recurrence-based state space foundation models (SSMs) demonstrate that FLOW outperforms existing alternatives with an accuracy improvement of up to 36%, while FlexCiM achieves up to 1.75x lower inference latency and 1.5x lower energy consumption compared to existing sparse accelerators. Code is available at: https://github.com/FLOW-open-project/FLOW
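The pruning primitive underlying FLOW is N:M structured sparsity: within every group of M consecutive weights, only N are kept non-zero, and FLOW additionally lets N and M vary per layer based on outlier density. The PyTorch sketch below shows only that generic primitive; the function name `nm_prune` and the example shapes are illustrative assumptions, and the outlier-density-aware (N, M) selection itself is described in the paper and the linked repository, not here.

```python
import torch

def nm_prune(weight: torch.Tensor, n: int, m: int) -> torch.Tensor:
    """Keep the n largest-magnitude weights in every group of m
    consecutive weights along the input dimension; zero the rest."""
    out_features, in_features = weight.shape
    assert in_features % m == 0, "input dim must be divisible by m"
    groups = weight.reshape(out_features, in_features // m, m)
    # Indices of the (m - n) smallest-magnitude weights in each group.
    _, drop_idx = groups.abs().topk(m - n, dim=-1, largest=False)
    mask = torch.ones_like(groups)
    mask.scatter_(-1, drop_idx, 0.0)
    return (groups * mask).reshape(out_features, in_features)

# A fixed 2:4 pattern versus the denser 3:4 pattern that an
# outlier-heavy layer might be assigned instead.
w = torch.randn(8, 16)
w_2of4 = nm_prune(w, n=2, m=4)   # 50% sparsity
w_3of4 = nm_prune(w, n=3, m=4)   # 25% sparsity
```

Supporting several such (N, M) pairs in one accelerator is the hardware flexibility that FlexCiM's sub-macro aggregation and disaggregation mechanisms provide.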
Related papers
- ElasticMM: Efficient Multimodal LLMs Serving with Elastic Multimodal Parallelism [9.93378263858092]
Multimodal large language models (MLLMs) handle images, videos, and audio by incorporating feature extractors and projection modules. Current tightly coupled serving architectures struggle to distinguish between mixed request types. We propose Elastic Multimodal Parallelism (EMP), a new serving paradigm that elastically adapts to resource heterogeneity.
arXiv Detail & Related papers (2025-07-14T08:53:48Z)
- Flexiffusion: Training-Free Segment-Wise Neural Architecture Search for Efficient Diffusion Models [50.260693393896716]
Diffusion models (DMs) are powerful generative models capable of producing high-fidelity images but constrained by high computational costs. We propose Flexiffusion, a training-free NAS framework that jointly optimizes generation schedules and model architectures without modifying pre-trained parameters. Our work pioneers a resource-efficient paradigm for searching high-speed DMs without sacrificing quality.
arXiv Detail & Related papers (2025-06-03T06:02:50Z)
- Accelerating Diffusion LLMs via Adaptive Parallel Decoding [50.9948753314669]
We introduce adaptive parallel decoding (APD), a novel method that dynamically adjusts the number of tokens sampled in parallel. APD provides markedly higher throughput with minimal quality degradation on downstream benchmarks.
arXiv Detail & Related papers (2025-05-31T06:10:10Z)
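The summary above does not spell out APD's acceptance rule, so the following is only a toy sketch of the general idea of adjusting the parallel decoding width from per-token confidence; the function name, threshold, and window-adjustment policy are assumptions, not the paper's algorithm.

```python
import torch

def adaptive_parallel_step(confidences: torch.Tensor, k: int,
                           accept_thresh: float = 0.9,
                           k_min: int = 1, k_max: int = 16) -> tuple:
    """Accept the longest prefix of k parallel-proposed tokens whose
    confidence clears a threshold, then grow or shrink k for the next step."""
    accepted = 0
    for c in confidences[:k].tolist():
        if c < accept_thresh:
            break
        accepted += 1
    # Widen the parallel window after a full acceptance, narrow it otherwise.
    next_k = min(k + 1, k_max) if accepted == k else max(accepted, k_min)
    return accepted, next_k

# Example step: 5 tokens proposed in parallel, the first 3 are confident enough.
accepted, next_k = adaptive_parallel_step(
    torch.tensor([0.97, 0.95, 0.93, 0.70, 0.60]), k=5)
```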
- FlexQuant: A Flexible and Efficient Dynamic Precision Switching Framework for LLM Quantization [18.041828697950812]
We propose FlexQuant, a dynamic precision-switching framework to optimize the trade-off between inference speed and accuracy. Our work provides a comprehensive analysis of quantization strategies, introduces a precision requirement model for optimal switching, and implements efficient fine-grained precision management. Experimental results demonstrate that FlexQuant achieves a 1.3x end-to-end speedup across diverse language tasks with negligible accuracy loss.
arXiv Detail & Related papers (2025-05-21T07:42:53Z)
- LLM Braces: Straightening Out LLM Predictions with Relevant Sub-Updates [27.022532404557264]
We propose LLMBRACES, a method that computes relevance scores associated with value vectors in FFN layers. By optimizing sub-update contributions, LLMBRACES refines the prediction process, leading to more accurate and reliable outputs. LLMBRACES excels in sentiment-controlled generation and toxicity reduction, highlighting its potential for flexible, controlled text generation across applications.
arXiv Detail & Related papers (2025-03-20T16:55:26Z)
- Tackling the Dynamicity in a Production LLM Serving System with SOTA Optimizations via Hybrid Prefill/Decode/Verify Scheduling on Efficient Meta-kernels [12.77187564450236]
We introduce XY-Serve, a versatile, Ascend-native, end-to-end production large language model (LLM) serving system. The core idea is an abstraction mechanism that smooths out workload variability by decomposing computations into fine-grained meta primitives. For GEMM, we introduce a virtual padding scheme that adapts to dynamic shape changes while using highly efficient GEMM primitives with assorted fixed tile sizes.
arXiv Detail & Related papers (2024-12-24T02:27:44Z)
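Padding dynamic GEMM shapes up to a small set of fixed tile sizes is a generic technique; the sketch below illustrates only that idea, and the tile sizes, function name, and fallback policy are assumptions rather than a description of XY-Serve's Ascend meta-kernels.

```python
import torch

TILE_SIZES = (16, 32, 64, 128)  # assumed set of pre-compiled kernel shapes

def padded_gemm(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Pad the dynamic M dimension of A (M x K) up to the nearest supported
    tile size, run a fixed-shape GEMM, then strip the padding rows."""
    m, k = a.shape
    m_pad = next((t for t in TILE_SIZES if t >= m), None)
    if m_pad is None:  # fall back to a multiple of the largest tile
        tile = TILE_SIZES[-1]
        m_pad = ((m + tile - 1) // tile) * tile
    a_padded = torch.zeros(m_pad, k, dtype=a.dtype, device=a.device)
    a_padded[:m] = a
    return (a_padded @ b)[:m]

# Example: a request with 37 rows executes on a 64-row kernel shape.
out = padded_gemm(torch.randn(37, 256), torch.randn(256, 512))
```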
- SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration [10.970637831760136]
Speculative decoding (SD) has emerged as a widely used paradigm to accelerate LLM inference without compromising quality. We introduce SWIFT, an on-the-fly self-speculative decoding algorithm that adaptively selects intermediate layers of LLMs to skip during inference. Our experiments demonstrate that SWIFT can achieve over a 1.3x-1.6x speedup while preserving the original distribution of the generated text.
arXiv Detail & Related papers (2024-10-09T14:15:30Z)
- Reference Trustable Decoding: A Training-Free Augmentation Paradigm for Large Language Models [79.41139393080736]
Large language models (LLMs) have rapidly advanced and demonstrated impressive capabilities. In-Context Learning (ICL) and Parameter-Efficient Fine-Tuning (PEFT) are currently two mainstream methods for adapting LLMs to downstream tasks.
We propose Reference Trustable Decoding (RTD), a paradigm that allows models to quickly adapt to new tasks without fine-tuning.
arXiv Detail & Related papers (2024-09-30T10:48:20Z)
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs [58.11584672945781]
FLUTE is a flexible lookup table engine for LUT-quantized LLMs. At batch sizes below 32 and a quantization group size of 128, the FLUTE kernel can be 2-4x faster than existing GEMM kernels.
arXiv Detail & Related papers (2024-07-15T17:55:42Z)
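As a rough, non-fused mental model of lookup-table dequantization (FLUTE itself fuses this into the GEMM kernel), the sketch below stores each weight as a small integer code and recovers its value from a per-group table before a standard matmul; the names, shapes, and 4-bit/group-128 configuration are illustrative assumptions.

```python
import torch

def lut_dequant_matmul(x: torch.Tensor, w_idx: torch.Tensor,
                       lut: torch.Tensor, group_size: int = 128) -> torch.Tensor:
    """Dequantize LUT-coded weights and multiply.

    w_idx: (out_features, in_features) integer codes, e.g. 0..15 for 4-bit
    lut:   (out_features, in_features // group_size, 2**bits) per-group tables
    """
    out_f, in_f = w_idx.shape
    idx = w_idx.reshape(out_f, in_f // group_size, group_size).long()
    # Gather each weight's real value from its group's lookup table.
    w = torch.gather(lut, dim=-1, index=idx).reshape(out_f, in_f)
    return x @ w.t()

# Example: 4-bit codes with group size 128 and a batch of 32 activations.
out_f, in_f = 64, 256
w_idx = torch.randint(0, 16, (out_f, in_f))
lut = torch.randn(out_f, in_f // 128, 16)
y = lut_dequant_matmul(torch.randn(32, in_f), w_idx, lut)
```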
- Cloud-Device Collaborative Learning for Multimodal Large Language Models [24.65882336700547]
We introduce a Cloud-Device Collaborative Continual Adaptation framework to enhance the performance of compressed, device-deployed MLLMs.
Our framework is structured into three key components: a device-to-cloud uplink for efficient data transmission, cloud-based knowledge adaptation, and an optimized cloud-to-device downlink for model deployment.
arXiv Detail & Related papers (2023-12-26T18:46:14Z)
- Modality Plug-and-Play: Elastic Modality Adaptation in Multimodal LLMs for Embodied AI [10.82017289243097]
Large Language Models (LLMs) are capable of reasoning over diverse input data modalities through pre-trained encoders.
m-LLM improves the task accuracy by up to 4% compared to the best existing scheme.
arXiv Detail & Related papers (2023-12-13T04:08:59Z)
- MatFormer: Nested Transformer for Elastic Inference [91.45687988953435]
MatFormer is a novel Transformer architecture designed to provide elastic inference across diverse deployment constraints. MatFormer achieves this by incorporating a nested Feed Forward Network (FFN) block structure within a standard Transformer model. We show that an 850M decoder-only MatFormer language model (MatLM) allows us to extract multiple smaller models spanning from 582M to 850M parameters.
arXiv Detail & Related papers (2023-10-11T17:57:14Z)
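A minimal sketch of the nested-FFN idea, assuming a sub-model simply reuses a prefix of the FFN hidden units; the class name and dimensions are illustrative, and MatFormer's joint training and model-extraction procedure are described in the paper.

```python
from typing import Optional

import torch
import torch.nn as nn

class NestedFFN(nn.Module):
    """FFN whose hidden units are ordered so that a prefix of them forms
    a smaller, standalone FFN (Matryoshka-style nesting)."""

    def __init__(self, d_model: int = 512, d_hidden: int = 2048):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)
        self.down = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor, d_sub: Optional[int] = None) -> torch.Tensor:
        # Use only the first d_sub hidden units to emulate a smaller sub-model.
        d = d_sub if d_sub is not None else self.up.out_features
        h = torch.relu(x @ self.up.weight[:d].t() + self.up.bias[:d])
        return h @ self.down.weight[:, :d].t() + self.down.bias

ffn = NestedFFN()
x = torch.randn(4, 512)
y_full = ffn(x)               # full-width FFN
y_small = ffn(x, d_sub=512)   # smaller sub-FFN sliced out without retraining
```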
- Can SAM Boost Video Super-Resolution? [78.29033914169025]
We propose a simple yet effective module -- the SAM-guidEd refinEment Module (SEEM).
This lightweight plug-in module is specifically designed to leverage the attention mechanism for the generation of semantic-aware features.
We apply our SEEM to two representative methods, EDVR and BasicVSR, resulting in consistently improved performance with minimal implementation effort.
arXiv Detail & Related papers (2023-05-11T02:02:53Z)
- SWEM: Towards Real-Time Video Object Segmentation with Sequential Weighted Expectation-Maximization [36.43412404616356]
We propose a novel Sequential Weighted Expectation-Maximization (SWEM) network to greatly reduce the redundancy of memory features.
SWEM combines intra-frame and inter-frame similar features by leveraging the sequential weighted EM algorithm.
Experiments on the commonly used DAVIS and YouTube-VOS datasets verify the high efficiency (36 FPS) and high performance (84.3% $\mathcal{J}\&\mathcal{F}$ on the DAVIS 2017 validation dataset).
arXiv Detail & Related papers (2022-08-22T08:03:59Z)
- SlimFL: Federated Learning with Superposition Coding over Slimmable Neural Networks [56.68149211499535]
Federated learning (FL) is a key enabler for efficient communication and computing, leveraging devices' distributed computing capabilities.
This paper proposes a novel learning framework by integrating FL and width-adjustable slimmable neural networks (SNNs).
We propose a communication and energy-efficient SNN-based FL (named SlimFL) that jointly utilizes superposition coding (SC) for global model aggregation and superposition training (ST) for updating local models.
arXiv Detail & Related papers (2022-03-26T15:06:13Z)
- Joint Superposition Coding and Training for Federated Learning over Multi-Width Neural Networks [52.93232352968347]
This paper aims to integrate two synergetic technologies, federated learning (FL) and width-adjustable slimmable neural networks (SNNs).
FL preserves data privacy by exchanging the locally trained models of mobile devices. Combining FL and SNNs is, however, non-trivial, particularly under wireless connections with time-varying channel conditions.
We propose a communication and energy-efficient SNN-based FL (named SlimFL) that jointly utilizes superposition coding (SC) for global model aggregation and superposition training (ST) for updating local models.
arXiv Detail & Related papers (2021-12-05T11:17:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.