SlideSparse: Fast and Flexible (2N-2):2N Structured Sparsity
- URL: http://arxiv.org/abs/2603.05232v1
- Date: Thu, 05 Mar 2026 14:49:16 GMT
- Title: SlideSparse: Fast and Flexible (2N-2):2N Structured Sparsity
- Authors: Hanyong Shao, Yingbo Hao, Ting Song, Yan Xia, Di Zhang, Shaohan Huang, Xun Wu, Songchen Xu, Le Xu, Li Dong, Zewen Chi, Yi Zou, Furu Wei
- Abstract summary: NVIDIA's 2:4 Sparse Tensor Cores deliver 2x throughput but demand strict 50% pruning. Milder $(2N-2):2N$ patterns preserve accuracy yet receive no hardware support. We present SlideSparse, the first system to unlock Sparse Tensor Core acceleration for the $(2N-2):2N$ model family.
- Score: 86.71343842875878
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: NVIDIA's 2:4 Sparse Tensor Cores deliver 2x throughput but demand strict 50% pruning -- a ratio that collapses LLM reasoning accuracy (Qwen3: 54% to 15%). Milder $(2N-2):2N$ patterns (e.g., 6:8, 25% pruning) preserve accuracy yet receive no hardware support, falling back to dense execution without any benefit from sparsity. We present SlideSparse, the first system to unlock Sparse Tensor Core acceleration for the $(2N-2):2N$ model family on commodity GPUs. Our Sliding Window Decomposition reconstructs any $(2N-2):2N$ weight block into $N-1$ overlapping 2:4-compliant windows without any accuracy loss; Activation Lifting fuses the corresponding activation rearrangement into per-token quantization at near-zero cost. Integrated into vLLM, SlideSparse is evaluated across various GPUs (A100, H100, B200, RTX 4090, RTX 5080, DGX-spark), precisions (FP4, INT8, FP8, BF16, FP16), and model families (Llama, Qwen, BitNet). On compute-bound workloads, the measured speedup ratio (1.33x) approaches the theoretical upper-bound $N/(N-1)=4/3$ at 6:8 weight sparsity in Qwen2.5-7B, establishing $(2N-2):2N$ as a practical path to accuracy-preserving LLM acceleration. Code available at https://github.com/bcacdwk/vllmbench.
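The core construction is easy to state in code. Below is a minimal NumPy sketch (our illustration, not the authors' implementation; the greedy nonzero-to-window assignment is an assumption for this sketch) that decomposes a 6:8 block (N=4) into the N-1 = 3 overlapping 2:4-compliant windows and verifies lossless reconstruction:

```python
import numpy as np

def sliding_window_decompose(block):
    """block: length-2N vector with at most 2N-2 nonzeros (a (2N-2):2N block).
    Returns [(offset, window)]: length-4 windows starting at even offsets,
    each with at most 2 nonzeros (2:4-compliant), summing back to `block`."""
    two_n = block.shape[0]
    n = two_n // 2
    offsets = [2 * i for i in range(n - 1)]            # window i covers [2i, 2i+4)
    windows = [np.zeros(4, dtype=block.dtype) for _ in offsets]
    load = [0] * (n - 1)                               # nonzeros placed per window
    for p in np.flatnonzero(block):
        covering = [i for i, o in enumerate(offsets) if o <= p < o + 4]
        i = next(i for i in covering if load[i] < 2)   # greedy: earliest with room
        windows[i][p - offsets[i]] = block[p]
        load[i] += 1
    return list(zip(offsets, windows))

# Example: a 6:8 block (N=4, 25% pruning) -> 3 overlapping 2:4 windows.
w = np.array([0.5, -1.0, 0.0, 2.0, 0.0, 1.5, -0.5, 3.0])
recon = np.zeros_like(w)
for off, win in sliding_window_decompose(w):
    assert np.count_nonzero(win) <= 2                  # each window is 2:4
    recon[off:off + 4] += win
assert np.allclose(recon, w)                           # lossless reconstruction
```

Each window can then run as a standard 2:4 sparse operation; three 4-wide sparse windows replace one 8-wide dense pass, which is where the $N/(N-1)=4/3$ compute upper bound quoted above comes from.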
Related papers
- FPGA Co-Design for Efficient N:M Sparse and Quantized Model Inference [0.8749675983608171]
Large language models (LLMs) have demonstrated remarkable performance across a wide range of language processing tasks. This work introduces an automation framework that leverages weight pruning and low-bit quantization. We present a hardware-software co-design method that generates accelerators on the Field-Programmable Gate Array (FPGA) platform.
arXiv Detail & Related papers (2025-12-31T08:27:40Z)
- FlexQ: Efficient Post-training INT6 Quantization for LLM Serving via Algorithm-System Co-Design [13.062940916273973]
Large Language Models (LLMs) demonstrate exceptional performance but entail significant memory and computational costs. Existing INT4/INT8 quantization methods reduce these costs but often degrade accuracy or lack optimal efficiency. We propose FlexQ, a novel framework combining algorithmic innovation with system-level evaluations.
arXiv Detail & Related papers (2025-08-06T12:47:05Z)
- SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization [34.548270527357126]
We propose SageAttention2, which utilizes significantly faster 4-bit matrix multiplication (Matmul) alongside additional precision-enhancing techniques. Our approach incurs negligible end-to-end metrics loss across diverse models, including those for language, image, and video generation.
arXiv Detail & Related papers (2024-11-17T04:35:49Z)
- SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models [61.474101404805545]
Diffusion models can generate high-quality images, but as they scale, rising memory demands and higher latency pose deployment challenges. We propose SVDQuant, a new 4-bit quantization paradigm to overcome this limitation. We reduce the memory usage for the 12B FLUX.1 models by 3.5$\times$, achieving a 3.0$\times$ speedup over the 4-bit weight-only quantization (W4A16) baseline.
arXiv Detail & Related papers (2024-11-07T18:59:58Z)
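A minimal sketch of SVDQuant's outlier-absorption idea (our illustration, not the authors' code): a small high-precision low-rank branch takes the outliers, and only the residual is quantized to 4 bits. The rank-16 branch and the naive symmetric quantizer are assumptions; the paper's smoothing and fused kernels are omitted.

```python
import numpy as np

def quant4(x):
    """Symmetric per-tensor 4-bit quantization (levels -7..7)."""
    scale = np.abs(x).max() / 7.0 + 1e-12
    q = np.clip(np.round(x / scale), -7, 7)
    return q, scale

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256))
W[:, :4] *= 50.0                       # a few outlier channels dominate the range

# Low-rank branch absorbs the outliers; only the residual is 4-bit quantized.
U, S, Vt = np.linalg.svd(W, full_matrices=False)
r = 16                                 # assumed branch rank
L = (U[:, :r] * S[:r]) @ Vt[:r]        # kept in high precision
q, s = quant4(W - L)
W_hat = L + q * s

q0, s0 = quant4(W)                     # baseline: direct 4-bit quantization
print("direct 4-bit err:    ", np.linalg.norm(W - q0 * s0))
print("low-rank + 4-bit err:", np.linalg.norm(W - W_hat))
```

Because the residual's dynamic range is far narrower once the outlier columns are absorbed, the same 4-bit budget yields a much smaller error than quantizing W directly.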
- S-STE: Continuous Pruning Function for Efficient 2:4 Sparse Pre-training [20.113352600259226]
We propose S-STE, a simple yet powerful 2:4 training method that contains two parts: to continuously project weights to be 2:4 sparse, and to rescale sparse weights with a per-tensor fixed scaling factor. Results show that our method surpasses previous 2:4 pre-training recipes and is comparable even with full parameter models.
arXiv Detail & Related papers (2024-09-13T08:29:36Z)
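A minimal sketch of the two ingredients named in the S-STE summary (our reading, not the authors' code): a 2:4 projection followed by a per-tensor rescale. We use a hard top-2 magnitude projection where the paper uses a continuous pruning function, and a norm-preserving factor as the assumed rescale.

```python
import numpy as np

def project_2to4(w):
    """Keep the 2 largest-magnitude entries in each group of 4 (2:4 pattern)."""
    g = w.reshape(-1, 4)
    drop = np.argsort(np.abs(g), axis=1)[:, :2]   # 2 smallest |.| per group
    out = g.copy()
    np.put_along_axis(out, drop, 0.0, axis=1)
    return out.reshape(w.shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(128, 128))
s = project_2to4(w)
beta = np.linalg.norm(w) / np.linalg.norm(s)      # per-tensor fixed rescale
print("rescale factor:", round(float(beta), 4))
print("sparsity:", float(np.mean(s == 0)))        # exactly 0.5
```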
- QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving [52.31791050376249]
Quantization can accelerate large language model (LLM) inference. We introduce QoQ, a W4A8KV4 quantization algorithm with 4-bit weight, 8-bit activation, and 4-bit KV cache. QServe improves the maximum achievable serving throughput of Llama-3-8B by 1.2x on A100 and 1.4x on L40S, and of Qwen1.5-72B by 2.4x on A100 and 3.5x on L40S.
arXiv Detail & Related papers (2024-05-07T17:59:30Z)
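A toy W4A8 sketch in the spirit of the QoQ recipe above (not QServe's kernels): per-output-channel INT4 weights and per-token INT8 activations feed an integer GEMM that is dequantized by an outer product of scales. The scale granularity is our assumption; progressive group quantization and the 4-bit KV cache are omitted.

```python
import numpy as np

def quant_weight_int4(W):
    """Symmetric per-output-channel INT4 (levels -7..7)."""
    s = np.abs(W).max(axis=1, keepdims=True) / 7.0 + 1e-12
    return np.clip(np.round(W / s), -7, 7).astype(np.int8), s

def quant_act_int8(X):
    """Symmetric per-token INT8 (levels -127..127)."""
    s = np.abs(X).max(axis=1, keepdims=True) / 127.0 + 1e-12
    return np.clip(np.round(X / s), -127, 127).astype(np.int8), s

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 128))          # [out_features, in_features]
X = rng.normal(size=(8, 128))           # [tokens, in_features]

Wq, sw = quant_weight_int4(W)
Xq, sx = quant_act_int8(X)
# Integer GEMM, then dequantize with the outer product of scales.
Y = (Xq.astype(np.int32) @ Wq.astype(np.int32).T) * sx * sw.T
ref = X @ W.T
print("rel. error vs FP:", np.linalg.norm(ref - Y) / np.linalg.norm(ref))
```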
- Fast-iTPN: Integrally Pre-Trained Transformer Pyramid Network with Token Migration [138.24994198567794]
iTPN is born with two elaborated designs: 1) The first pre-trained feature pyramid upon vision transformer (ViT).
Fast-iTPN can accelerate the inference procedure by up to 70%, with negligible performance loss.
arXiv Detail & Related papers (2022-11-23T06:56:12Z)
- Bounding the Width of Neural Networks via Coupled Initialization -- A Worst Case Analysis [121.9821494461427]
We show how to significantly reduce the number of neurons required for two-layer ReLU networks.
We also prove new lower bounds that improve upon prior work, and that under certain assumptions, are best possible.
arXiv Detail & Related papers (2022-06-26T06:51:31Z)
- 1$\times$N Block Pattern for Network Sparsity [90.43191747596491]
We propose a novel concept of $1\times N$ block sparsity pattern (block pruning) to break this limitation.
Our pattern obtains an improvement of about 3.0% over filter pruning in the top-1 accuracy of MobileNet-V2.
It also obtains 56.04ms inference savings on Cortex-A7 CPU over weight pruning.
arXiv Detail & Related papers (2021-05-31T05:50:33Z)
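A minimal sketch of the $1\times N$ block pattern from the last entry (our illustration, not the authors' code), assuming L1-norm block scoring and a global 50% sparsity target:

```python
import numpy as np

def prune_1xN(W, N=4, sparsity=0.5):
    """Zero the lowest-L1 1xN blocks of W until `sparsity` is reached."""
    rows, cols = W.shape
    assert cols % N == 0
    blocks = W.reshape(rows, cols // N, N)
    score = np.abs(blocks).sum(axis=2)            # L1 norm of each 1xN block
    k = int(score.size * sparsity)                # number of blocks to remove
    thresh = np.partition(score.ravel(), k)[k]    # global magnitude threshold
    mask = (score >= thresh)[..., None]           # broadcast over the N dimension
    return (blocks * mask).reshape(rows, cols)

W = np.random.default_rng(0).normal(size=(8, 16))
Wp = prune_1xN(W, N=4, sparsity=0.5)
print("block-sparse fraction:", float(np.mean(Wp == 0)))
```

Pruning N consecutive weights at a time keeps the surviving weights contiguous in memory, which is what translates into real latency savings on CPUs such as the Cortex-A7 cited in the summary.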