Related papers: Sparse-BitNet: 1.58-bit LLMs are Naturally Friendly to Semi-Structured Sparsity

URL: http://arxiv.org/abs/2603.05168v1
Date: Thu, 05 Mar 2026 13:37:50 GMT
Title: Sparse-BitNet: 1.58-bit LLMs are Naturally Friendly to Semi-Structured Sparsity
Authors: Di Zhang, Xun Wu, Shaohan Huang, Yudong Wang, Hanyong Shao, Yingbo Hao, Zewen Chi, Li Dong, Ting Song, Yan Xia, Zhifang Sui, Furu Wei
Abstract summary: We show that 1.58-bit BitNet is naturally more compatible with N:M sparsity than full-precision models. We propose Sparse-BitNet, a unified framework that jointly applies 1.58-bit quantization and dynamic N:M sparsification.
Score: 100.07626315557599
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Semi-structured N:M sparsity and low-bit quantization (e.g., 1.58-bit BitNet) are two promising approaches for improving the efficiency of large language models (LLMs), yet they have largely been studied in isolation. In this work, we investigate their interaction and show that 1.58-bit BitNet is naturally more compatible with N:M sparsity than full-precision models. To study this effect, we propose Sparse-BitNet, a unified framework that jointly applies 1.58-bit quantization and dynamic N:M sparsification while ensuring stable training for the first time. Across multiple model scales and training regimes (sparse pretraining and dense-to-sparse schedules), 1.58-bit BitNet consistently exhibits smaller performance degradation than full-precision baselines at the same sparsity levels and can tolerate higher structured sparsity before accuracy collapse. Moreover, using our custom sparse tensor core, Sparse-BitNet achieves substantial speedups in both training and inference, reaching up to 1.30X. These results highlight that combining extremely low-bit quantization with semi-structured N:M sparsity is a promising direction for efficient LLMs. Code available at https://github.com/AAzdi/Sparse-BitNet
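As a rough illustration of the two ingredients the abstract combines, the following pure-Python sketch applies a magnitude-based N:M mask and an absmean ternary quantizer to a flat list of weights. It is not the authors' implementation: the function names, the mask-then-quantize order, and the tie-breaking rule are assumptions made for this example, and the dynamic (training-time) aspect of Sparse-BitNet is not modeled.

```python
def absmean_ternary(weights, eps=1e-8):
    """Quantize floats to {-1, 0, +1} using one shared absmean scale."""
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    # Round to the nearest integer, then clip into the ternary range.
    ternary = [max(-1, min(1, round(w / scale))) for w in weights]
    return ternary, scale


def nm_mask(weights, n=2, m=4):
    """Return a 0/1 mask keeping the n largest magnitudes per group of m."""
    if len(weights) % m != 0:
        raise ValueError("length must be a multiple of m")
    mask = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        # Assumption: ties are broken in favor of the earlier index.
        keep = sorted(range(m), key=lambda i: (-abs(group[i]), i))[:n]
        mask.extend(1 if i in keep else 0 for i in range(m))
    return mask


def sparse_ternary(weights, n=2, m=4):
    """Assumed order: mask on full-precision magnitudes, then quantize."""
    mask = nm_mask(weights, n, m)
    ternary, scale = absmean_ternary(weights)
    return [t * k for t, k in zip(ternary, mask)], scale


if __name__ == "__main__":
    w = [0.9, -0.1, 0.05, -1.2, 0.3, 0.2, -0.4, 0.0]
    q, s = sparse_ternary(w)  # 2:4 pattern over two groups of four
    print(q, s)
```

With the default 2:4 setting, every group of four consecutive outputs contains at most two nonzero ternary values, which is the pattern that sparse tensor cores are designed to exploit.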

Related papers

BitNet Distillation [90.71353956177705]
We present BitNet Distillation (BitDistill), a lightweight pipeline that fine-tunes off-the-shelf full-precision LLMs into 1.58-bit precision. BitDistill achieves strong task-specific performance with minimal computational cost.
arXiv Detail & Related papers (2025-10-15T18:28:12Z)
BitNet b1.58 2B4T Technical Report [118.78752947128682]
We introduce BitNet b1.58 2B4T, the first open-source, native 1-bit Large Language Model (LLM) at the 2-billion parameter scale. Trained on a corpus of 4 trillion tokens, the model has been rigorously evaluated across benchmarks covering language understanding, mathematical reasoning, coding proficiency, and conversational ability.
arXiv Detail & Related papers (2025-04-16T17:51:43Z)
Bitnet.cpp: Efficient Edge Inference for Ternary LLMs [71.5759603658299]
We introduce Bitnet.cpp, an inference system optimized for BitNet b1.58 and ternary LLMs. Bitnet.cpp incorporates a novel mpGEMM library to facilitate sub-2-bits-per-weight, efficient, and lossless inference. Our experiments show that Bitnet.cpp achieves up to a 6.25x increase in speed over full-precision baselines and up to 2.32x over low-bit baselines.
arXiv Detail & Related papers (2025-02-17T15:06:28Z)
Physics-Inspired Binary Neural Networks: Interpretable Compression with Theoretical Guarantees [20.854288216118423]
Many inverse problems admit algorithm-unrolled networks that naturally encode physics and sparsity. We propose a Physics-Inspired Binary Neural Network (PIBiNN) that combines data-driven one-bit quantization with a single global scale. This design yields compression rates below one bit per weight by exploiting structural zeros.
arXiv Detail & Related papers (2025-02-04T00:53:10Z)
OneBit: Towards Extremely Low-bit Large Language Models [66.29839811207617]
This paper boldly quantizes the weight matrices of LLMs to 1-bit, paving the way for the extremely low bit-width deployment of LLMs. Experiments indicate that OneBit achieves good performance (at least 81% of the non-quantized performance on LLaMA models) with robust training processes.
arXiv Detail & Related papers (2024-02-17T14:26:57Z)
SparseByteNN: A Novel Mobile Inference Acceleration Framework Based on Fine-Grained Group Sparsity [10.89385369643021]
We present a novel mobile inference acceleration framework SparseByteNN. We show that for 30% sparse MobileNet-v1, SparseByteNN achieves 1.27x speedup over the dense version and 1.29x speedup over the state-of-the-art sparse inference engine MNN with a slight accuracy drop of 0.224%.
arXiv Detail & Related papers (2023-10-30T13:08:48Z)
Compacting Binary Neural Networks by Sparse Kernel Selection [58.84313343190488]
This paper is motivated by a previously revealed phenomenon that the binary kernels in successful BNNs are nearly power-law distributed. We develop the Permutation Straight-Through Estimator (PSTE) that is able to not only optimize the selection process end-to-end but also maintain the non-repetitive occupancy of selected codewords. Experiments verify that our method reduces both the model size and bit-wise computational costs, and achieves accuracy improvements compared with state-of-the-art BNNs under comparable budgets.
arXiv Detail & Related papers (2023-03-25T13:53:02Z)
Elastic-Link for Binarized Neural Network [9.83865304744923]
The "Elastic-Link" (EL) module enriches information flow within a BNN by adaptively adding real-valued input features to the subsequent convolutional output features. EL produces a significant improvement on the challenging large-scale ImageNet dataset. When integrated with ReActNet, it yields a new state-of-the-art result of 71.9% top-1 accuracy.
arXiv Detail & Related papers (2021-12-19T13:49:29Z)
FTBNN: Rethinking Non-linearity for 1-bit CNNs and Going Beyond [23.5996182207431]
We show that the binarized convolution process becomes increasingly linear as quantization error is minimized, which in turn hampers the BNN's discriminative ability. We re-investigate and tune the non-linear modules to resolve that contradiction, leading to a strong baseline that achieves state-of-the-art performance.
arXiv Detail & Related papers (2020-10-19T08:11:48Z)

This list is automatically generated from the titles and abstracts of the papers on this site.