Highly Efficient and Effective LLMs with Multi-Boolean Architectures
- URL: http://arxiv.org/abs/2505.22811v1
- Date: Wed, 28 May 2025 19:40:34 GMT
- Title: Highly Efficient and Effective LLMs with Multi-Boolean Architectures
- Authors: Ba-Hien Tran, Van Minh Nguyen
- Abstract summary: Weight binarization has emerged as a promising strategy to drastically reduce the complexity of large language models (LLMs). We introduce a novel framework that transforms LLMs into multi-kernel Boolean parameters and, for the first time, finetunes them directly in the Boolean domain, eliminating the need for expensive latent weights. Our method outperforms recent ultra low-bit quantization and binarization methods.
- Score: 1.4195677954898822
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Weight binarization has emerged as a promising strategy to drastically reduce the complexity of large language models (LLMs). It is mainly classified into two approaches: post-training binarization and finetuning with training-aware binarization methods. The first approach, while having low complexity, leads to significant loss of information from the original LLMs, resulting in poor performance. The second approach, on the other hand, relies heavily on full-precision latent weights for gradient approximation of binary weights, which not only remains suboptimal but also introduces substantial complexity. In this paper, we introduce a novel framework that effectively transforms LLMs into multi-kernel Boolean parameters and, for the first time, finetunes them directly in the Boolean domain, eliminating the need for expensive latent weights. This significantly reduces complexity during both finetuning and inference. Through extensive and insightful experiments across a wide range of LLMs, we demonstrate that our method outperforms recent ultra low-bit quantization and binarization methods.
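The abstract does not spell out how the multi-kernel Boolean parameters are built or finetuned. As a rough, non-authoritative illustration of the general idea of representing a weight matrix with several {-1, +1} kernels, here is a minimal sketch using greedy residual binarization; the function names, the scalar per-kernel scales, and the greedy fitting are illustrative assumptions, not the authors' actual construction.

```python
import torch

def residual_binarize(weight: torch.Tensor, num_kernels: int = 3):
    """Approximate a real-valued weight matrix as a sum of scaled {-1, +1}
    kernels, W ~ sum_k alpha_k * B_k, by greedy residual fitting (illustrative)."""
    residual = weight.clone()
    kernels, scales = [], []
    for _ in range(num_kernels):
        b = torch.sign(residual)
        b[b == 0] = 1.0                    # keep every entry strictly +/-1
        alpha = (residual * b).mean()      # least-squares scale for a +/-1 kernel
        kernels.append(b.to(torch.int8))   # each kernel costs 1 bit per weight
        scales.append(alpha)
        residual = residual - alpha * b    # the next kernel fits what is left over
    return kernels, scales

def reconstruct(kernels, scales):
    return sum(a * k.float() for a, k in zip(kernels, scales))

if __name__ == "__main__":
    w = torch.randn(256, 256)
    kernels, scales = residual_binarize(w, num_kernels=3)
    rel_err = ((w - reconstruct(kernels, scales)).norm() / w.norm()).item()
    print(f"relative error with 3 Boolean kernels: {rel_err:.3f}")
```

Storing K such kernels plus K scalar scales costs roughly K bits per weight, which is the kind of budget the ultra low-bit methods compared in this paper operate under.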
Related papers
- Shuffle-R1: Efficient RL framework for Multimodal Large Language Models via Data-centric Dynamic Shuffle [53.239242017802056]
Reinforcement learning (RL) has emerged as an effective post-training paradigm for enhancing the reasoning capabilities of multimodal large language models (MLLMs). However, current RL pipelines often suffer from training inefficiencies caused by two underexplored issues: Advantage Collapsing and Rollout Silencing. We propose Shuffle-R1, a simple yet principled framework that improves RL fine-tuning efficiency by dynamically restructuring trajectory sampling and batch composition.
arXiv Detail & Related papers (2025-08-07T17:53:47Z)
- Investigating Structural Pruning and Recovery Techniques for Compressing Multimodal Large Language Models: An Empirical Study [64.26593350748401]
Multimodal Large Language Models (MLLMs) demonstrate impressive capabilities. Current parameter reduction techniques primarily involve training MLLMs from Small Language Models (SLMs). We propose to directly compress existing MLLMs through structural pruning combined with efficient recovery training (see the sketch after this entry).
arXiv Detail & Related papers (2025-07-28T11:57:52Z)
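The summary above gives no implementation details for the pruning-and-recovery pipeline. The sketch below is a generic, hedged illustration only: it drops the hidden channels of a feed-forward block by weight norm and then runs a short recovery step that distills the pruned block toward the original outputs. The helper name prune_ffn, the toy layer sizes, and the magnitude criterion are assumptions, not the paper's method.

```python
import torch
import torch.nn as nn

def prune_ffn(ffn_in: nn.Linear, ffn_out: nn.Linear, keep_ratio: float = 0.5):
    """Drop the hidden channels of a two-layer FFN with the smallest weight norm.
    Returns new, narrower Linear layers (illustrative structural pruning)."""
    hidden = ffn_in.out_features
    keep = max(1, int(hidden * keep_ratio))
    # Score each hidden channel by its incoming and outgoing weight norms.
    score = ffn_in.weight.norm(dim=1) + ffn_out.weight.norm(dim=0)
    idx = torch.topk(score, keep).indices.sort().values

    new_in = nn.Linear(ffn_in.in_features, keep)
    new_out = nn.Linear(keep, ffn_out.out_features)
    with torch.no_grad():
        new_in.weight.copy_(ffn_in.weight[idx])
        new_in.bias.copy_(ffn_in.bias[idx])
        new_out.weight.copy_(ffn_out.weight[:, idx])
        new_out.bias.copy_(ffn_out.bias)
    return new_in, new_out

# Toy "recovery training": distill the pruned FFN toward the original outputs.
torch.manual_seed(0)
fc1, fc2 = nn.Linear(64, 256), nn.Linear(256, 64)
p1, p2 = prune_ffn(fc1, fc2, keep_ratio=0.25)
opt = torch.optim.Adam(list(p1.parameters()) + list(p2.parameters()), lr=1e-3)
for _ in range(200):
    x = torch.randn(32, 64)
    with torch.no_grad():
        target = fc2(torch.relu(fc1(x)))
    loss = ((p2(torch.relu(p1(x))) - target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
print(f"recovery loss: {loss.item():.4f}")
```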
- LESA: Learnable LLM Layer Scaling-Up [57.0510934286449]
Training Large Language Models (LLMs) from scratch requires immense computational resources, making it prohibitively expensive. Model scaling-up offers a promising solution by leveraging the parameters of smaller models to create larger ones. We propose LESA, a novel learnable method for depth scaling-up.
arXiv Detail & Related papers (2025-02-19T14:58:48Z)
- RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models [53.571195477043496]
We propose an algorithm named Rotated Straight-Through-Estimator (RoSTE). RoSTE combines quantization-aware supervised fine-tuning (QA-SFT) with an adaptive rotation strategy to reduce activation outliers. Our findings reveal that the prediction error is directly proportional to the quantization error of the converged weights, which can be effectively managed through an optimized rotation configuration (see the sketch after this entry).
arXiv Detail & Related papers (2025-02-13T06:44:33Z)
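RoSTE's rotation strategy is not reproduced here; the sketch below only shows the generic straight-through estimator (STE) that quantization-aware fine-tuning builds on: the forward pass uses quantized weights while the backward pass treats quantization as the identity. The class names and the 2-bit symmetric quantizer are illustrative assumptions.

```python
import torch
import torch.nn as nn

class STEQuantize(torch.autograd.Function):
    """Symmetric k-bit weight quantization with a straight-through backward pass."""
    @staticmethod
    def forward(ctx, w, bits: int = 2):
        qmax = 2 ** (bits - 1) - 1
        scale = w.abs().max() / qmax
        return torch.clamp(torch.round(w / scale), -qmax, qmax) * scale

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through: pass the gradient as if quantization were the identity.
        return grad_output, None

class QuantLinear(nn.Linear):
    def forward(self, x):
        return nn.functional.linear(x, STEQuantize.apply(self.weight, 2), self.bias)

# Tiny fine-tuning loop: gradients update the full-precision latent weights,
# while the forward pass always uses their quantized version.
layer = QuantLinear(16, 16)
opt = torch.optim.SGD(layer.parameters(), lr=1e-2)
for _ in range(100):
    x = torch.randn(8, 16)
    loss = (layer(x) - x).pow(2).mean()   # toy objective: learn the identity map
    opt.zero_grad(); loss.backward(); opt.step()
print(f"final loss: {loss.item():.4f}")
```

Note that this scheme keeps full-precision latent weights around for the gradient step, which is exactly the overhead the multi-Boolean framework above aims to eliminate by finetuning directly in the Boolean domain.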
- Progressive Binarization with Semi-Structured Pruning for LLMs [36.32239429974179]
We propose Progressive Binarization with Semi-Structured Pruning (PBS²P), a novel post-training compression framework. We show that PBS²P consistently outperforms state-of-the-art binary post-training quantization methods in both perplexity and downstream accuracy.
arXiv Detail & Related papers (2025-02-03T13:30:29Z) - STBLLM: Breaking the 1-Bit Barrier with Structured Binary LLMs [28.70239743254508]
We present the first structural binarization method for LLM compression to less than 1-bit precision.
We observe that some weights in binarized LLMs can be randomly flipped without significant performance degradation.
Our approach performs better than other compressed binarization methods while significantly reducing memory requirements (see the sketch after this entry).
arXiv Detail & Related papers (2024-08-03T15:07:44Z)
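STBLLM's actual saliency measure and grouping are not reproduced here. As a hedged illustration of how structured sparsity can push the average bit-width below one bit, the sketch keeps the two largest-magnitude weights in every group of four (a 2:4 pattern) and binarizes only the survivors; the function name and the single shared scale are assumptions.

```python
import torch

def nm_binarize(w: torch.Tensor, n: int = 2, m: int = 4):
    """Keep the n largest-magnitude weights in every group of m consecutive
    weights, binarize the survivors to alpha * sign(w), and zero the rest."""
    rows, cols = w.shape
    assert cols % m == 0, "columns must be divisible by the group size"
    groups = w.reshape(rows, cols // m, m)
    topk = groups.abs().topk(n, dim=-1).indices
    mask = torch.zeros_like(groups, dtype=torch.bool).scatter_(-1, topk, True)
    mask = mask.reshape(rows, cols)
    alpha = (w.abs() * mask).sum() / mask.sum()   # one shared scale for kept weights
    return alpha * torch.sign(w) * mask           # values in {-alpha, 0, +alpha}

w = torch.randn(128, 128)
wq = nm_binarize(w)
kept = (wq != 0).float().mean().item()
rel_err = ((w - wq).norm() / w.norm()).item()
print(f"kept fraction: {kept:.2f}, relative error: {rel_err:.3f}")
```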
- ScaleBiO: Scalable Bilevel Optimization for LLM Data Reweighting [26.931958430968024]
Bilevel optimization has shown its utility across various machine learning settings, yet most algorithms in practice require second-order information. This paper introduces the first scalable instantiation of this paradigm, called ScaleBiO, focusing on bilevel optimization for large-scale LLM data reweighting. By combining with a recently proposed memory-efficient training technique called LISA, our novel algorithm allows the paradigm to scale to ~30B-sized LLMs on 8×H100 GPUs.
arXiv Detail & Related papers (2024-06-28T15:03:08Z)
- DB-LLM: Accurate Dual-Binarization for Efficient LLMs [83.70686728471547]
Large language models (LLMs) have significantly advanced the field of natural language processing.
Existing ultra-low-bit quantization always causes severe accuracy drops.
We propose a novel Dual-Binarization method for LLMs, namely DB-LLM.
arXiv Detail & Related papers (2024-02-19T09:04:30Z)
- BiLLM: Pushing the Limit of Post-Training Quantization for LLMs [53.31402059062365]
BiLLM is a groundbreaking 1-bit post-training quantization scheme tailored for pretrained large language models.
It achieves for the first time high-accuracy inference (e.g. 8.41 perplexity on LLaMA2-70B) with only 1.08-bit weights across various LLM families (see the sketch after this entry).
arXiv Detail & Related papers (2024-02-06T09:26:34Z)
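BiLLM pushes 1-bit post-training quantization by treating a small set of salient weights more carefully than the rest. The sketch below illustrates only that general salient/non-salient split, substituting a plain column-magnitude criterion and whole-matrix scales; it is an assumption-laden stand-in rather than BiLLM's actual procedure.

```python
import torch

def sign_scale(x: torch.Tensor):
    """Best single scalar scale for a +/-1 binarization of x."""
    return x.abs().mean() * torch.sign(x)

def salient_split_binarize(w: torch.Tensor, salient_frac: float = 0.1):
    """Binarize salient columns with a second residual pass, the rest with one.
    Saliency here is plain column norm -- a stand-in, not BiLLM's criterion."""
    n_salient = max(1, int(w.shape[1] * salient_frac))
    salient_idx = w.norm(dim=0).topk(n_salient).indices
    mask = torch.zeros(w.shape[1], dtype=torch.bool)
    mask[salient_idx] = True

    wq = torch.zeros_like(w)
    wq[:, ~mask] = sign_scale(w[:, ~mask])           # non-salient: one binary kernel
    first = sign_scale(w[:, mask])                   # salient: kernel + residual kernel
    wq[:, mask] = first + sign_scale(w[:, mask] - first)
    return wq

w = torch.randn(256, 256)
wq = salient_split_binarize(w)
print(f"relative error: {((w - wq).norm() / w.norm()).item():.3f}")
```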
- CBQ: Cross-Block Quantization for Large Language Models [66.82132832702895]
Post-training quantization (PTQ) has played a key role in compressing large language models (LLMs) with ultra-low costs. We propose CBQ, a cross-block reconstruction-based PTQ method for LLMs. CBQ establishes cross-block dependencies through a reconstruction scheme, building long-range dependencies across multiple blocks to minimize error accumulation (see the sketch after this entry).
arXiv Detail & Related papers (2023-12-13T07:56:27Z)
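CBQ's full recipe is not reproduced here. The sketch below only illustrates the cross-block reconstruction idea: freeze round-to-nearest quantized weights, then tune their scales so that the output of a whole span of blocks matches the full-precision output on calibration data, rather than matching each layer in isolation. The FakeQuantLinear class, the 3-bit setting, and the toy blocks are assumptions.

```python
import torch
import torch.nn as nn

class FakeQuantLinear(nn.Module):
    """Frozen full-precision weights, a learnable per-row scale, 3-bit rounding.
    Rounding uses a straight-through trick so the scale stays trainable."""
    def __init__(self, linear: nn.Linear, bits: int = 3):
        super().__init__()
        self.w = linear.weight.detach()
        self.b = linear.bias.detach()
        self.qmax = 2 ** (bits - 1) - 1
        self.scale = nn.Parameter(self.w.abs().amax(dim=1, keepdim=True) / self.qmax)

    def forward(self, x):
        w_s = self.w / self.scale
        q = w_s + (torch.round(w_s) - w_s).detach()          # round with STE gradient
        q = torch.clamp(q, -self.qmax, self.qmax)
        return nn.functional.linear(x, q * self.scale, self.b)

torch.manual_seed(0)
fp_blocks = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU())
q_blocks = nn.Sequential(FakeQuantLinear(fp_blocks[0]), nn.ReLU(),
                         FakeQuantLinear(fp_blocks[2]), nn.ReLU())

# Cross-block reconstruction: match the output of the whole span of blocks,
# not each layer in isolation, so quantization error cannot quietly accumulate.
opt = torch.optim.Adam(q_blocks.parameters(), lr=1e-3)
for _ in range(300):
    x = torch.randn(32, 64)                # stand-in calibration batch
    with torch.no_grad():
        target = fp_blocks(x)
    loss = (q_blocks(x) - target).pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
print(f"reconstruction loss over the block span: {loss.item():.5f}")
```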