Highly Efficient and Effective LLMs with Multi-Boolean Architectures
- URL: http://arxiv.org/abs/2505.22811v2
- Date: Fri, 03 Oct 2025 15:53:05 GMT
- Title: Highly Efficient and Effective LLMs with Multi-Boolean Architectures
- Authors: Ba-Hien Tran, Van Minh Nguyen,
- Abstract summary: Weight binarization has emerged as a promising strategy to reduce the complexity of large language models (LLMs)<n>Existing approaches fall into post-training binarization, which is simple but causes severe performance loss, and training-aware methods, which depend on full-precision latent weights, adding complexity and limiting efficiency.<n>We propose a novel framework that represents LLMs with multi- kernel Boolean parameters and, for the first time, enables direct finetuning LMMs in the Boolean domain, eliminating the need for latent weights.
- Score: 5.346271362401715
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Weight binarization has emerged as a promising strategy to reduce the complexity of large language models (LLMs). Existing approaches fall into post-training binarization, which is simple but causes severe performance loss, and training-aware methods, which depend on full-precision latent weights, adding complexity and limiting efficiency. We propose a novel framework that represents LLMs with multi-kernel Boolean parameters and, for the first time, enables direct finetuning LMMs in the Boolean domain, eliminating the need for latent weights. This enhances representational capacity and dramatically reduces complexity during both finetuning and inference. Extensive experiments across diverse LLMs show our method outperforms recent ultra low-bit quantization and binarization techniques.
Related papers
- Magic-MM-Embedding: Towards Visual-Token-Efficient Universal Multimodal Embedding with MLLMs [10.443777669301983]
Multimodal Large Language Models (MLLMs) have shown immense promise in universal multimodal retrieval.<n>But their practical application is often hindered by the substantial computational cost incurred from processing a large number of tokens from visual inputs.<n>We propose Magic-MM-Embedding, a series of novel models that achieve both high efficiency and state-of-the-art performance in universal multimodal embedding.
arXiv Detail & Related papers (2026-02-05T04:01:01Z) - Beyond Real Weights: Hypercomplex Representations for Stable Quantization [6.708338010963415]
Multimodal language models (MLLMs) require large parameter capacity to align high-dimensional visual features with linguistic representations.<n>We introduce a progressive re parameterization strategy that compresses these models by gradually replacing dense feed-forward network blocks.<n>A residual schedule, together with lightweight reconstruction and knowledge distillation losses, ensures that the PHM modules inherit the functional behavior of their dense counterparts during training.
arXiv Detail & Related papers (2025-12-09T12:10:57Z) - A Comprehensive Study on Visual Token Redundancy for Discrete Diffusion-based Multimodal Large Language Models [85.30893355216486]
We study how visual token redundancy evolves with different dMLLM architectures and tasks.<n>Our study reveals that visual redundancy emerges only in from-scratch dMLLMs while handling long-answer tasks.<n>Layer-skipping is promising for accelerating AR-to-diffusion dMLLMs, whereas progressive or late-step pruning is more effective for from-scratch dMLLMs.
arXiv Detail & Related papers (2025-11-19T04:13:36Z) - Shuffle-R1: Efficient RL framework for Multimodal Large Language Models via Data-centric Dynamic Shuffle [53.239242017802056]
Reinforcement learning (RL) has emerged as an effective post-training paradigm for enhancing the reasoning capabilities of multimodal large language model (MLLM)<n>However, current RL pipelines often suffer from training inefficiencies caused by two underexplored issues: Advantage Collapsing and Rollout Silencing.<n>We propose Shuffle-R1, a simple yet principled framework that improves RL fine-tuning efficiency by dynamically restructuring trajectory sampling and batch composition.
arXiv Detail & Related papers (2025-08-07T17:53:47Z) - Investigating Structural Pruning and Recovery Techniques for Compressing Multimodal Large Language Models: An Empirical Study [64.26593350748401]
Multimodal Large Language Models (MLLMs) demonstrate impressive capabilities.<n>Current parameter reduction techniques primarily involve training MLLMs from Small Language Models (SLMs)<n>We propose to directly compress existing MLLMs through structural pruning combined with efficient recovery training.
arXiv Detail & Related papers (2025-07-28T11:57:52Z) - Do We Truly Need So Many Samples? Multi-LLM Repeated Sampling Efficiently Scales Test-Time Compute [54.22256089592864]
This paper presents a simple, effective, and cost-efficient strategy to improve LLM performance by scaling test-time compute.<n>Our strategy builds upon the repeated-sampling-then-voting framework, with a novel twist: incorporating multiple models, even weaker ones, to leverage their complementary strengths.
arXiv Detail & Related papers (2025-04-01T13:13:43Z) - LESA: Learnable LLM Layer Scaling-Up [57.0510934286449]
Training Large Language Models (LLMs) from scratch requires immense computational resources, making it prohibitively expensive.<n>Model scaling-up offers a promising solution by leveraging the parameters of smaller models to create larger ones.<n>We propose textbfLESA, a novel learnable method for depth scaling-up.
arXiv Detail & Related papers (2025-02-19T14:58:48Z) - RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models [53.571195477043496]
We propose an algorithm named Rotated Straight-Through-Estimator (RoSTE)<n>RoSTE combines quantization-aware supervised fine-tuning (QA-SFT) with an adaptive rotation strategy to reduce activation outliers.<n>Our findings reveal that the prediction error is directly proportional to the quantization error of the converged weights, which can be effectively managed through an optimized rotation configuration.
arXiv Detail & Related papers (2025-02-13T06:44:33Z) - Progressive Binarization with Semi-Structured Pruning for LLMs [36.32239429974179]
We propose Progressive Binarization with Semi-Structured Pruning (PBS$2$P), a novel post-training compression framework.<n>We show that PBS$2$P consistently outperforms state-of-the-art binary post-training quantization methods in both perplexity and downstream accuracy.
arXiv Detail & Related papers (2025-02-03T13:30:29Z) - STBLLM: Breaking the 1-Bit Barrier with Structured Binary LLMs [28.70239743254508]
We present the first structural binarization method for LLM compression to less than 1-bit precision.
We observe that some weights in binarized LLMs can be randomly flipped without significant performance degradation.
Our approach performs better than other compressed binarization methods while significantly reducing memory requirements.
arXiv Detail & Related papers (2024-08-03T15:07:44Z) - Pruning Large Language Models with Semi-Structural Adaptive Sparse Training [17.381160429641316]
Adaptive Sparse Trainer (AST) is a novel and efficient retraining framework tailored for semi-structured sparse models.<n>AST reduces the perplexity and zero-shot accuracy gap between dense and 2:4 semi-structured sparse models to 0.6 and 1.16%, respectively.
arXiv Detail & Related papers (2024-07-30T06:33:44Z) - ScaleBiO: Scalable Bilevel Optimization for LLM Data Reweighting [26.931958430968024]
Bilevel optimization has shown its utility across various machine learning settings, yet most algorithms in practice require second-order information.<n>This paper introduces the first scalable instantiation of this paradigm called ScaleBiO, focusing on bilevel optimization for large-scale LLM data reweighting.<n>By combining with a recently proposed memory-efficient training technique called LISA, our novel algorithm allows the paradigm to scale to $sim$30B-sized LLMs on $8times$H100 GPU.
arXiv Detail & Related papers (2024-06-28T15:03:08Z) - Tender: Accelerating Large Language Models via Tensor Decomposition and Runtime Requantization [0.6445087473595953]
Large language models (LLMs) demonstrate outstanding performance in various tasks in machine learning.
deploying LLM inference poses challenges due to the high compute and memory requirements.
We present Tender, an algorithm-hardware co-design solution that enables efficient deployment of LLM inference at low precision.
arXiv Detail & Related papers (2024-06-16T09:51:55Z) - DB-LLM: Accurate Dual-Binarization for Efficient LLMs [83.70686728471547]
Large language models (LLMs) have significantly advanced the field of natural language processing.
Existing ultra-low-bit quantization always causes severe accuracy drops.
We propose a novel Dual-Binarization method for LLMs, namely DB-LLM.
arXiv Detail & Related papers (2024-02-19T09:04:30Z) - BiLLM: Pushing the Limit of Post-Training Quantization for LLMs [53.31402059062365]
BiLLM is a groundbreaking 1-bit post-training quantization scheme tailored for pretrained large language models.
It achieves for the first time high-accuracy inference (e.g. 8.41 perplexity on LLaMA2-70B) with only 1.08-bit weights across various LLMs families.
arXiv Detail & Related papers (2024-02-06T09:26:34Z) - CBQ: Cross-Block Quantization for Large Language Models [66.82132832702895]
Post-training quantization (PTQ) has played a key role in compressing large language models (LLMs) with ultra-low costs.<n>We propose CBQ, a cross-block reconstruction-based PTQ method for LLMs.<n> CBQ employs a cross-block dependency using a reconstruction scheme, establishing long-range dependencies across multiple blocks to minimize error accumulation.
arXiv Detail & Related papers (2023-12-13T07:56:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.