More Than Bits: Multi-Envelope Double Binary Factorization for Extreme Quantization
- URL: http://arxiv.org/abs/2512.24545v1
- Date: Wed, 31 Dec 2025 01:04:34 GMT
- Title: More Than Bits: Multi-Envelope Double Binary Factorization for Extreme Quantization
- Authors: Yuma Ichikawa, Yoshihiko Fujisawa, Yudai Fujimoto, Akira Sakai, Katsuki Fujisawa,
- Abstract summary: We propose Multi-envelope DBF (MDBF), which retains a shared pair of 1-bit sign bases but replaces the single envelope with a rank-$l$ envelope.<n>MDBF effectively maintains a binary carrier and utilizes the limited memory budget for magnitude.<n>Across the LLaMA and Qwen families, MDBF enhances perplexity and zero-shot accuracy over previous binary formats at matched bits per weight.
- Score: 5.790458475928127
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: For extreme low-bit quantization of large language models (LLMs), Double Binary Factorization (DBF) is attractive as it enables efficient inference without sacrificing accuracy. However, the scaling parameters of DBF are too restrictive; after factoring out signs, all rank components share the same magnitude profile, resulting in performance saturation. We propose Multi-envelope DBF (MDBF), which retains a shared pair of 1-bit sign bases but replaces the single envelope with a rank-$l$ envelope. By sharing sign matrices among envelope components, MDBF effectively maintains a binary carrier and utilizes the limited memory budget for magnitude expressiveness. We also introduce a closed-form initialization and an alternating refinement method to optimize MDBF. Across the LLaMA and Qwen families, MDBF enhances perplexity and zero-shot accuracy over previous binary formats at matched bits per weight while preserving the same deployment-friendly inference primitive.
Related papers
- SFMP: Fine-Grained, Hardware-Friendly and Search-Free Mixed-Precision Quantization for Large Language Models [4.269807933198402]
Mixed-precision quantization is a promising approach for compressing large language models under tight memory budgets.<n>We propose SFMP, a search-free and hardware-friendly mixed-precision quantization framework for large language models.
arXiv Detail & Related papers (2026-02-01T05:24:19Z) - DP-FEDSOFIM: Differentially Private Federated Stochastic Optimization using Regularized Fisher Information Matrix [0.0611737116137921]
Differentially private federated learning (DP-FL) suffers from slow convergence under tight privacy budgets due to the overwhelming noise introduced to preserve privacy.<n>We propose DP-FedSOFIM, a server-side second-order optimization framework that leverages the Fisher Information Matrix (FIM) as a natural preconditioner while requiring only O(d) memory per client.<n>Our analysis proves that the server-side preconditioning preserves (epsilon, delta)-differential privacy through the post-processing theorem.
arXiv Detail & Related papers (2026-01-14T05:11:28Z) - From Next-Token to Next-Block: A Principled Adaptation Path for Diffusion LLMs [58.640039233470766]
We show that principled AR-to-block-diffusion adaptation is an effective and compute-efficient alternative to training DLMs from scratch.<n> NBDiff-7B (Base and Instruct) could inherit the long-context modeling and reasoning capabilities, and achieve state-of-the-art performance.
arXiv Detail & Related papers (2025-12-07T10:28:21Z) - Memory- and Latency-Constrained Inference of Large Language Models via Adaptive Split Computing [8.705453442427585]
Large language models (LLMs) have achieved near-human performance across diverse reasoning tasks.<n>Their deployment on resource-constrained Internet-of-Things (IoT) devices remains impractical due to massive parameter footprints and memory-intensive autoregressive decoding.<n>This work introduces the first autoregressive-aware split computing framework designed explicitly for LLM deployment on edge devices.
arXiv Detail & Related papers (2025-11-06T02:55:07Z) - Efficient and Privacy-Preserving Binary Dot Product via Multi-Party Computation [4.336006969179338]
This paper proposes a novel binary multi-party computation (BiMPC) framework for bitwise operations.<n>The core of BiMPC is a novel approach called Dot Product via Modular Addition (DoMA), which uses regular and modular additions for efficient binary dot product calculation.<n>The privacy guarantees of the BiMPC framework are rigorously analyzed, demonstrating its efficiency and scalability in distributed settings.
arXiv Detail & Related papers (2025-10-18T03:35:42Z) - MPQ-DMv2: Flexible Residual Mixed Precision Quantization for Low-Bit Diffusion Models with Temporal Distillation [74.34220141721231]
We present MPQ-DMv2, an improved textbfMixed textbfPrecision textbfQuantization framework for extremely low-bit textbfDiffusion textbfModels.
arXiv Detail & Related papers (2025-07-06T08:16:50Z) - Addition is almost all you need: Compressing neural networks with double binary factorization [0.0]
Double Binary Factorization (DBF) is a novel method that factorizes dense weight matrices into products of two binary (sign) matrices, each accompanied by scaling vectors.<n>DBF preserves the efficiency advantages of binary representations while achieving compression rates that are competitive with or superior to state-of-the-art methods.<n>In a 2-bit per weight range, DBF is competitive with the best quantization methods like QuIP# and QTIP.
arXiv Detail & Related papers (2025-05-16T10:07:36Z) - BiMaCoSR: Binary One-Step Diffusion Model Leveraging Flexible Matrix Compression for Real Super-Resolution [63.777210548110425]
We propose BiMaCoSR, which combines binarization and one-step distillation to obtain extreme compression and acceleration.<n>BiMaCoSR achieves a 23.8x compression ratio and a 27.4x speedup ratio compared to FP counterpart.
arXiv Detail & Related papers (2025-02-01T06:34:55Z) - Progressive Mixed-Precision Decoding for Efficient LLM Inference [49.05448842542558]
We introduce Progressive Mixed-Precision Decoding (PMPD) to address the memory-boundedness of decoding.<n>PMPD achieves 1.4$-$12.2$times$ speedup in matrix-vector multiplications over fp16 models.<n>Our approach delivers a throughput gain of 3.8$-$8.0$times$ over fp16 models and up to 1.54$times$ over uniform quantization approaches.
arXiv Detail & Related papers (2024-10-17T11:46:33Z) - DB-LLM: Accurate Dual-Binarization for Efficient LLMs [83.70686728471547]
Large language models (LLMs) have significantly advanced the field of natural language processing.
Existing ultra-low-bit quantization always causes severe accuracy drops.
We propose a novel Dual-Binarization method for LLMs, namely DB-LLM.
arXiv Detail & Related papers (2024-02-19T09:04:30Z) - Improving Misaligned Multi-modality Image Fusion with One-stage
Progressive Dense Registration [67.23451452670282]
Misalignments between multi-modality images pose challenges in image fusion.
We propose a Cross-modality Multi-scale Progressive Dense Registration scheme.
This scheme accomplishes the coarse-to-fine registration exclusively using a one-stage optimization.
arXiv Detail & Related papers (2023-08-22T03:46:24Z) - Towards A Unified View of Sparse Feed-Forward Network in Pretraining
Large Language Model [58.9100867327305]
Large and sparse feed-forward layers (S-FFN) have proven effective in scaling up Transformers model size for textitpretraining large language models.
We analyzed two major design choices of S-FFN: the memory block (a.k.a. expert) size and the memory block selection method.
We found a simpler selection method -- textbftextttAvg-K that selects blocks through their mean aggregated hidden states, achieving lower perplexity in language model pretraining.
arXiv Detail & Related papers (2023-05-23T12:28:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.