Beyond Real Weights: Hypercomplex Representations for Stable Quantization
- URL: http://arxiv.org/abs/2512.08524v1
- Date: Tue, 09 Dec 2025 12:10:57 GMT
- Title: Beyond Real Weights: Hypercomplex Representations for Stable Quantization
- Authors: Jawad Ibn Ahad, Maisha Rahman, Amrijit Biswas, Muhammad Rafsan Kabir, Robin Krambroeckers, Sifat Momen, Nabeel Mohammed, Shafin Rahman
- Abstract summary: Multimodal language models (MLLMs) require large parameter capacity to align high-dimensional visual features with linguistic representations. We introduce a progressive reparameterization strategy that compresses these models by gradually replacing dense feed-forward network blocks with compact Parameterized Hypercomplex Multiplication (PHM) layers. A residual schedule, together with lightweight reconstruction and knowledge distillation losses, ensures that the PHM modules inherit the functional behavior of their dense counterparts during training.
- Score: 6.708338010963415
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal language models (MLLMs) require large parameter capacity to align high-dimensional visual features with linguistic representations, making them computationally heavy and difficult to deploy efficiently. We introduce a progressive reparameterization strategy that compresses these models by gradually replacing dense feed-forward network blocks with compact Parameterized Hypercomplex Multiplication (PHM) layers. A residual interpolation schedule, together with lightweight reconstruction and knowledge distillation losses, ensures that the PHM modules inherit the functional behavior of their dense counterparts during training. This transition yields substantial parameter and FLOP reductions while preserving strong multimodal alignment, enabling faster inference without degrading output quality. We evaluate the approach on multiple vision-language models (VLMs). Our method maintains performance comparable to the base models while delivering significant reductions in model size and inference latency. Progressive PHM substitution thus offers an architecture-compatible path toward more efficient multimodal reasoning and complements existing low-bit quantization techniques.
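To make the mechanism concrete, the following is a minimal numerical sketch of a PHM weight and the residual interpolation between a dense layer and its PHM replacement. The Kronecker-sum construction is the standard PHM parameterization; the blending coefficient, its schedule, and all names below are illustrative assumptions, not the paper's exact training recipe, which also includes reconstruction and distillation losses.

```python
import numpy as np

rng = np.random.default_rng(0)

def phm_weight(A, S):
    """Build a (d_out, d_in) weight as a sum of Kronecker products (PHM).
    A: (n, n, n) 'rule' matrices, S: (n, d_out//n, d_in//n) factor matrices."""
    return sum(np.kron(A[i], S[i]) for i in range(A.shape[0]))

n, d_in, d_out = 4, 16, 32
A = rng.standard_normal((n, n, n))                    # 64 parameters
S = rng.standard_normal((n, d_out // n, d_in // n))   # 128 parameters
W_phm = phm_weight(A, S)                              # 192 params stand in for 512
W_dense = rng.standard_normal((d_out, d_in))          # dense FFN weight being replaced

# Residual interpolation: the layer output gradually shifts from the dense path
# to the PHM path as alpha is annealed from 0 to 1 (hypothetical schedule).
def blended_forward(x, alpha):
    return (1.0 - alpha) * (W_dense @ x) + alpha * (W_phm @ x)

x = rng.standard_normal(d_in)
for alpha in (0.0, 0.5, 1.0):
    print(f"alpha={alpha}: output norm {np.linalg.norm(blended_forward(x, alpha)):.3f}")
```

With n = 4, the PHM factors hold 192 parameters where the dense weight holds 512, which is where the parameter and FLOP savings come from once the dense path is dropped.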
Related papers
- Reconstructing Content via Collaborative Attention to Improve Multimodal Embedding Quality [59.651410243721045]
CoCoA is a Content reconstruction pre-training paradigm based on Collaborative Attention for multimodal embedding optimization. We introduce an EOS-based reconstruction task, encouraging the model to reconstruct the input from the corresponding <EOS> embeddings. Experiments on MMEB-V1 demonstrate that CoCoA built upon Qwen2-VL and Qwen2.5-VL significantly improves embedding quality.
arXiv Detail & Related papers (2026-03-02T05:34:45Z) - LLaVA-FA: Learning Fourier Approximation for Compressing Large Multimodal Models [23.184422544852108]
Large multimodal models (LMMs) have achieved impressive performance on various vision-language tasks. Existing compression methods often decouple low-rank decomposition and quantization, leading to compounded reconstruction errors. We propose LLaVA-FA, a novel efficient LMM that performs joint low-rank plus quantization approximation in the frequency domain.
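The summary gives only the high-level idea; a generic "low-rank plus quantized residual in the frequency domain" approximation can be sketched as below. The 2-D FFT, truncated SVD, rank, and 4-bit uniform quantizer are all assumptions for illustration, not LLaVA-FA's actual factorization.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256))     # a weight matrix to compress

# Move the weight into the frequency domain (2-D FFT chosen for illustration).
F = np.fft.fft2(W)

# Low-rank part: truncated SVD of the spectrum.
U, s, Vh = np.linalg.svd(F, full_matrices=False)
r = 32
F_lowrank = (U[:, :r] * s[:r]) @ Vh[:r]

# Quantized part: coarse symmetric quantization of the residual spectrum, so the
# low-rank and quantization errors are handled jointly rather than decoupled.
residual = F - F_lowrank
scale = np.abs(residual).max() / 7      # 4-bit-style grid
F_quant = np.round(residual / scale) * scale

W_hat = np.real(np.fft.ifft2(F_lowrank + F_quant))
print("relative error:", np.linalg.norm(W - W_hat) / np.linalg.norm(W))
```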
arXiv Detail & Related papers (2026-01-28T09:39:10Z) - A Comprehensive Study on Visual Token Redundancy for Discrete Diffusion-based Multimodal Large Language Models [85.30893355216486]
We study how visual token redundancy evolves with different dMLLM architectures and tasks. Our study reveals that visual redundancy emerges only in from-scratch dMLLMs when handling long-answer tasks. Layer-skipping is promising for accelerating AR-to-diffusion dMLLMs, whereas progressive or late-step pruning is more effective for from-scratch dMLLMs.
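The entry above is a study of redundancy rather than a new method; purely for illustration, the sketch below shows the kind of similarity-based visual-token pruning such acceleration strategies build on. Token counts, the scoring rule, and the keep ratio are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
visual_tokens = rng.standard_normal((196, 64))   # patch embeddings
text_state    = rng.standard_normal(64)          # pooled text/answer representation

# Score visual tokens by similarity to the current text state and keep the top 25%,
# dropping tokens judged redundant for the answer being generated.
scores = visual_tokens @ text_state
keep = np.argsort(-scores)[: int(0.25 * len(visual_tokens))]
pruned = visual_tokens[np.sort(keep)]            # preserve spatial order
print(pruned.shape)                              # (49, 64)
```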
arXiv Detail & Related papers (2025-11-19T04:13:36Z) - Highly Efficient and Effective LLMs with Multi-Boolean Architectures [5.346271362401715]
Weight binarization has emerged as a promising strategy to reduce the complexity of large language models (LLMs). Existing approaches fall into two categories: post-training binarization, which is simple but causes severe performance loss, and training-aware methods, which depend on full-precision latent weights, adding complexity and limiting efficiency. We propose a novel framework that represents LLMs with multi-kernel Boolean parameters and, for the first time, enables direct finetuning of LLMs in the Boolean domain, eliminating the need for latent weights.
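As a rough analogue of multi-Boolean weights, the sketch below greedily decomposes a real matrix into a few scaled sign matrices (classic multi-bit binarization). The paper's multi-kernel Boolean parameterization and Boolean-domain finetuning are not reproduced; this only illustrates why several Boolean components approximate real weights far better than one.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))

def multi_boolean(W, k=3):
    """Greedy decomposition W ~ sum_k alpha_k * B_k with B_k entries in {-1, +1}."""
    residual, terms = W.copy(), []
    for _ in range(k):
        B = np.where(residual >= 0, 1.0, -1.0)
        alpha = np.abs(residual).mean()   # least-squares optimal scale for a sign matrix
        terms.append((alpha, B))
        residual = residual - alpha * B
    return terms

for k in (1, 2, 3):
    W_hat = sum(a * B for a, B in multi_boolean(W, k))
    print(k, "components, relative error:",
          round(np.linalg.norm(W - W_hat) / np.linalg.norm(W), 3))
```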
arXiv Detail & Related papers (2025-05-28T19:40:34Z) - LatentLLM: Attention-Aware Joint Tensor Compression [50.33925662486034]
Large language models (LLMs) and large multi-modal models (LMMs) require a massive amount of computational and memory resources. We propose a new framework to convert such LLMs/LMMs into a reduced-dimension latent structure.
arXiv Detail & Related papers (2025-05-23T22:39:54Z) - Model Hemorrhage and the Robustness Limits of Large Language Models [119.46442117681147]
Large language models (LLMs) demonstrate strong performance across natural language processing tasks, yet undergo significant performance degradation when modified for deployment. We define this phenomenon as model hemorrhage: performance decline caused by parameter alterations and architectural changes.
arXiv Detail & Related papers (2025-03-31T10:16:03Z) - SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding [66.74446220401296]
We propose SynerGen-VL, a simple yet powerful encoder-free MLLM capable of both image understanding and generation. We introduce the token folding mechanism and the vision-expert-based progressive alignment pretraining strategy, which effectively support high-resolution image understanding. Our code and models shall be released.
arXiv Detail & Related papers (2024-12-12T18:59:26Z) - p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay [20.688382669309096]
p-MoD is an efficient MLLM architecture that significantly reduces training and inference costs while maintaining model performance. We adapt the MoD module with two novel designs: tanh-gated weight normalization (TanhNorm) and symmetric token reweighting (STRing).
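The summary names TanhNorm and STRing without details; the sketch below shows one plausible form of a tanh-gated mixture-of-depths step, where a bounded, zero-centred gate decides which tokens pass through the expensive block. The routing rule and names are assumptions for illustration, not p-MoD's actual design.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = 32
tokens = rng.standard_normal((16, hidden))       # (num_tokens, hidden)
router = rng.standard_normal(hidden) * 0.02      # router projection

def mod_layer(tokens, block, keep_ratio=0.5):
    gate = np.tanh(tokens @ router)              # bounded, zero-centred routing weight
    k = max(1, int(keep_ratio * len(tokens)))
    keep = np.argsort(-np.abs(gate))[:k]         # only these tokens enter the block
    out = tokens.copy()
    out[keep] = tokens[keep] + gate[keep, None] * block(tokens[keep])
    return out                                   # skipped tokens pass through unchanged

dummy_block = lambda x: x @ (rng.standard_normal((hidden, hidden)) * 0.1)
print(mod_layer(tokens, dummy_block).shape)      # (16, 32)
```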
arXiv Detail & Related papers (2024-12-05T18:58:03Z) - Pruning Large Language Models to Intra-module Low-rank Architecture with Transitional Activations [21.229296254354878]
We introduce a task-agnostic structured pruning approach coupled with a compact Transformer architecture design.
The proposed approach, named TransAct, reduces transitional activations inside multi-head attention (MHA) and multi-layer perceptron (MLP) modules.
Results verify the optimality of our approach at high compression with respect to both efficiency and performance.
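A minimal sketch of pruning an MLP's transitional activations, assuming a simple activation-magnitude criterion on calibration data; TransAct's actual importance metric and the attention-side pruning are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 64, 256
W_in  = rng.standard_normal((d_ff, d_model))    # MLP up-projection
W_out = rng.standard_normal((d_model, d_ff))    # MLP down-projection

# Rank intermediate channels by mean activation magnitude on calibration inputs
# and keep the strongest half, shrinking both projections consistently.
calib = rng.standard_normal((512, d_model))
acts = np.maximum(calib @ W_in.T, 0.0)          # (512, d_ff) ReLU activations
importance = np.abs(acts).mean(axis=0)

keep = np.argsort(-importance)[: d_ff // 2]
W_in_pruned, W_out_pruned = W_in[keep], W_out[:, keep]
print(W_in_pruned.shape, W_out_pruned.shape)    # (128, 64) (64, 128)
```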
arXiv Detail & Related papers (2024-07-08T07:45:38Z) - Optimization of geological carbon storage operations with multimodal latent dynamic model and deep reinforcement learning [1.8549313085249324]
This study introduces the multimodal latent dynamic (MLD) model, a deep learning framework for fast flow prediction and well control optimization in GCS.
Unlike existing models, the MLD supports diverse input modalities, allowing comprehensive data interactions.
The approach outperforms traditional methods, achieving the highest NPV while reducing computational resources by over 60%.
arXiv Detail & Related papers (2024-06-07T01:30:21Z) - Adaptive Multi-Scale Decomposition Framework for Time Series Forecasting [26.141054975797868]
We propose a novel Adaptive Multi-Scale Decomposition (AMD) framework for time series forecasting. Our framework decomposes time series into distinct temporal patterns at multiple scales, leveraging the Multi-Scale Decomposable Mixing (MDM) block. Our approach effectively models both temporal and channel dependencies and utilizes autocorrelation to refine multi-scale data integration.
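As a rough picture of multi-scale decomposition, the sketch below splits a series into components at several smoothing scales using moving averages; the window sizes are arbitrary and the MDM block's learned mixing is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(512)
series = np.sin(t / 20) + 0.3 * np.sin(t / 3) + 0.1 * rng.standard_normal(512)

def moving_average(x, w):
    """Same-length moving average with edge padding."""
    pad = np.pad(x, (w // 2, w - w // 2 - 1), mode="edge")
    return np.convolve(pad, np.ones(w) / w, mode="valid")

# Each component captures the pattern between two neighbouring smoothing levels;
# the final, smoothest level is the trend, so series ~ sum(components) + trend.
scales = [4, 16, 64]
smoothed = [series] + [moving_average(series, w) for w in scales]
components = [smoothed[i] - smoothed[i + 1] for i in range(len(scales))]
trend = smoothed[-1]
print([c.shape for c in components], trend.shape)
```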
arXiv Detail & Related papers (2024-06-06T05:27:33Z) - CBQ: Cross-Block Quantization for Large Language Models [66.82132832702895]
Post-training quantization (PTQ) has played a key role in compressing large language models (LLMs) with ultra-low costs. We propose CBQ, a cross-block reconstruction-based PTQ method for LLMs. CBQ employs a cross-block reconstruction scheme that establishes long-range dependencies across multiple blocks to minimize error accumulation.
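The sketch below illustrates why a cross-block view matters: weights in a chain of toy blocks are quantized independently, and the reconstruction error is measured over the whole span, where per-block errors compound. CBQ goes further and optimizes quantization parameters against such a cross-block objective; the quantizer and block sizes here are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(W, bits=4):
    """Symmetric uniform quantization of a weight matrix."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max() / qmax
    return np.clip(np.round(W / scale), -qmax - 1, qmax) * scale

blocks   = [rng.standard_normal((64, 64)) / 8 for _ in range(4)]   # toy block chain
q_blocks = [quantize(W) for W in blocks]

x = rng.standard_normal((32, 64))
full, quant = x, x
for W, Wq in zip(blocks, q_blocks):
    full, quant = full @ W.T, quant @ Wq.T      # error accumulates block by block

print("cross-block relative error:",
      np.linalg.norm(full - quant) / np.linalg.norm(full))
```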
arXiv Detail & Related papers (2023-12-13T07:56:27Z) - Normalizing Flows with Multi-Scale Autoregressive Priors [131.895570212956]
We introduce channel-wise dependencies in the latent space of normalizing flows through multi-scale autoregressive priors (mAR).
Our mAR prior for models with split coupling flow layers (mAR-SCF) can better capture dependencies in complex multimodal data.
We show that mAR-SCF allows for improved image generation quality, with gains in FID and Inception scores compared to state-of-the-art flow-based models.
arXiv Detail & Related papers (2020-04-08T09:07:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.