Related papers: Efficient Large Language Model Inference with Neural Block Linearization

Efficient Large Language Model Inference with Neural Block Linearization

URL: http://arxiv.org/abs/2505.21077v2
Date: Sun, 19 Oct 2025 21:33:03 GMT
Title: Efficient Large Language Model Inference with Neural Block Linearization
Authors: Mete Erdogan, Francesco Tonin, Volkan Cevher,
Abstract summary: We introduce Neural Block Linearization (NBL), a novel framework for accelerating transformer model inference.<n>NBL replaces self-attention layers with linear approximations derived from Linear Minimum Mean Squared Error estimators.<n>In experiments, NBL achieves notable computational speed-ups while preserving competitive accuracy.
Score: 51.619870789584525
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The high inference demands of transformer-based Large Language Models (LLMs) pose substantial challenges in their deployment. To this end, we introduce Neural Block Linearization (NBL), a novel framework for accelerating transformer model inference by replacing self-attention layers with linear approximations derived from Linear Minimum Mean Squared Error estimators. NBL leverages Canonical Correlation Analysis to compute a theoretical upper bound on the approximation error. Then, we use this bound as a criterion for substitution, selecting the LLM layers with the lowest linearization error. NBL can be efficiently applied to pre-trained LLMs without the need for fine-tuning. In experiments, NBL achieves notable computational speed-ups while preserving competitive accuracy on multiple reasoning benchmarks. For instance, applying NBL to 12 self-attention layers in DeepSeek-R1-Distill-Llama-8B increases the inference speed by 32% with less than 1% accuracy trade-off, making it a flexible and promising solution to improve the inference efficiency of LLMs. The implementation is available at: https://github.com/LIONS-EPFL/NBL.

Related papers

$\ abla$-Reasoner: LLM Reasoning via Test-Time Gradient Descent in Latent Space [71.23672814629448]
$nabla$-Reasoner is an iterative generation framework that integrates differentiable optimization over token logits into the decoding loop.<n>$nabla$-Reasoner achieves over 20% accuracy improvement on a challenging mathematical reasoning benchmark.
arXiv Detail & Related papers (2026-03-05T08:42:54Z)
NLI:Non-uniform Linear Interpolation Approximation of Nonlinear Operations for Efficient LLMs Inference [17.605039499074074]
We propose a calibration-free, dynamic-programming-optimal framework called Non-uniform Linear Interpolation (NLI)<n>NLI is capable of efficiently approximating a variety of nonlinear functions, enabling seamless integration into large language models.<n> Hardware experiments demonstrate that the NLI Engine achieves more than 4x improvement in computational efficiency.
arXiv Detail & Related papers (2026-02-03T01:47:58Z)
FLRQ: Faster LLM Quantization with Flexible Low-Rank Matrix Sketching [4.01326804806241]
We introduce Rank1-Sketch-based Flexible Rank Selection (R1-FLR) and Best Low-rank Approximation under Clipping (BLC)<n>R1-FLR applies the R1-Sketch with Gaussian projection for the fast low-rank approximation, enabling outlier-aware rank extraction for each layer.<n>BLC aims at minimizing the low-rank quantization error under the scaling and clipping strategy.
arXiv Detail & Related papers (2026-01-09T10:06:45Z)
Boundary-Guided Policy Optimization for Memory-efficient RL of Diffusion Large Language Models [53.339700196282905]
A key challenge in applying reinforcement learning to large language models (dLLMs) is the intractability of their likelihood functions.<n>We propose a memory-efficient RL algorithm that maximizes a specially constructed lower bound of the ELBO-based objective.<n> Experiments show that BGPO significantly outperforms previous RL algorithms for dLLMs in math problem solving, code generation, and planning tasks.
arXiv Detail & Related papers (2025-10-13T17:47:50Z)
ImCoref-CeS: An Improved Lightweight Pipeline for Coreference Resolution with LLM-based Checker-Splitter Refinement [45.01372641622595]
We present ImCoref-CeS, a novel framework that integrates an enhanced supervised model with Large Language Models (LLMs)-based reasoning.<n>First, we present an improved CR method (textbfImCoref) to push the performance boundaries of the supervised neural method.<n>We employ an LLM acting as a multi-role Checker-Splitter agent to validate candidate mentions and coreference results.
arXiv Detail & Related papers (2025-10-11T14:48:08Z)
LESA: Learnable LLM Layer Scaling-Up [57.0510934286449]
Training Large Language Models (LLMs) from scratch requires immense computational resources, making it prohibitively expensive.<n>Model scaling-up offers a promising solution by leveraging the parameters of smaller models to create larger ones.<n>We propose textbfLESA, a novel learnable method for depth scaling-up.
arXiv Detail & Related papers (2025-02-19T14:58:48Z)
RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models [53.571195477043496]
We propose an algorithm named Rotated Straight-Through-Estimator (RoSTE)<n>RoSTE combines quantization-aware supervised fine-tuning (QA-SFT) with an adaptive rotation strategy to reduce activation outliers.<n>Our findings reveal that the prediction error is directly proportional to the quantization error of the converged weights, which can be effectively managed through an optimized rotation configuration.
arXiv Detail & Related papers (2025-02-13T06:44:33Z)
Rational Tuning of LLM Cascades via Probabilistic Modeling [0.9208007322096532]
We present a probabilistic model for the joint performance distribution of a sequence of large language models (LLMs)<n>Compared to selecting confidence thresholds using grid search, our model significantly improves runtime scaling with respect to the length of the cascade and the desired resolution of the cost-error curve.
arXiv Detail & Related papers (2025-01-16T07:58:33Z)
Pushing the Limits of Large Language Model Quantization via the Linearity Theorem [71.3332971315821]
We present a "line theoremarity" establishing a direct relationship between the layer-wise $ell$ reconstruction error and the model perplexity increase due to quantization. This insight enables two novel applications: (1) a simple data-free LLM quantization method using Hadamard rotations and MSE-optimal grids, dubbed HIGGS, and (2) an optimal solution to the problem of finding non-uniform per-layer quantization levels.
arXiv Detail & Related papers (2024-11-26T15:35:44Z)
Attribute Controlled Fine-tuning for Large Language Models: A Case Study on Detoxification [76.14641982122696]
We propose a constraint learning schema for fine-tuning Large Language Models (LLMs) with attribute control. We show that our approach leads to an LLM that produces fewer inappropriate responses while achieving competitive performance on benchmarks and a toxicity detection task.
arXiv Detail & Related papers (2024-10-07T23:38:58Z)
A least distance estimator for a multivariate regression model using deep neural networks [1.8876415010297893]
(A)GDNN-LD estimator enjoys variable selection and model estimation simultaneously, by applying the (adaptive) group Lasso penalty to weight parameters in the DNN structure. For the computation, we propose a quadratic smoothing approximation method to facilitate optimizing the non-smooth objective function based on the least distance loss.
arXiv Detail & Related papers (2024-01-06T04:36:00Z)
Hyperparameter Estimation for Sparse Bayesian Learning Models [1.0172874946490507]
Aparse Bayesian Learning (SBL) models are extensively used in signal processing and machine learning for promoting sparsity through hierarchical priors. This paper presents a framework for the improvement of SBL models for various objective functions. A novel algorithm is introduced showing enhanced efficiency, especially under signal noise ratios.
arXiv Detail & Related papers (2024-01-04T21:24:01Z)
SHOT: Suppressing the Hessian along the Optimization Trajectory for Gradient-Based Meta-Learning [28.26143547479141]
We introduce an algorithm called SHOT (Suppressing the Hessian along the Optimization Trajectory) SHOT does not increase the computational complexity of the baseline model much. We confirm our hypothesis empirically and demonstrate that SHOT outperforms the corresponding baseline.
arXiv Detail & Related papers (2023-10-04T11:43:08Z)
Belief Propagation Reloaded: Learning BP-Layers for Labeling Problems [83.98774574197613]
We take one of the simplest inference methods, a truncated max-product Belief propagation, and add what is necessary to make it a proper component of a deep learning model. This BP-Layer can be used as the final or an intermediate block in convolutional neural networks (CNNs) The model is applicable to a range of dense prediction problems, is well-trainable and provides parameter-efficient and robust solutions in stereo, optical flow and semantic segmentation.
arXiv Detail & Related papers (2020-03-13T13:11:35Z)

This list is automatically generated from the titles and abstracts of the papers in this site.