NLI: Non-uniform Linear Interpolation Approximation of Nonlinear Operations for Efficient LLMs Inference
- URL: http://arxiv.org/abs/2602.02988v1
- Date: Tue, 03 Feb 2026 01:47:58 GMT
- Title: NLI: Non-uniform Linear Interpolation Approximation of Nonlinear Operations for Efficient LLMs Inference
- Authors: Jiangyong Yu, Xiaomeng Han, Xing Hu, Chen Xu, Zhe Jiang, Dawei Yang
- Abstract summary: We propose a calibration-free, dynamic-programming-optimal framework called Non-uniform Linear Interpolation (NLI). NLI is capable of efficiently approximating a variety of nonlinear functions, enabling seamless integration into large language models. Hardware experiments demonstrate that the NLI Engine achieves more than 4x improvement in computational efficiency.
- Score: 17.605039499074074
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of tasks, but their deployment is often constrained by substantial memory footprints and computational costs. While prior work has achieved significant progress in compressing and accelerating linear layers, nonlinear layers, such as SiLU, RMSNorm, and Softmax, still heavily depend on high-precision floating-point operations. In this paper, we propose a calibration-free, dynamic-programming-optimal, and hardware-friendly framework called Non-uniform Linear Interpolation (NLI). NLI is capable of efficiently approximating a variety of nonlinear functions, enabling seamless integration into LLMs and other deep neural networks with almost no loss in accuracy. NLI recasts cutpoint selection as a dynamic-programming problem, achieving the globally minimal interpolation error in O(M×N²) time via Bellman's optimality principle. Based on the NLI algorithm, we also design and implement a plug-and-play universal nonlinear computation unit. Hardware experiments demonstrate that the NLI Engine achieves more than 4x improvement in computational efficiency compared to state-of-the-art designs.
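To make the dynamic-programming view of cutpoint selection concrete, the sketch below picks non-uniform breakpoints for a piecewise-linear SiLU approximation via a Bellman-style recursion over a dense candidate grid. This is a minimal illustration under assumed details: the candidate-grid construction, the squared-error segment cost, and all function names (`optimal_cutpoints`, `segment_cost`, `piecewise_eval`) are hypothetical rather than taken from the paper, and the naive triple loop here costs O(N·M²) over M candidates and N segments, whereas the abstract reports an O(M×N²) recursion.

```python
# Hypothetical sketch of DP-based non-uniform cutpoint selection for
# piecewise-linear approximation, in the spirit of the NLI abstract.
# Cost model, grid setup, and names are assumptions, not the paper's code.
import numpy as np

def segment_cost(xs, ys, i, j):
    """Squared interpolation error of one linear segment joining
    candidate points i and j, measured on the dense grid between them."""
    x0, x1, y0, y1 = xs[i], xs[j], ys[i], ys[j]
    t = (xs[i:j + 1] - x0) / (x1 - x0)
    approx = y0 + t * (y1 - y0)
    return float(np.sum((ys[i:j + 1] - approx) ** 2))

def optimal_cutpoints(f, lo, hi, num_segments, num_candidates=256):
    """Choose num_segments + 1 cutpoints from a dense candidate grid so that
    the total piecewise-linear interpolation error of f is minimized."""
    xs = np.linspace(lo, hi, num_candidates)
    ys = f(xs)
    M, N, INF = num_candidates, num_segments, float("inf")
    # dp[k][j]: best error covering candidates [0..j] with k segments ending at j.
    dp = np.full((N + 1, M), INF)
    parent = np.zeros((N + 1, M), dtype=int)
    dp[0][0] = 0.0
    for k in range(1, N + 1):
        for j in range(k, M):
            for i in range(k - 1, j):
                cand = dp[k - 1][i] + segment_cost(xs, ys, i, j)
                if cand < dp[k][j]:
                    dp[k][j], parent[k][j] = cand, i
    # Backtrack the optimal cutpoint indices from the last candidate.
    idx, k, j = [M - 1], N, M - 1
    while k > 0:
        j = parent[k][j]
        idx.append(j)
        k -= 1
    cuts = xs[np.array(idx[::-1])]
    return cuts, f(cuts)

def piecewise_eval(x, cuts, vals):
    """Evaluate the non-uniform piecewise-linear approximation at x."""
    return np.interp(x, cuts, vals)

if __name__ == "__main__":
    silu = lambda x: x / (1.0 + np.exp(-x))
    cuts, vals = optimal_cutpoints(silu, -8.0, 8.0, num_segments=16)
    test = np.linspace(-8.0, 8.0, 10_000)
    err = np.max(np.abs(piecewise_eval(test, cuts, vals) - silu(test)))
    print(f"max abs error with 16 non-uniform segments: {err:.2e}")
```

In a hardware setting, the selected cutpoints and the corresponding segment slopes and intercepts would typically be precomputed offline and stored in a small table, so that inference reduces to a range lookup plus one multiply-add per evaluation.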
Related papers
- Enforcing convex constraints in Graph Neural Networks [0.0]
In this paper, we introduce ProjNet, a Graph Neural Network framework which satisfies input-dependent constraints. We establish a convergence result for CAD and develop a neural-accelerated implementation capable of handling large-scale inputs efficiently.
arXiv Detail & Related papers (2025-10-13T10:10:46Z) - Rethinking Nonlinearity: Trainable Gaussian Mixture Modules for Modern Neural Architectures [0.9778425765923312]
We introduce a new class of differentiable modules that draw on the universal density approximation capability of Gaussian mixture models (GMMs). By relaxing probabilistic constraints, GMNM can be seamlessly integrated into diverse neural architectures and trained end-to-end. Our experiments demonstrate that GMNM is a powerful and flexible module for enhancing efficiency and accuracy across a wide range of machine learning applications.
arXiv Detail & Related papers (2025-10-08T05:20:34Z) - MesaNet: Sequence Modeling by Locally Optimal Test-Time Training [67.45211108321203]
We introduce a numerically stable, chunkwise parallelizable version of the recently proposed Mesa layer. We show that optimal test-time training enables reaching lower language modeling perplexity and higher downstream benchmark performance than previous RNNs.
arXiv Detail & Related papers (2025-06-05T16:50:23Z) - Efficient Large Language Model Inference with Neural Block Linearization [51.619870789584525]
We introduce Neural Block Linearization (NBL), a novel framework for accelerating transformer model inference. NBL replaces self-attention layers with linear approximations derived from Linear Minimum Mean Squared Error estimators. In experiments, NBL achieves notable computational speed-ups while preserving competitive accuracy.
arXiv Detail & Related papers (2025-05-27T12:01:43Z) - Langevin Multiplicative Weights Update with Applications in Polynomial Portfolio Management [14.310970006771717]
We show that the Langevin Multiplicative Weights Update (LMWU) algorithm provably converges to global minima rather than merely to local minima, and we provide a non-asymptotic convergence analysis.
arXiv Detail & Related papers (2025-02-26T15:13:08Z) - Progressive Mixed-Precision Decoding for Efficient LLM Inference [49.05448842542558]
We introduce Progressive Mixed-Precision Decoding (PMPD) to address the memory-boundedness of decoding. PMPD achieves a 1.4x-12.2x speedup in matrix-vector multiplications over fp16 models. Our approach delivers a throughput gain of 3.8x-8.0x over fp16 models and up to 1.54x over uniform quantization approaches.
arXiv Detail & Related papers (2024-10-17T11:46:33Z) - Short-Long Convolutions Help Hardware-Efficient Linear Attention to Focus on Long Sequences [60.489682735061415]
We propose CHELA, which replaces state space models with short-long convolutions and implements linear attention in a divide-and-conquer manner.
Our experiments on the Long Range Arena benchmark and language modeling tasks demonstrate the effectiveness of the proposed method.
arXiv Detail & Related papers (2024-06-12T12:12:38Z) - A Multi-Head Ensemble Multi-Task Learning Approach for Dynamical Computation Offloading [62.34538208323411]
We propose a multi-head ensemble multi-task learning (MEMTL) approach with a shared backbone and multiple prediction heads (PHs).
MEMTL outperforms benchmark methods in both the inference accuracy and mean square error without requiring additional training data.
arXiv Detail & Related papers (2023-09-02T11:01:16Z) - NN-LUT: Neural Approximation of Non-Linear Operations for Efficient Transformer Inference [9.329021390526124]
Non-linear operations such as GELU, Layer normalization, and Softmax are essential yet costly building blocks of Transformer models.
This paper proposes an accurate and hardware-friendly approximation framework for efficient Transformer inference.
arXiv Detail & Related papers (2021-12-03T23:06:57Z) - Neural Spectrahedra and Semidefinite Lifts: Global Convex Optimization of Polynomial Activation Neural Networks in Fully Polynomial-Time [31.94590517036704]
We develop exact convex optimization formulations for two-layer neural networks with second-degree polynomial activations.
We show that the semidefinite lift is exact, and therefore global optimization can be performed in time polynomial in the input dimension and sample size for all input data.
The proposed approach is significantly faster and obtains better test accuracy compared to the standard backpropagation procedure.
arXiv Detail & Related papers (2021-01-07T08:43:01Z) - Optimization-driven Machine Learning for Intelligent Reflecting Surfaces Assisted Wireless Networks [82.33619654835348]
Intelligent reflecting surfaces (IRSs) have been employed to reshape wireless channels by controlling the phase shifts of individual scattering elements.
Due to the large number of scattering elements, passive beamforming is typically challenged by high computational complexity.
In this article, we focus on machine learning (ML) approaches for performance optimization in IRS-assisted wireless networks.
arXiv Detail & Related papers (2020-08-29T08:39:43Z)