LAMP: Look-Ahead Mixed-Precision Inference of Large Language Models
- URL: http://arxiv.org/abs/2601.21623v1
- Date: Thu, 29 Jan 2026 12:26:00 GMT
- Title: LAMP: Look-Ahead Mixed-Precision Inference of Large Language Models
- Authors: Stanislav Budzinskiy, Marian Gloser, Tolunay Yilmaz, Ying Hong Tham, Yuanyi Lin, Wenyi Fang, Fan Wu, Philipp Petersen
- Abstract summary: This article addresses the floating-point computation of compositionally-rich functions, concentrating on transformer inference. We provide an adaptive strategy that selects a small subset of components of $g(\mathrm{x})$ to be computed more accurately while all other computations can be carried out with lower accuracy. We study the effectiveness of this algorithm numerically on GPT-2 models and demonstrate that even very low recomputation rates allow for improvements of up to two orders of magnitude in accuracy.
- Score: 2.845351470902218
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Mixed-precision computations are a hallmark of the current stage of AI, driving the progress in large language models towards efficient, locally deployable solutions. This article addresses the floating-point computation of compositionally-rich functions, concentrating on transformer inference. Based on the rounding error analysis of a composition $f(g(\mathrm{x}))$, we provide an adaptive strategy that selects a small subset of components of $g(\mathrm{x})$ to be computed more accurately while all other computations can be carried out with lower accuracy. We then explain how this strategy can be applied to different compositions within a transformer and illustrate its overall effect on transformer inference. We study the effectiveness of this algorithm numerically on GPT-2 models and demonstrate that already very low recomputation rates allow for improvements of up to two orders of magnitude in accuracy.
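As a concrete illustration of the adaptive strategy described in the abstract, below is a minimal NumPy sketch for a single inner map $g(\mathrm{x}) = W\mathrm{x}$. The function name `lookahead_matvec`, the float16/float64 format pair, the 1% recomputation rate, and the largest-magnitude selection rule are illustrative assumptions only; the paper derives its actual selection criterion from the rounding error analysis of the composition $f(g(\mathrm{x}))$.

```python
import numpy as np

def lookahead_matvec(W, x, recompute_rate=0.01):
    """Sketch: evaluate g(x) = W @ x mostly in low precision and recompute
    only a small subset of its components in high precision.

    The selection rule below (largest low-precision magnitude) is a
    placeholder for the paper's look-ahead criterion.
    """
    # Cheap pass: the whole matrix-vector product in float16.
    y_lo = (W.astype(np.float16) @ x.astype(np.float16)).astype(np.float64)

    # Select a small fraction of output components to recompute accurately.
    k = max(1, int(recompute_rate * y_lo.size))
    idx = np.argsort(np.abs(y_lo))[-k:]  # placeholder selection criterion

    # Accurate pass only for the selected rows (float64).
    y = y_lo.copy()
    y[idx] = W[idx].astype(np.float64) @ x.astype(np.float64)
    return y

# Example: a 768-dimensional layer with a 1% recomputation rate.
rng = np.random.default_rng(0)
W, x = rng.standard_normal((768, 768)), rng.standard_normal(768)
y_mixed = lookahead_matvec(W, x, recompute_rate=0.01)
```

The intent is only to show where the recomputation budget goes: every component receives a low-precision value, and the selected few are overwritten with a high-precision one before the outer function $f$ is applied.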
Related papers
- Rate-Distortion Optimization for Transformer Inference [1.5378391391800512]
Transformers achieve superior performance on many tasks, but impose heavy compute and memory requirements during inference. We introduce a principled rate-distortion-based framework for lossy compression that learns compact encodings that explicitly trade off against accuracy.
arXiv Detail & Related papers (2026-01-29T17:12:46Z) - Closing the Approximation Gap of Partial AUC Optimization: A Tale of Two Formulations [121.39938773554523]
The Area Under the ROC Curve (AUC) is a pivotal evaluation metric in real-world scenarios with both class imbalance and decision constraints. We present two simple instance-wise minimax reformulations to close the approximation gap of PAUC optimization. The resulting algorithms enjoy a linear per-iteration computational complexity w.r.t. the sample size and a convergence rate of $O(-2/3)$ for typical one-way and two-way PAUCs.
arXiv Detail & Related papers (2025-12-01T02:52:33Z) - Transformers Meet In-Context Learning: A Universal Approximation Theory [25.513848079509653]
We develop a universal approximation theory to elucidate how transformers enable in-context learning. For a general class of functions, we demonstrate how to construct a transformer that can predict based on a few noisy in-context examples.
arXiv Detail & Related papers (2025-06-05T16:12:51Z) - Mixed precision accumulation for neural network inference guided by componentwise forward error analysis [2.4374097382908477]
We propose a mathematically founded mixed precision accumulation strategy for inference of neural networks. Our strategy is based on a new componentwise forward error analysis that explains the propagation of errors in the forward pass of neural networks.
arXiv Detail & Related papers (2025-03-19T09:19:11Z) - Predictor-Corrector Enhanced Transformers with Exponential Moving Average Coefficient Learning [73.73967342609603]
We introduce a predictor-corrector learning framework to minimize truncation errors.
We also propose an exponential moving average-based coefficient learning method to strengthen our higher-order predictor.
Our model surpasses a robust 3.8B DeepNet by an average of 2.9 SacreBLEU, using only 1/3 of the parameters.
arXiv Detail & Related papers (2024-11-05T12:26:25Z) - Tender: Accelerating Large Language Models via Tensor Decomposition and Runtime Requantization [0.6445087473595953]
Large language models (LLMs) demonstrate outstanding performance in various tasks in machine learning.
However, deploying LLM inference poses challenges due to the high compute and memory requirements.
We present Tender, an algorithm-hardware co-design solution that enables efficient deployment of LLM inference at low precision.
arXiv Detail & Related papers (2024-06-16T09:51:55Z) - Limits of Transformer Language Models on Learning to Compose Algorithms [77.2443883991608]
We evaluate training LLaMA models and prompting GPT-4 and Gemini on four tasks that require learning a composition of several discrete sub-tasks.
Our results indicate that compositional learning in state-of-the-art Transformer language models is highly sample inefficient.
arXiv Detail & Related papers (2024-02-08T16:23:29Z) - Learning Unnormalized Statistical Models via Compositional Optimization [73.30514599338407]
Noise-contrastive estimation (NCE) has been proposed by formulating the objective as the logistic loss of the real data and the artificial noise.
In this paper, we study a direct approach for optimizing the negative log-likelihood of unnormalized models.
arXiv Detail & Related papers (2023-06-13T01:18:16Z) - Transformers as Statisticians: Provable In-Context Learning with In-Context Algorithm Selection [88.23337313766353]
This work first provides a comprehensive statistical theory for transformers to perform ICL.
We show that transformers can implement a broad class of standard machine learning algorithms in context.
A single transformer can adaptively select different base ICL algorithms.
arXiv Detail & Related papers (2023-06-07T17:59:31Z) - Quasi-parametric rates for Sparse Multivariate Functional Principal Components Analysis [0.0]
We show that the eigenelements can be expressed as the solution to an optimization problem.
We establish a minimax lower bound on the mean square reconstruction error of the eigenelement, which proves that the procedure has an optimal variance in the minimax sense.
arXiv Detail & Related papers (2022-12-19T13:17:57Z) - Square Root Bundle Adjustment for Large-Scale Reconstruction [56.44094187152862]
We propose a new formulation for the bundle adjustment problem which relies on nullspace marginalization of landmark variables by QR decomposition.
Our approach, which we call square root bundle adjustment, is algebraically equivalent to the commonly used Schur complement trick.
We show in real-world experiments with the BAL datasets that even in single precision the proposed solver achieves on average equally accurate solutions.
arXiv Detail & Related papers (2021-03-02T16:26:20Z) - Efficient Learning of Generative Models via Finite-Difference Score Matching [111.55998083406134]
We present a generic strategy to efficiently approximate any-order directional derivatives with finite differences.
Our approximation involves only function evaluations, which can be executed in parallel, and no gradient computations; a minimal central-difference sketch is given after this list.
arXiv Detail & Related papers (2020-07-07T10:05:01Z) - Finding the optimal cluster state configuration. Minimization of one-way quantum computation errors [0.0]
From all possible cluster state configurations, we choose those that give the smallest error.
We find the optimal strategy for the implementation of universal Gaussian computations with minimal errors.
arXiv Detail & Related papers (2020-03-20T10:58:14Z)
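The finite-difference score matching entry above replaces directional derivatives with function evaluations only. As a minimal illustration of that standard trick (not the paper's exact estimator), the sketch below uses textbook central differences; the step size `eps` and the quadratic test function are arbitrary choices.

```python
import numpy as np

def directional_derivatives_fd(f, x, v, eps=1e-3):
    """Central finite differences for the first and second directional
    derivatives of a scalar function f at x along direction v.

    Only three function evaluations are needed and no gradients; this is
    the textbook formula, not the paper's specific estimator.
    """
    f_plus = f(x + eps * v)
    f_minus = f(x - eps * v)
    f_center = f(x)
    first = (f_plus - f_minus) / (2.0 * eps)                 # ~ v . grad f(x)
    second = (f_plus - 2.0 * f_center + f_minus) / eps**2    # ~ v^T H(x) v
    return first, second

# Example: f(x) = ||x||^2 / 2, whose directional derivatives are known exactly.
x = np.array([1.0, -2.0, 0.5])
v = np.array([0.0, 1.0, 0.0])
print(directional_derivatives_fd(lambda z: 0.5 * z @ z, x, v))  # approx (-2.0, 1.0)
```

For this quadratic test function the central differences are exact up to rounding, so the printed values match the analytic directional derivatives.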