DP-LLM: Runtime Model Adaptation with Dynamic Layer-wise Precision Assignment
- URL: http://arxiv.org/abs/2508.06041v1
- Date: Fri, 08 Aug 2025 05:57:04 GMT
- Title: DP-LLM: Runtime Model Adaptation with Dynamic Layer-wise Precision Assignment
- Authors: Sangwoo Kwon, Seong Hoon Seo, Jae W. Lee, Yeonhong Park
- Abstract summary: DP-LLM is a novel mechanism that dynamically assigns precision to each layer based on input values. We show that DP-LLM achieves a superior performance-latency trade-off, outperforming prior approaches.
- Score: 4.048600072986694
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: How can we effectively handle queries for on-device large language models (LLMs) with varying runtime constraints, such as latency and accuracy? Multi-scale quantization addresses this challenge by enabling memory-efficient runtime model adaptation of LLMs through the overlaying of multiple model variants quantized to different bitwidths. Meanwhile, an important question remains open: how can models be properly configured to match a target precision or latency? While mixed-precision offers a promising solution, we take this further by leveraging the key observation that the sensitivity of each layer dynamically changes across decoding iterations. Building on this insight, we introduce DP-LLM, a novel mechanism that dynamically assigns precision to each layer based on input values. DP-LLM augments each linear layer in an LLM with a precision selector that determines the bitwidth at runtime using a lightweight error estimator and threshold values learned through fine-tuning. Experimental results across multiple models and benchmarks demonstrate that DP-LLM achieves a superior performance-latency trade-off, outperforming prior approaches.
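The per-layer precision selector described in the abstract can be sketched as follows. This is a minimal illustration under assumed details: the class and function names, the error proxy, and the fixed threshold values below are hypothetical and do not reproduce the authors' implementation (in DP-LLM the thresholds are learned through fine-tuning). The sketch keeps several pre-quantized weight variants overlaid per layer and, at each forward pass, escalates from the lowest bitwidth until the estimated quantization error falls below the corresponding threshold.

```python
# Hypothetical sketch of a runtime, input-dependent precision selector
# for one linear layer; not the DP-LLM reference implementation.
import torch
import torch.nn as nn


def fake_quantize(w: torch.Tensor, bits: int) -> torch.Tensor:
    """Symmetric per-tensor fake quantization to the given bitwidth."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    return torch.round(w / scale).clamp(-qmax - 1, qmax) * scale


class PrecisionAdaptiveLinear(nn.Module):
    """Linear layer that picks a bitwidth per forward pass from the input."""

    def __init__(self, in_features: int, out_features: int,
                 bitwidths=(3, 4, 8), thresholds=(0.05, 0.15)):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.bitwidths = sorted(bitwidths)   # low -> high precision
        # Placeholder thresholds; DP-LLM learns these via fine-tuning.
        self.thresholds = thresholds
        # Pre-quantized weight variants, overlaid as in multi-scale quantization.
        self.register_buffer(
            "w_q", torch.stack([fake_quantize(self.weight.detach(), b)
                                for b in self.bitwidths]))

    def estimate_error(self, x: torch.Tensor, idx: int) -> float:
        """Cheap proxy for output error: input magnitude times weight quant error."""
        w_err = (self.w_q[idx] - self.weight).norm() / self.weight.norm()
        return (x.norm(dim=-1).mean() * w_err).item()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Start at the lowest bitwidth and escalate while the estimated
        # error for the current input exceeds the threshold for that level.
        idx = 0
        while idx < len(self.thresholds) and \
                self.estimate_error(x, idx) > self.thresholds[idx]:
            idx += 1
        return x @ self.w_q[idx].t()


layer = PrecisionAdaptiveLinear(64, 64)
y = layer(torch.randn(2, 10, 64))
print(y.shape)  # torch.Size([2, 10, 64])
```

The escalate-from-low-precision loop reflects the paper's observation that layer sensitivity varies across decoding iterations: inputs that induce little quantization error can be served at low bitwidths, while sensitive inputs fall back to higher precision.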
Related papers
- LSGQuant: Layer-Sensitivity Guided Quantization for One-Step Diffusion Real-World Video Super-Resolution [52.627063566555194]
We introduce LSGQuant, a layer-sensitivity guided quantization approach for one-step diffusion-based real-world VSR. Our method incorporates a Dynamic Range Adaptive Quantizer (DRAQ) to fit video token activations. It achieves performance nearly on par with the original full-precision model and significantly exceeds existing quantization techniques.
arXiv Detail & Related papers (2026-02-03T06:53:19Z) - AdaSD: Adaptive Speculative Decoding for Efficient Language Model Inference [1.1852406625172216]
We propose Adaptive Speculative Decoding (AdaSD) for large language models (LLMs). AdaSD dynamically adjusts generation length and acceptance criteria during inference. Experiments on benchmark datasets demonstrate that AdaSD achieves up to 49% speedup over standard speculative decoding.
arXiv Detail & Related papers (2025-12-12T04:56:08Z) - OTARo: Once Tuning for All Precisions toward Robust On-Device LLMs [21.55040910903597]
OTARo is a novel method that enables on-device Large Language Models to flexibly switch quantization precisions. It achieves consistently strong and robust performance for all precisions.
arXiv Detail & Related papers (2025-11-17T08:56:27Z) - Merge and Guide: Unifying Model Merging and Guided Decoding for Controllable Multi-Objective Generation [49.98025799046136]
We introduce Merge-And-GuidE (MAGE), a two-stage framework that leverages model merging for guided decoding. In Stage 1, MAGE resolves a compatibility problem between the guidance and base models. In Stage 2, we merge explicit and implicit value models into a unified guidance proxy, which then steers the decoding of the base model from Stage 1.
arXiv Detail & Related papers (2025-10-04T11:10:07Z) - Every Step Counts: Decoding Trajectories as Authorship Fingerprints of dLLMs [63.82840470917859]
We show that the decoding mechanism of dLLMs can be used as a powerful tool for model attribution. We propose a novel information extraction scheme called the Directed Decoding Map (DDM), which captures structural relationships between decoding steps and better reveals model-specific behaviors.
arXiv Detail & Related papers (2025-10-02T06:25:10Z) - Accelerating Diffusion LLMs via Adaptive Parallel Decoding [50.9948753314669]
We introduce adaptive parallel decoding (APD), a novel method that dynamically adjusts the number of tokens sampled in parallel. APD provides markedly higher throughput with minimal quality degradation on downstream benchmarks.
arXiv Detail & Related papers (2025-05-31T06:10:10Z) - Dimple: Discrete Diffusion Multimodal Large Language Model with Parallel Decoding [53.82301522384719]
We propose Dimple, the first Discrete Diffusion Multimodal Large Language Model (DMLLM). We design a novel training paradigm that combines an initial autoregressive phase with a subsequent diffusion phase. Dimple-7B surpasses LLaVA-NEXT in performance by 3.9%, demonstrating that DMLLM can achieve performance comparable to that of autoregressive models.
arXiv Detail & Related papers (2025-05-22T17:55:04Z) - KNN-SSD: Enabling Dynamic Self-Speculative Decoding via Nearest Neighbor Layer Set Optimization [20.230236656479207]
Speculative Decoding (SD) has emerged as a widely used paradigm to accelerate the inference of large language models (LLMs). We introduce KNN-SSD, an algorithm that leverages K-Nearest Neighbor (KNN) search to match different skipped layers with various domain inputs.
arXiv Detail & Related papers (2025-05-22T03:04:47Z) - LazyDiT: Lazy Learning for the Acceleration of Diffusion Transformers [79.07412045476872]
Diffusion Transformers have emerged as the preeminent models for a wide array of generative tasks. We show that performing the full computation of the model at each diffusion step is unnecessary, as some computations can be skipped by lazily reusing the results of previous steps. We propose a lazy learning framework that efficiently leverages cached results from earlier steps to skip redundant computations.
arXiv Detail & Related papers (2024-12-17T01:12:35Z) - Progressive Mixed-Precision Decoding for Efficient LLM Inference [49.05448842542558]
We introduce Progressive Mixed-Precision Decoding (PMPD) to address the memory-boundedness of decoding. PMPD achieves 1.4-12.2× speedup in matrix-vector multiplications over fp16 models. Our approach delivers a throughput gain of 3.8-8.0× over fp16 models and up to 1.54× over uniform quantization approaches.
arXiv Detail & Related papers (2024-10-17T11:46:33Z) - COrAL: Order-Agnostic Language Modeling for Efficient Iterative Refinement [80.18490952057125]
Iterative refinement has emerged as an effective paradigm for enhancing the capabilities of large language models (LLMs) on complex tasks.
We propose Context-Wise Order-Agnostic Language Modeling (COrAL) to overcome these challenges.
Our approach models multiple token dependencies within manageable context windows, enabling the model to perform iterative refinement internally.
arXiv Detail & Related papers (2024-10-12T23:56:19Z) - DyFADet: Dynamic Feature Aggregation for Temporal Action Detection [70.37707797523723]
We build a novel dynamic feature aggregation (DFA) module that can adapt kernel weights and receptive fields at different timestamps.
Using DFA helps to develop a Dynamic TAD head (DyHead), which adaptively aggregates the multi-scale features with adjusted parameters.
DyFADet, a new dynamic TAD model, achieves promising performance on a series of challenging TAD benchmarks.
arXiv Detail & Related papers (2024-07-03T15:29:10Z) - AdaDiff: Accelerating Diffusion Models through Step-Wise Adaptive Computation [32.74923906921339]
Diffusion models achieve great success in generating diverse and high-fidelity images, yet their widespread application is hampered by their inherently slow generation speed.
We propose AdaDiff, an adaptive framework that dynamically allocates computation resources in each sampling step to improve the generation efficiency of diffusion models.
arXiv Detail & Related papers (2023-09-29T09:10:04Z) - Mixed-Precision Quantization for Deep Vision Models with Integer Quadratic Programming [7.0146264551420066]
Quantization is a widely used technique to compress neural networks, but a single uniform bit-width can degrade accuracy. Mixed-precision quantization (MPQ) addresses this by assigning varied bit-widths to layers, optimizing the accuracy-efficiency trade-off. We introduce CLADO, a practical sensitivity-based MPQ algorithm that captures cross-layer dependency of quantization error.
arXiv Detail & Related papers (2023-07-11T15:56:00Z) - Augmenting Hessians with Inter-Layer Dependencies for Mixed-Precision Post-Training Quantization [7.392278887917975]
We propose a mixed-precision post training quantization approach that assigns different numerical precisions to tensors in a network based on their specific needs.
Our experiments demonstrate latency reductions of 25.48%, 21.69%, and 33.28% compared to a 16-bit baseline.
arXiv Detail & Related papers (2023-06-08T02:18:58Z) - When Liebig's Barrel Meets Facial Landmark Detection: A Practical Model [87.25037167380522]
We propose a model that is accurate, robust, efficient, generalizable, and end-to-end trainable.
In order to achieve a better accuracy, we propose two lightweight modules.
DQInit dynamically initializes the queries of decoder from the inputs, enabling the model to achieve as good accuracy as the ones with multiple decoder layers.
QAMem is designed to enhance the discriminative ability of queries on low-resolution feature maps by assigning separate memory values to each query rather than a shared one.
arXiv Detail & Related papers (2021-05-27T13:51:42Z) - AQD: Towards Accurate Fully-Quantized Object Detection [94.06347866374927]
We propose an Accurate Quantized object Detection solution, termed AQD, to get rid of floating-point computation.
Our AQD achieves comparable or even better performance compared with the full-precision counterpart under extremely low-bit schemes.
arXiv Detail & Related papers (2020-07-14T09:07:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.