Accuracy-Delay Trade-Off in LLM Offloading via Token-Level Uncertainty
- URL: http://arxiv.org/abs/2602.07958v1
- Date: Sun, 08 Feb 2026 13:03:03 GMT
- Title: Accuracy-Delay Trade-Off in LLM Offloading via Token-Level Uncertainty
- Authors: Yumin Kim, Hyeonsu Lyu, Minjae Lee, Hyun Jong Yang
- Abstract summary: Large language models (LLMs) offer significant potential for intelligent mobile services but are computationally intensive for resource-constrained devices. Mobile edge computing (MEC) allows such devices to offload inference tasks to edge servers (ESs), yet introduces latency due to communication and server-side queuing. We propose an uncertainty-aware offloading framework that dynamically decides whether to perform inference locally or offload it to the ES.
- Score: 9.403735095944747
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) offer significant potential for intelligent mobile services but are computationally intensive for resource-constrained devices. Mobile edge computing (MEC) allows such devices to offload inference tasks to edge servers (ESs), yet introduces latency due to communication and server-side queuing, especially in multi-user environments. In this work, we propose an uncertainty-aware offloading framework that dynamically decides whether to perform inference locally or offload it to the ES, based on token-level uncertainty and resource constraints. We define a margin-based token-level uncertainty metric and demonstrate its correlation with model accuracy. Leveraging this metric, we design a greedy offloading algorithm (GOA) that minimizes delay while maintaining accuracy by prioritizing offloading for high-uncertainty queries. Our experiments show that GOA consistently achieves a favorable trade-off, outperforming baseline strategies in both accuracy and latency across varying user densities, and operates with practical computation time. These results establish GOA as a scalable and effective solution for LLM inference in MEC environments.
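The abstract names a margin-based token-level uncertainty metric and a greedy offloading algorithm (GOA) but gives no formulas here. The Python sketch below illustrates one plausible reading: uncertainty as the average of one minus the top-two softmax-probability margin over generated tokens, and a greedy rule that offloads the most uncertain queries first up to a server capacity. The function names, the averaging, and the `server_slots` capacity model are assumptions for illustration, not the paper's exact design.

```python
import numpy as np

def margin_uncertainty(token_logits: np.ndarray) -> float:
    """Margin-based token-level uncertainty (illustrative form, not
    necessarily the paper's exact definition).

    token_logits: (num_tokens, vocab_size) logits from the on-device model.
    A small gap between the top-two softmax probabilities means the model
    is unsure about that token; we average (1 - margin) over all tokens.
    """
    z = token_logits - token_logits.max(axis=-1, keepdims=True)  # stable softmax
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    top2 = np.sort(probs, axis=-1)[:, -2:]   # two largest probs per token
    margin = top2[:, 1] - top2[:, 0]         # p_top1 - p_top2, in [0, 1]
    return float(np.mean(1.0 - margin))

def greedy_offload(uncertainties: list[float], server_slots: int) -> set[int]:
    """Greedy offloading rule: send the `server_slots` most uncertain
    queries to the edge server and keep the rest on-device (a sketch of
    the prioritization idea, ignoring the paper's delay constraints)."""
    ranked = sorted(range(len(uncertainties)),
                    key=lambda i: uncertainties[i], reverse=True)
    return set(ranked[:server_slots])
```

Under this reading, a near-zero margin (two candidate tokens almost equally likely) pushes a query toward offloading to the more capable server-side model.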
Related papers
- Reliable LLM-Based Edge-Cloud-Expert Cascades for Telecom Knowledge Systems [54.916243942641444]
Large language models (LLMs) are emerging as key enablers of automation in domains such as telecommunications. We study an edge-cloud-expert cascaded LLM-based knowledge system that supports decision-making through a question-and-answer pipeline.
arXiv Detail & Related papers (2025-12-23T03:10:09Z) - Nemotron-Flash: Towards Latency-Optimal Hybrid Small Language Models [97.55009021098554]
This work aims to identify the key determinants of SLMs' real-device latency and offer generalizable principles and methodologies for SLM design and training. We introduce a new family of hybrid SLMs, called Nemotron-Flash, which significantly advances the accuracy-efficiency frontier of state-of-the-art SLMs.
arXiv Detail & Related papers (2025-11-24T08:46:36Z) - Collaborative Large Language Model Inference via Resource-Aware Parallel Speculative Decoding [6.130486652666936]
Speculative decoding offers a promising solution by partitioning token generation between a lightweight draft model on mobile devices and a powerful target model on edge servers. This paper is the first to propose a unified framework that jointly optimizes user association and resource allocation to support efficient parallel speculative decoding. Results show that our method reduces end-to-end latency by up to 28.0% (23.7% on average) without compromising inference accuracy (a generic draft-then-verify sketch follows the list below).
arXiv Detail & Related papers (2025-11-03T16:04:44Z) - EdgeReasoning: Characterizing Reasoning LLM Deployment on Edge GPUs [0.36050743818632486]
Large language models (LLMs) for reasoning tasks on edge GPUs face critical challenges from strict latency constraints and limited computational resources. To navigate these constraints, developers must balance reasoning versus non-reasoning architectures, select appropriate model sizes, allocate token budgets, and apply test-time scaling strategies. We present EdgeReasoning, a comprehensive study characterizing the deployment of reasoning LLMs on edge GPUs.
arXiv Detail & Related papers (2025-10-21T04:18:25Z) - Federated Learning-Enabled Hybrid Language Models for Communication-Efficient Token Transmission [87.68447072141402]
Hybrid Language Models (HLMs) combine the low-latency efficiency of Small Language Models (SLMs) on edge devices with the high accuracy of Large Language Models (LLMs) on centralized servers. We propose FedHLM, a communication-efficient HLM framework that integrates uncertainty-aware inference with Federated Learning (FL).
arXiv Detail & Related papers (2025-06-30T02:56:11Z) - The Larger the Merrier? Efficient Large AI Model Inference in Wireless Edge Networks [56.37880529653111]
The demand for large artificial intelligence model (LAIM) services is driving a paradigm shift from traditional cloud-based inference to edge-based inference for low-latency, privacy-preserving applications. In this paper, we investigate the LAIM-inference scheme, where a pre-trained LAIM is pruned and partitioned into on-device and on-server sub-models for deployment.
arXiv Detail & Related papers (2025-05-14T08:18:55Z) - Enhancing LLM Reliability via Explicit Knowledge Boundary Modeling [41.19330514054401]
Large language models (LLMs) are prone to hallucination stemming from misaligned self-awareness. We propose the Explicit Knowledge Boundary Modeling framework, which integrates fast and slow reasoning systems to harmonize reliability and usability.
arXiv Detail & Related papers (2025-03-04T03:16:02Z) - Confident or Seek Stronger: Exploring Uncertainty-Based On-device LLM Routing From Benchmarking to Generalization [61.02719787737867]
Large language models (LLMs) are increasingly deployed and democratized on edge devices. One promising solution is uncertainty-based SLM routing, which offloads high-stakes queries to stronger LLMs when the SLM produces low-confidence responses. We conduct a comprehensive investigation into benchmarking and generalization of uncertainty-driven routing strategies from SLMs to LLMs over 1500+ settings.
arXiv Detail & Related papers (2025-02-06T18:59:11Z) - Adaptive Stream Processing on Edge Devices through Active Inference [5.5676731834895765]
We present a novel Machine Learning paradigm based on Active Inference (AIF).
AIF describes how the brain constantly predicts and evaluates sensory information to decrease long-term surprise.
Our method guarantees full transparency in decision making, making the interpretation of results and troubleshooting effortless.
arXiv Detail & Related papers (2024-09-26T15:12:41Z) - Active Inference on the Edge: A Design Study [5.815300670677979]
Active Inference (ACI) is a concept from neuroscience that describes how the brain constantly predicts and evaluates sensory information to decrease long-term surprise.
We show how our ACI agent was able to quickly and traceably solve an optimization problem while fulfilling requirements.
arXiv Detail & Related papers (2023-11-17T16:03:04Z) - Differentially Private Deep Q-Learning for Pattern Privacy Preservation in MEC Offloading [76.0572817182483]
Attackers may eavesdrop on offloading decisions to infer the edge server's (ES's) queue information and users' usage patterns.
We propose an offloading strategy that jointly minimizes the latency, the ES's energy consumption, and the task dropping rate, while preserving pattern privacy (PP).
We develop a Differential Privacy Deep Q-learning based Offloading (DP-DQO) algorithm to solve this problem while addressing the PP issue by injecting noise into the generated offloading decisions (a generic randomized-response sketch follows the list below).
arXiv Detail & Related papers (2023-02-09T12:50:18Z) - FIRE: A Failure-Adaptive Reinforcement Learning Framework for Edge Computing Migrations [54.34189781923818]
FIRE is a framework that adapts to rare events by training an RL policy in an edge computing digital twin environment. We propose ImRE, an importance sampling-based Q-learning algorithm, which samples rare events proportionally to their impact on the value function. We show that FIRE reduces costs compared to vanilla RL and the greedy baseline in the event of failures.
arXiv Detail & Related papers (2022-09-28T19:49:39Z) - Real-Time Edge Classification: Optimal Offloading under Token Bucket Constraints [13.583977689847433]
We introduce a Markov Decision Process-based framework to make offload decisions under strict latency constraints (a generic token-bucket sketch follows the list below).
We also propose approaches to allow multiple devices connected to the same access switch to share their bursting allocation.
We evaluate and analyze the policies derived using our framework on the standard ImageNet image classification benchmark.
arXiv Detail & Related papers (2020-10-26T17:25:29Z)
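For the last entry above, the binding network constraint is the token bucket itself. As a refresher on that primitive (not the paper's MDP policy), here is a minimal sketch: tokens accrue at `rate` per second up to `burst`, and an offload is admitted only when enough budget is available, otherwise the query stays local. All names are illustrative.

```python
import time

class TokenBucket:
    """Classic token-bucket limiter: tokens accrue at `rate` per second,
    capped at `burst`; each offload spends `cost` tokens if available."""

    def __init__(self, rate: float, burst: float):
        self.rate, self.burst = rate, burst
        self.tokens = burst                  # start with a full bucket
        self.last = time.monotonic()

    def try_offload(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill for the elapsed interval, never exceeding the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True                      # budget available: offload
        return False                         # no budget: classify locally
```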
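The collaborative speculative-decoding entry above splits generation between a device-side draft model and a server-side target model. The sketch below shows only the generic greedy draft-then-verify round, with `draft_next` and `target_next` as hypothetical next-token callables; the paper's actual contribution (jointly optimized user association and resource allocation) is not modeled here.

```python
def speculative_round(draft_next, target_next, prefix: list[int], k: int = 4) -> list[int]:
    """One draft-then-verify round (simplified greedy variant): the small
    on-device model proposes k tokens, the server-side target model
    re-derives each position and keeps tokens up to the first mismatch."""
    # Device side: autoregressively draft k tokens with the small model.
    ctx, drafted = list(prefix), []
    for _ in range(k):
        token = draft_next(ctx)
        drafted.append(token)
        ctx.append(token)

    # Server side: verify; at the first disagreement, keep the target's token.
    ctx, accepted = list(prefix), []
    for token in drafted:
        expected = target_next(ctx)
        accepted.append(expected)
        ctx.append(expected)
        if expected != token:
            break
    return accepted
```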
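Finally, the DP-DQO entry injects noise into offloading decisions to hide usage patterns. One textbook way to randomize a binary decision under epsilon-differential privacy is randomized response, sketched below; DP-DQO's actual mechanism operates inside a deep Q-learning pipeline and may differ.

```python
import math
import random

def noisy_decision(offload: bool, epsilon: float) -> bool:
    """Epsilon-DP randomized response for a binary offloading decision:
    report the truth with probability e^eps / (e^eps + 1), else flip it,
    so an eavesdropper cannot reliably reconstruct the true pattern."""
    keep = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    return offload if random.random() < keep else not offload
```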
This list is automatically generated from the titles and abstracts of the papers on this site.