Related papers: Queueing-Aware Optimization of Reasoning Tokens for Accuracy-Latency Trade-offs in LLM Servers

URL: http://arxiv.org/abs/2601.10274v1
Date: Thu, 15 Jan 2026 10:47:11 GMT
Title: Queueing-Aware Optimization of Reasoning Tokens for Accuracy-Latency Trade-offs in LLM Servers
Authors: Emre Ozbas, Melih Bastopcu
Abstract summary: We consider a single large language model (LLM) server that serves a heterogeneous stream of queries belonging to $N$ distinct task types. For each task type, the server allocates a fixed number of internal thinking tokens, which determines the computational effort devoted to that query. We formulate a constrained optimization problem that maximizes a weighted average accuracy objective penalized by the mean system time.
Score: 4.3400407844814985
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We consider a single large language model (LLM) server that serves a heterogeneous stream of queries belonging to $N$ distinct task types. Queries arrive according to a Poisson process, and each type occurs with a known prior probability. For each task type, the server allocates a fixed number of internal thinking tokens, which determines the computational effort devoted to that query. The token allocation induces an accuracy-latency trade-off: the service time follows an approximately affine function of the allocated tokens, while the probability of a correct response exhibits diminishing returns. Under a first-in, first-out (FIFO) service discipline, the system operates as an $M/G/1$ queue, and the mean system time depends on the first and second moments of the resulting service-time distribution. We formulate a constrained optimization problem that maximizes a weighted average accuracy objective penalized by the mean system time, subject to architectural token-budget constraints and queue-stability conditions. The objective function is shown to be strictly concave over the stability region, which ensures existence and uniqueness of the optimal token allocation. The first-order optimality conditions yield a coupled projected fixed-point characterization of the optimum, together with an iterative solution and an explicit sufficient condition for contraction. Moreover, a projected gradient method with a computable global step-size bound is developed to guarantee convergence beyond the contractive regime. Finally, integer-valued token allocations are attained via rounding of the continuous solution, and the resulting performance loss is evaluated in simulation results.
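The trade-off described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions that do not come from the paper: the service time of a type-k query is taken to be deterministic and exactly affine in its token count, accuracy is modelled as a saturating exponential, and a simple backtracking step replaces the paper's step-size bound. All function names, parameter names, and numeric values are hypothetical. The mean system time uses the standard Pollaczek-Khinchine formula for an M/G/1 FIFO queue, E[T] = E[S] + lambda E[S^2] / (2 (1 - lambda E[S])).

```python
import math

def service_moments(tokens, priors, base, per_token):
    # Assumed service model: type k takes base[k] + per_token[k] * tokens[k] seconds.
    s = [a + b * l for a, b, l in zip(base, per_token, tokens)]
    m1 = sum(p * x for p, x in zip(priors, s))      # E[S]
    m2 = sum(p * x * x for p, x in zip(priors, s))  # E[S^2]
    return s, m1, m2

def mean_system_time(tokens, priors, lam, base, per_token):
    # Pollaczek-Khinchine mean system time; infinite when the queue is unstable.
    _, m1, m2 = service_moments(tokens, priors, base, per_token)
    rho = lam * m1
    if rho >= 1.0:
        return math.inf
    return m1 + lam * m2 / (2.0 * (1.0 - rho))

def objective(tokens, priors, lam, base, per_token, ceil_acc, rate, weight):
    # Assumed accuracy model with diminishing returns: A_k * (1 - exp(-g_k * l_k)).
    acc = sum(p * A * (1.0 - math.exp(-g * l))
              for p, A, g, l in zip(priors, ceil_acc, rate, tokens))
    return acc - weight * mean_system_time(tokens, priors, lam, base, per_token)

def gradient(tokens, priors, lam, base, per_token, ceil_acc, rate, weight):
    # Analytic gradient of the objective; only valid inside the stability region.
    s, m1, m2 = service_moments(tokens, priors, base, per_token)
    slack = 1.0 - lam * m1
    common = 1.0 + lam * lam * m2 / (2.0 * slack * slack)
    return [p * (A * g * math.exp(-g * l) - weight * b * (common + lam * x / slack))
            for p, A, g, l, b, x in zip(priors, ceil_acc, rate, tokens, per_token, s)]

def projected_gradient_ascent(priors, lam, base, per_token, ceil_acc, rate, weight,
                              max_tokens, iters=500, step=1e5, tol=1e-12):
    args = (priors, lam, base, per_token, ceil_acc, rate, weight)
    x = [0.0] * len(priors)
    if lam * sum(p * a for p, a in zip(priors, base)) >= 1.0:
        raise ValueError("queue is unstable even with zero thinking tokens")
    best = objective(x, *args)
    for _ in range(iters):
        g = gradient(x, *args)
        t = step
        while t > 1e-12:
            # Gradient step followed by projection onto the box [0, max_tokens].
            cand = [min(max(xi + t * gi, 0.0), max_tokens) for xi, gi in zip(x, g)]
            val = objective(cand, *args)  # -inf outside the stability region
            if val > best:
                break
            t *= 0.5                      # backtrack until the objective improves
        else:
            return x                      # no improving step was found
        converged = val - best < tol
        x, best = cand, val
        if converged:
            break
    return x

# Hypothetical three-type example (positive latency weight assumed).
priors = [0.5, 0.3, 0.2]
params = (priors, 0.2, [0.5] * 3, [0.002] * 3, [0.9, 0.8, 0.95], [0.01, 0.005, 0.02], 0.05)
continuous = projected_gradient_ascent(*params, max_tokens=2000.0)
rounded = [round(v) for v in continuous]  # integer allocation by rounding
print(rounded, objective(rounded, *params))
```

Returning negative infinity outside the stability region lets the backtracking loop reject any step that would overload the queue, so every iterate stays feasible without a separate stability projection.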

Related papers

Is Best-of-N the Best of Them? Coverage, Scaling, and Optimality in Inference-Time Alignment [54.787826863212146]
Inference-time computation offers a powerful axis for scaling the performance of language models. We analyze the performance of inference-time alignment algorithms in terms of (i) response quality, and (ii) compute. We introduce $\texttt{InferenceTimePessimism}$, a new algorithm which mitigates reward hacking through deliberate use of inference-time compute.
arXiv Detail & Related papers (2025-03-27T18:00:08Z)
Online Fair Allocation of Perishable Resources [1.4952056744888913]
We consider a practically motivated variant of the canonical online fair allocation problem. A decision-maker has a budget of perishable resources to allocate over a fixed number of rounds. The goal is to construct a sequence of allocations that is envy-free and efficient.
arXiv Detail & Related papers (2024-06-04T15:14:10Z)
DASA: Delay-Adaptive Multi-Agent Stochastic Approximation [64.32538247395627]
We consider a setting in which $N$ agents aim to speed up a common Stochastic Approximation problem by acting in parallel and communicating with a central server. To mitigate the effect of delays and stragglers, we propose \texttt{DASA}, a Delay-Adaptive algorithm for multi-agent Stochastic Approximation.
arXiv Detail & Related papers (2024-03-25T22:49:56Z)
Stochastic Approximation with Delayed Updates: Finite-Time Rates under Markovian Sampling [73.5602474095954]
We study the non-asymptotic performance of stochastic approximation schemes with delayed updates under Markovian sampling. Our theoretical findings shed light on the finite-time effects of delays for a broad class of algorithms.
arXiv Detail & Related papers (2024-02-19T03:08:02Z)
Decentralized Multi-Task Online Convex Optimization Under Random Link Failures [5.513958040574729]
We develop a robust decentralized saddle-point algorithm against random link failures with heterogeneous probabilities. We extend our algorithm and analysis to the two-point bandit feedback scenario.
arXiv Detail & Related papers (2024-01-04T00:57:33Z)
Fully Stochastic Trust-Region Sequential Quadratic Programming for Equality-Constrained Optimization Problems [62.83783246648714]
We propose a trust-region stochastic sequential quadratic programming algorithm (TR-StoSQP) to solve nonlinear optimization problems with stochastic objectives and deterministic equality constraints. The algorithm adaptively selects the trust-region radius and, compared to the existing line-search StoSQP schemes, allows us to utilize indefinite Hessian matrices.
arXiv Detail & Related papers (2022-11-29T05:52:17Z)
Optimal Admission Control for Multiclass Queues with Time-Varying Arrival Rates via State Abstraction [16.99621896314678]
We consider a novel queuing problem where the decision-maker must choose to accept or reject randomly arriving tasks. The objective is to decide which tasks to accept so that the total price of tasks processed is maximised over a finite horizon. We show that the optimal value function has a specific structure, which enables us to solve the hybrid MDP exactly.
arXiv Detail & Related papers (2022-03-14T12:38:13Z)
Learning to Schedule in Parallel-Server Queues with Stochastic Bilinear Rewards [7.519872646378837]
We consider the problem of scheduling in multi-class, parallel-server systems with uncertain rewards from job-server assignments. Our objective is to minimize regret by maximizing the cumulative reward of job-server assignments over a time horizon. Our algorithm achieves a sublinear regret bound and a sublinear mean holding cost.
arXiv Detail & Related papers (2021-12-13T00:37:20Z)
Evolutionary Optimization of High-Coverage Budgeted Classifiers [1.7767466724342065]
Budgeted multi-stage classifiers (MSC) process inputs through a sequence of partial feature acquisition and evaluation steps. This paper proposes a problem-specific MSC that incorporates a terminal reject option for indecisive predictions. The algorithm's design emphasizes efficiency while respecting a notion of aggregated performance via a unique scalarization.
arXiv Detail & Related papers (2021-10-25T16:03:07Z)
Distributed stochastic optimization with large delays [59.95552973784946]
One of the most widely used methods for solving large-scale optimization problems is distributed asynchronous stochastic gradient descent (DASGD). We show that DASGD converges to a global optimum under the same delay assumptions.
arXiv Detail & Related papers (2021-07-06T21:59:49Z)
Online Stochastic Optimization with Wasserstein Based Non-stationarity [12.91020811577007]
We consider a general online optimization problem with multiple budget constraints over a horizon of finite time periods. The objective of the decision maker is to maximize the cumulative reward subject to the budget constraints. This formulation captures a wide range of applications including online linear programming and network revenue management.
arXiv Detail & Related papers (2020-12-13T04:47:37Z)
Combining Deep Learning and Optimization for Security-Constrained Optimal Power Flow [94.24763814458686]
Security-constrained optimal power flow (SCOPF) is fundamental in power systems. Modeling of automatic primary response (APR) within the SCOPF problem results in complex large-scale mixed-integer programs. This paper proposes a novel approach that combines deep learning and robust optimization techniques.
arXiv Detail & Related papers (2020-07-14T12:38:21Z)

This list is automatically generated from the titles and abstracts of the papers on this site.