Queueing-Aware Optimization of Reasoning Tokens for Accuracy-Latency Trade-offs in LLM Servers
- URL: http://arxiv.org/abs/2601.10274v1
- Date: Thu, 15 Jan 2026 10:47:11 GMT
- Title: Queueing-Aware Optimization of Reasoning Tokens for Accuracy-Latency Trade-offs in LLM Servers
- Authors: Emre Ozbas, Melih Bastopcu,
- Abstract summary: We consider a single large language model (LLM) server that serves a heterogeneous stream of queries belonging to $N$ distinct task types.<n>For each task type, the server allocates a fixed number of internal thinking tokens, which determines the computational effort devoted to that query.<n>We formulate a constrained optimization problem that maximizes a weighted average accuracy objective penalized by the mean system time.
- Score: 4.3400407844814985
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider a single large language model (LLM) server that serves a heterogeneous stream of queries belonging to $N$ distinct task types. Queries arrive according to a Poisson process, and each type occurs with a known prior probability. For each task type, the server allocates a fixed number of internal thinking tokens, which determines the computational effort devoted to that query. The token allocation induces an accuracy-latency trade-off: the service time follows an approximately affine function of the allocated tokens, while the probability of a correct response exhibits diminishing returns. Under a first-in, first-out (FIFO) service discipline, the system operates as an $M/G/1$ queue, and the mean system time depends on the first and second moments of the resulting service-time distribution. We formulate a constrained optimization problem that maximizes a weighted average accuracy objective penalized by the mean system time, subject to architectural token-budget constraints and queue-stability conditions. The objective function is shown to be strictly concave over the stability region, which ensures existence and uniqueness of the optimal token allocation. The first-order optimality conditions yield a coupled projected fixed-point characterization of the optimum, together with an iterative solution and an explicit sufficient condition for contraction. Moreover, a projected gradient method with a computable global step-size bound is developed to guarantee convergence beyond the contractive regime. Finally, integer-valued token allocations are attained via rounding of the continuous solution, and the resulting performance loss is evaluated in simulation results.
Related papers
- Is Best-of-N the Best of Them? Coverage, Scaling, and Optimality in Inference-Time Alignment [54.787826863212146]
Inference-time computation offers a powerful axis for scaling the performance of language models.<n>We analyze the performance of inference-time alignment algorithms in terms of (i) response quality, and (ii) compute.<n>We introduce $textttInferenceTimePessimism$, a new algorithm which mitigates reward hacking through deliberate use of inference-time compute.
arXiv Detail & Related papers (2025-03-27T18:00:08Z) - Online Fair Allocation of Perishable Resources [1.4952056744888913]
We consider a practically motivated variant of the canonical online fair allocation problem.
A decision-maker has a budget of perishable resources to allocate over a fixed number of rounds.
The goal is to construct a sequence of allocations that is envy-free and efficient.
arXiv Detail & Related papers (2024-06-04T15:14:10Z) - DASA: Delay-Adaptive Multi-Agent Stochastic Approximation [64.32538247395627]
We consider a setting in which $N$ agents aim to speedup a common Approximation problem by acting in parallel and communicating with a central server.
To mitigate the effect of delays and stragglers, we propose textttDASA, a Delay-Adaptive algorithm for multi-agent Approximation.
arXiv Detail & Related papers (2024-03-25T22:49:56Z) - Stochastic Approximation with Delayed Updates: Finite-Time Rates under Markovian Sampling [73.5602474095954]
We study the non-asymptotic performance of approximation schemes with delayed updates under Markovian sampling.
Our theoretical findings shed light on the finite-time effects of delays for a broad class of algorithms.
arXiv Detail & Related papers (2024-02-19T03:08:02Z) - Decentralized Multi-Task Online Convex Optimization Under Random Link
Failures [5.513958040574729]
We develop a robust decentralized saddle-point algorithm against random link failures with heterogeneous probabilities.
We extend our algorithm and analysis to the two-point bandit feedback scenario.
arXiv Detail & Related papers (2024-01-04T00:57:33Z) - Fully Stochastic Trust-Region Sequential Quadratic Programming for
Equality-Constrained Optimization Problems [62.83783246648714]
We propose a sequential quadratic programming algorithm (TR-StoSQP) to solve nonlinear optimization problems with objectives and deterministic equality constraints.
The algorithm adaptively selects the trust-region radius and, compared to the existing line-search StoSQP schemes, allows us to utilize indefinite Hessian matrices.
arXiv Detail & Related papers (2022-11-29T05:52:17Z) - Optimal Admission Control for Multiclass Queues with Time-Varying
Arrival Rates via State Abstraction [16.99621896314678]
We consider a novel queuing problem where the decision-maker must choose to accept or reject randomly arriving tasks.
The objective is to decide which tasks to accept so that the total price of tasks processed is maximised over a finite horizon.
We show that the optimal value function has a specific structure, which enables us to solve the hybrid MDP exactly.
arXiv Detail & Related papers (2022-03-14T12:38:13Z) - Learning to Schedule in Parallel-Server Queues with Stochastic Bilinear Rewards [7.519872646378837]
We consider the problem of scheduling in multi-class, parallel-server systems with uncertain rewards from job-server assignments.<n>Our objective is to minimize regret by maximizing the cumulative reward of job-server assignments over a time horizon.<n>Our algorithm achieves a sublinear regret bound and a sublinear mean holding cost.
arXiv Detail & Related papers (2021-12-13T00:37:20Z) - Evolutionary Optimization of High-Coverage Budgeted Classifiers [1.7767466724342065]
Budgeted multi-feature classifiers (MSC) process inputs through a sequence of partial feature acquisition and evaluation steps.
This paper proposes a problem-specific MSC that incorporates a terminal reject option for indecisive predictions.
The algorithm's design emphasizes efficiency while respecting a notion of aggregated performance via a uniqueization.
arXiv Detail & Related papers (2021-10-25T16:03:07Z) - Distributed stochastic optimization with large delays [59.95552973784946]
One of the most widely used methods for solving large-scale optimization problems is distributed asynchronous gradient descent (DASGD)
We show that DASGD converges to a global optimal implementation model under same delay assumptions.
arXiv Detail & Related papers (2021-07-06T21:59:49Z) - Online Stochastic Optimization with Wasserstein Based Non-stationarity [12.91020811577007]
We consider a general online optimization problem with multiple budget constraints over a horizon of finite time periods.
The objective of the decision maker is to maximize the cumulative reward subject to the budget constraints.
This formulation captures a wide range of applications including online linear programming and network revenue management.
arXiv Detail & Related papers (2020-12-13T04:47:37Z) - Combining Deep Learning and Optimization for Security-Constrained
Optimal Power Flow [94.24763814458686]
Security-constrained optimal power flow (SCOPF) is fundamental in power systems.
Modeling of APR within the SCOPF problem results in complex large-scale mixed-integer programs.
This paper proposes a novel approach that combines deep learning and robust optimization techniques.
arXiv Detail & Related papers (2020-07-14T12:38:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.