Online Resource Allocation with Average Budget Constraints
- URL: http://arxiv.org/abs/2402.11425v5
- Date: Fri, 26 Sep 2025 03:06:32 GMT
- Title: Online Resource Allocation with Average Budget Constraints
- Authors: Ruicheng Ao, Hongyu Chen, David Simchi-Levi, Feng Zhu
- Abstract summary: We consider the problem of online resource allocation with average budget constraints. We show that a simple policy achieves an $O(\sqrt{T})$ regret. We propose a novel policy that incorporates budget safety buffers.
- Score: 17.47923923731483
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the problem of online resource allocation with average budget constraints. At each time point, the decision maker makes an irrevocable decision of whether to accept or reject a request before the next request arrives, with the goal of maximizing the cumulative rewards. In contrast to existing literature that requires the total resource consumption to stay below a certain level, we require that the average resource consumption per accepted request not exceed a given threshold. This problem can be cast as an online knapsack problem with exogenous random budget replenishment, and finds applications in various fields such as online anomaly detection, sequential advertising, and per-capita public service providers. We start with general arrival distributions and show that a simple policy achieves an $O(\sqrt{T})$ regret. We complement this result by showing that such a regret growth rate is in general not improvable. We then shift our focus to discrete arrival distributions. We find that many existing re-solving heuristics in the online resource allocation literature, although they achieve bounded loss in canonical settings, may incur an $\Omega(\sqrt{T})$ or even an $\Omega(T)$ regret. With the observation that canonical policies tend to be too optimistic and over-accept arrivals, we propose a novel policy that incorporates budget safety buffers. It turns out that a little more safety can greatly enhance efficiency: small additional logarithmic buffers suffice to reduce the regret from $\Omega(\sqrt{T})$ or even $\Omega(T)$ to $O(\ln^2 T)$. From a practical perspective, we extend the policy to scenarios with continuous arrival distributions, time-dependent information structures, and unknown $T$. We conduct both synthetic experiments and empirical applications on time series data of New York City taxi passengers to validate the performance of our proposed policies.
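To make the accept/reject rule concrete, the following is a minimal Python sketch of a buffered greedy policy under the average-budget constraint. The buffer schedule `kappa * ln(T) / sqrt(t)` and the uniform arrivals are illustrative assumptions, not the paper's exact construction.

```python
import math
import random

def safety_buffer_policy(requests, rho, T, kappa=1.0):
    """Accept a request only if doing so keeps the running average cost
    per accepted request below rho minus a shrinking safety buffer."""
    total_reward, total_cost, accepted = 0.0, 0.0, 0
    for t, (reward, cost) in enumerate(requests, start=1):
        # Hypothetical buffer schedule; the paper uses small logarithmic
        # buffers, but their exact form is not reproduced here.
        buffer = kappa * math.log(T) / math.sqrt(t)
        avg_if_accept = (total_cost + cost) / (accepted + 1)
        if avg_if_accept <= rho - buffer:
            total_reward += reward
            total_cost += cost
            accepted += 1
    return total_reward, accepted

# Toy run: i.i.d. uniform rewards and costs, average-cost threshold 0.5.
random.seed(0)
T = 10_000
requests = [(random.random(), random.random()) for _ in range(T)]
print(safety_buffer_policy(requests, rho=0.5, T=T))
```

Setting `kappa = 0` recovers the unbuffered greedy rule that, per the abstract, tends to over-accept arrivals.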
Related papers
- Non-Stationary Online Resource Allocation: Learning from a Single Sample [5.81028169940199]
We study online resource allocation under non-stationary demand with a minimum offline data requirement.
We propose a novel type-dependent quantile-based meta-policy that decouples the problem into modular components.
arXiv Detail & Related papers (2026-02-20T10:07:35Z) - Model Predictive Control is almost Optimal for Heterogeneous Restless Multi-armed Bandits [6.402634424631123]
We show that a natural finite-horizon LP-update policy with randomized rounding achieves an $O(\log N\sqrt{1/N})$ optimality gap in infinite-time average reward problems.
Our results draw on techniques from the model predictive control literature by invoking the concept of dissipativity.
arXiv Detail & Related papers (2025-11-11T10:53:49Z) - No-Regret Learning Under Adversarial Resource Constraints: A Spending Plan Is All You Need! [56.80767500991973]
We focus on two canonical settings: $(i)$ online resource allocation, where rewards and costs are observed before action selection, and $(ii)$ online learning with resource constraints, where they are observed after action selection, under full feedback or bandit feedback.
It is well known that achieving sublinear regret in these settings is impossible when reward and cost distributions may change arbitrarily over time.
We design general (primal-)dual methods that achieve sublinear regret with respect to baselines that follow the spending plan. Crucially, the performance of our algorithms improves when the spending plan ensures a well-balanced distribution of the budget.
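As a rough illustration of the primal-dual idea, the sketch below accepts a request when its reward beats the current dual price of its cost, then nudges the dual variable toward a per-round spending plan. The step size, the uniform plan, and the greedy primal rule are all illustrative assumptions rather than the paper's method.

```python
import random

def dual_spending_plan(requests, plan, eta=0.05):
    """Accept a request when its reward beats the current dual price of
    its cost, then push the dual variable toward the per-round plan."""
    lam, total_reward = 0.0, 0.0
    for (reward, cost), budget_t in zip(requests, plan):
        spend = 0.0
        if reward >= lam * cost:           # primal step: Lagrangian greedy
            total_reward += reward
            spend = cost
        # Dual step: raise the price when we overspend the plan, lower it
        # (down to zero) when we underspend.
        lam = max(0.0, lam + eta * (spend - budget_t))
    return total_reward

random.seed(1)
T = 5_000
requests = [(random.random(), random.random()) for _ in range(T)]
plan = [0.3] * T                           # a uniform spending plan
print(dual_spending_plan(requests, plan))
```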
arXiv Detail & Related papers (2025-06-16T08:42:31Z) - Quantile-Optimal Policy Learning under Unmeasured Confounding [55.72891849926314]
We study quantile-optimal policy learning, where the goal is to find a policy whose reward distribution has the largest $\alpha$-quantile for some $\alpha \in (0, 1)$.
Such a problem suffers from three main challenges: (i) nonlinearity of the quantile objective as a functional of the reward distribution, (ii) unobserved confounding, and (iii) insufficient coverage of the offline dataset.
arXiv Detail & Related papers (2025-06-08T13:37:38Z) - Distributionally Robust Policy Learning under Concept Drifts [33.44768994272614]
This paper studies a more nuanced problem: robust policy learning under concept drift.
We first provide a doubly-robust estimator for evaluating the worst-case average reward of a given policy.
We then propose a learning algorithm that outputs the policy maximizing the estimated policy value within a given policy class.
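For reference, the classical doubly-robust building block can be written in a few lines of NumPy; the worst-case reweighting over concept drifts that the paper adds on top is omitted, and the toy models below (`pi`, `q_hat`, `e_hat`) are assumptions for the example.

```python
import numpy as np

def doubly_robust_value(X, A, R, pi, q_hat, e_hat):
    """Classical doubly-robust estimate of a target policy's average
    reward from logged data: outcome-model term plus an importance-
    weighted residual correction."""
    a_pi = pi(X)                                   # actions the target policy takes
    direct = q_hat(X, a_pi)                        # outcome-model term
    match = (A == a_pi).astype(float)              # 1 when logged action agrees
    correction = match / e_hat(X, A) * (R - q_hat(X, A))
    return float(np.mean(direct + correction))

# Toy example: binary context, uniform logging policy over two actions.
rng = np.random.default_rng(3)
X = rng.integers(0, 2, size=10_000)
A = rng.integers(0, 2, size=10_000)
R = (A == X).astype(float) + rng.normal(0, 0.1, size=10_000)
pi = lambda x: x                                   # target: match the context
q_hat = lambda x, a: (a == x).astype(float)        # toy outcome model
e_hat = lambda x, a: np.full(x.shape, 0.5)         # uniform logging propensity
print(doubly_robust_value(X, A, R, pi, q_hat, e_hat))  # approximately 1.0
```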
arXiv Detail & Related papers (2024-12-18T19:53:56Z) - Oracle-Efficient Reinforcement Learning for Max Value Ensembles [7.404901768256101]
Reinforcement learning (RL) in large or infinite state spaces is notoriously challenging, theoretically and experimentally.
In this work we aim to compete with the max-following policy, which at each state follows the action of whichever constituent policy has the highest value.
Our main result is an efficient algorithm that learns to compete with the max-following policy, given only access to the constituent policies.
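The max-following baseline itself is simple to state in code, assuming the constituent policies' value functions are given (the toy policies and values below are hypothetical stand-ins):

```python
def max_following_action(state, policies, value_fns):
    """Act like the constituent policy whose (given) value function is
    largest at the current state."""
    i = max(range(len(policies)), key=lambda k: value_fns[k](state))
    return policies[i](state)

# Toy 1-D example with two hypothetical policies and value estimates.
policies = [lambda s: "go-left", lambda s: "go-right"]
value_fns = [lambda s: -s, lambda s: s]  # second policy is better when s > 0
print(max_following_action(-2.0, policies, value_fns))  # go-left
print(max_following_action(+2.0, policies, value_fns))  # go-right
```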
arXiv Detail & Related papers (2024-05-27T01:08:23Z) - Online Policy Learning and Inference by Matrix Completion [12.527541242185404]
We formulate the problem as a matrix completion bandit (MCB).
We explore the $\epsilon$-greedy bandit and online gradient descent.
A faster-decaying exploration rate yields smaller regret but learns the optimal policy less accurately.
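A small simulation makes this tradeoff visible; the two-armed Gaussian bandit and the schedule $\epsilon_t = t^{-\alpha}$ below are illustrative stand-ins for the matrix completion setting.

```python
import random

def eps_greedy(means, T, alpha):
    """Two-armed bandit with exploration rate epsilon_t = t**(-alpha):
    faster decay (larger alpha) lowers regret but leaves noisier
    estimates of the suboptimal arm."""
    n = [0] * len(means)
    est = [0.0] * len(means)
    regret, best = 0.0, max(means)
    for t in range(1, T + 1):
        if random.random() < t ** (-alpha):
            a = random.randrange(len(means))                  # explore
        else:
            a = max(range(len(means)), key=lambda i: est[i])  # exploit
        x = means[a] + random.gauss(0, 0.1)                   # noisy reward
        n[a] += 1
        est[a] += (x - est[a]) / n[a]                         # running mean
        regret += best - means[a]
    return round(regret, 1), [round(e, 3) for e in est]

random.seed(2)
for alpha in (0.3, 1.0):
    print(alpha, eps_greedy([0.5, 0.6], 20_000, alpha))
```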
arXiv Detail & Related papers (2024-04-26T13:19:27Z) - Thompson Exploration with Best Challenger Rule in Best Arm Identification [66.33448474838342]
We study the fixed-confidence best arm identification problem in the bandit framework.
We propose a novel policy that combines Thompson sampling with a computationally efficient approach known as the best challenger rule.
arXiv Detail & Related papers (2023-10-01T01:37:02Z) - Online Optimization for Randomized Network Resource Allocation with Long-Term Constraints [0.610240618821149]
We study an optimal online resource reservation problem in a simple communication network.
We propose an online saddle-point algorithm for which we present an upper bound for the associated K-benchmark regret.
arXiv Detail & Related papers (2023-05-24T20:47:17Z) - Estimating Optimal Policy Value in General Linear Contextual Bandits [50.008542459050155]
In many bandit problems, the maximal reward achievable by a policy is often unknown in advance.
We consider the problem of estimating the optimal policy value in the sublinear data regime before the optimal policy is even learnable.
We present a more practical, computationally efficient algorithm that estimates a problem-dependent upper bound on $V^*$.
arXiv Detail & Related papers (2023-02-19T01:09:24Z) - Probably Anytime-Safe Stochastic Combinatorial Semi-Bandits [81.60136088841948]
We propose an algorithm that minimizes the regret over the horizon of time $T$.
The proposed algorithm is applicable to domains such as recommendation systems and transportation.
arXiv Detail & Related papers (2023-01-31T03:49:00Z) - Anytime-valid off-policy inference for contextual bandits [34.721189269616175]
Contextual bandit algorithms map observed contexts $X_t$ to actions $A_t$ over time.
It is often of interest to estimate the properties of a hypothetical policy that is different from the logging policy that was used to collect the data.
We present a comprehensive framework for OPE inference that relaxes unnecessary conditions made in some past works.
arXiv Detail & Related papers (2022-10-19T17:57:53Z) - A Simple and Optimal Policy Design with Safety against Heavy-Tailed Risk for Stochastic Bandits [27.058087400790555]
We study the multi-armed bandit problem and design new policies that enjoy both worst-case optimality for expected regret and light-tailed risk for regret distribution.
We find that, from a managerial perspective, our new policy design yields better tail distributions and is preferable to celebrated policies.
arXiv Detail & Related papers (2022-06-07T02:10:30Z) - Doubly Robust Distributionally Robust Off-Policy Evaluation and Learning [59.02006924867438]
Off-policy evaluation and learning (OPE/L) use offline observational data to make better decisions.
Recent work proposed distributionally robust OPE/L (DROPE/L) to remedy this, but the proposal relies on inverse-propensity weighting.
We propose the first DR algorithms for DROPE/L with KL-divergence uncertainty sets.
arXiv Detail & Related papers (2022-02-19T20:00:44Z) - Online Allocation with Two-sided Resource Constraints [44.5635910908944]
We consider an online allocation problem subject to lower and upper resource constraints, where the requests arrive sequentially.
We propose a new algorithm that obtains a $1-O(\frac{\epsilon}{\alpha-\epsilon})$ competitive ratio relative to the offline problem that knows the entire sequence of requests ahead of time.
arXiv Detail & Related papers (2021-12-28T02:21:06Z) - Dynamic Planning and Learning under Recovering Rewards [18.829837839926956]
We propose, construct and prove performance guarantees for a class of "Purely Periodic Policies".
Our framework and policy design may have the potential to be adapted into other offline planning and online learning applications.
arXiv Detail & Related papers (2021-06-28T15:40:07Z) - Online Sub-Sampling for Reinforcement Learning with General Function Approximation [111.01990889581243]
In this paper, we establish an efficient online sub-sampling framework that measures the information gain of data points collected by an RL algorithm.
For a value-based method with a complexity-bounded function class, we show that the policy only needs to be updated $\propto \operatorname{polylog}(K)$ times.
In contrast to existing approaches that update the policy at least $\Omega(K)$ times, our approach drastically reduces the number of optimization calls in solving for a policy.
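A doubling schedule (a schematic stand-in for the paper's information-gain trigger) shows how policy recomputations can scale as polylog(K) rather than K:

```python
import math

def lazy_update_schedule(K):
    """Re-solve for a new policy only when the dataset size has doubled:
    across K episodes this triggers O(log K) updates instead of K."""
    updates, next_update = [], 1
    for k in range(1, K + 1):
        if k >= next_update:
            updates.append(k)       # policy update happens here
            next_update = 2 * k     # wait until the data has doubled
    return updates

K = 1_000_000
schedule = lazy_update_schedule(K)
print(len(schedule), "updates over", K, "episodes; log2(K) =",
      round(math.log2(K), 1))
```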
arXiv Detail & Related papers (2021-06-14T07:36:25Z) - Optimal Off-Policy Evaluation from Multiple Logging Policies [77.62012545592233]
We study off-policy evaluation from multiple logging policies, each generating a dataset of fixed size, i.e., stratified sampling.
We find the OPE estimator for multiple loggers with minimum variance for any instance, i.e., the efficient one.
arXiv Detail & Related papers (2020-10-21T13:43:48Z) - Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies.
We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB).
arXiv Detail & Related papers (2020-06-17T22:32:19Z) - Hierarchical Adaptive Contextual Bandits for Resource Constraint based Recommendation [49.69139684065241]
Contextual multi-armed bandit (MAB) achieves cutting-edge performance on a variety of problems.
In this paper, we propose a hierarchical adaptive contextual bandit method (HATCH) to conduct the policy learning of contextual bandits with a budget constraint.
arXiv Detail & Related papers (2020-04-02T17:04:52Z) - Differentiable Bandit Exploration [38.81737411000074]
We learn such policies for an unknown distribution $\mathcal{P}$ using samples from $\mathcal{P}$.
Our approach is a form of meta-learning and exploits properties of $\mathcal{P}$ without making strong assumptions about its form.
arXiv Detail & Related papers (2020-02-17T05:07:35Z)